How accurate can manual review be?

For me, one of the chief pleasures of this year's SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion. My own contribution was "Re-examining the Effectiveness of Manual Review". The paper was [...]

Assessor disagreement and court sanctions

I mentioned Cross and Kerksiek's suggestion of vocabulary discovery in my previous post. Their paper also contains an interesting reference to a case (Felman Products, Inc. v. Industrial Risk Insurers) in which the defendant was penalized for the carelessness of their production. The defendant inadvertently produced privileged documents, and sought to have them [...]

Corpus characterization in e-discovery

In e-discovery (document retrieval for civil litigation), one side has the documents and the other side proposes the query. This creates an information asymmetry: the requesting side cannot view the corpus to decide what keywords to use and what queries to propose, and opportunities for query iteration are limited, expensive, and liable to be contested.

What [...]



Harvard researcher and open-access advocate Aaron Swartz faces 35 years in jail for downloading 4.8 million articles from JSTOR. Still relaxed and comfortable about publishing in closed-access journals?
Correct spelling and grammar more important than positivity or negativity of product reviews — Panos Ipeirotis.
Fitting an elephant with four parameters.
Placebos as effective as real medicine in improving [...]

Multiple significance tests in IR

At the most recent TIGER reading group, Mark Sanderson presented Bland and Altman's introduction to multiple significance tests and the Bonferroni method. The basic point is simple: if you keep trying different experiments, and testing each for significance, then eventually you will find significance by chance, even where no real effect exists. Therefore, [...]
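To put a number on the basic point, here is a minimal sketch in Python. It assumes the tests are independent and that no real effect exists in any of them, which is a simplification; the function names are my own for illustration and are not taken from Bland and Altman's article.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance level under the Bonferroni method."""
    return alpha / num_tests


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 5, 20, 100):
        uncorrected = familywise_error_rate(alpha, m)
        corrected = familywise_error_rate(bonferroni_threshold(alpha, m), m)
        print(f"{m:>3} tests: uncorrected {uncorrected:.3f}, "
              f"Bonferroni {corrected:.3f}")
```

Under these assumptions, 20 tests at the 0.05 level give roughly a 64% chance of at least one spurious "significant" result, while testing each at 0.05 / 20 keeps the overall chance at or below 0.05.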

PhD Thesis

My PhD thesis was passed (actually a few months ago), and I've placed it online. The core of the research material has been published elsewhere, but there are a few updates:

Score standardization (Chapter 4): The chapter on standardization combines Score Standardization for Robust Comparison of Retrieval Systems (ADCS 2007) and Score Standardization for [...]

ESSIR 2011: Top-K reasoning for NOT attending the European Summer School on IR

ESSIR 2011 – European Summer School in Information Retrieval
29 Aug – 02 Sep 2011, Koblenz, Germany
http://essir.uni-koblenz.de/

Practical top-k reasoning in Information Retrieval: Top-10 reasons NOT to attend ESSIR 2011

If you don't want to meet auth…

Publicity about lecture webcasting

Interest in lecture webcasting has been picking up. As noted in earlier blog posts, TalkMiner has received some good publicity recently. Now, I am mentioned in a U.S. News & World Report article about the webcasting system I built at U.C. Berkeley.

How often should statistical significance occur?

Via Andrew Gelman and Howard Wainer, an interesting meta-analysis from 2005 by Pan, Trikalinos, Kavvoura, Lau, and Ioannidis (the last of "Why most published research findings are false" fame), comparing reported statistical significance and effect sizes in studies of genetic propensity to disease, between studies performed in mainland China and those performed elsewhere. There [...]

Renewing ACM

My ACM membership came due just recently. In the light of objections to their copyright policy, I seriously considered not renewing in protest. I agree with Panos that the ACM should not seek copyright from authors, unless the authors are actually being paid for their work. I'm particularly struck by Bob Carpenter's [...]