There has been a flurry of interest over the past couple of days in Judge Miller’s order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and fed only the post-filtering documents into their vendor’s predictive coding system, which seems to be a frequent practice [...] → Read More: What is the maximum recall in re Biomet?
Point estimates and lower-bound confidence estimates of the completeness (or recall) of an e-discovery production are calculated by sampling documents from both the production and the remainder of the collection (the null set). The most straightforward way to draw this sample is as a simple random sample (SRS) across the whole collection, produced and unproduced. However, [...] → Read More: Stratified sampling in e-discovery evaluation
Readers of Ralph Losey’s blog will know that he is an advocate of what he calls “multimodal” search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of [...] → Read More: Does automatic text classification work for low-prevalence topics?
A question that often comes up when discussing e-discovery validation protocols is, why should they be based on confidence intervals, rather than point estimates? That is, why do we say, for instance, “the production will be accepted if we have 95% confidence that its recall is greater than 60%”? Why not just say “the production [...] → Read More: Why confidence intervals in e-discovery validation?
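To make the contrast between a point estimate and a confidence bound concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that is not from the post: that we hold a simple random sample of relevant documents and count how many of them were produced, so that recall is a plain binomial proportion. The function names (`binom_tail_ge`, `recall_lower_bound`) are hypothetical, and the bound used is the one-sided exact (Clopper-Pearson) bound, chosen here for illustration only.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def recall_lower_bound(found, sampled, confidence=0.95):
    """One-sided exact (Clopper-Pearson) lower bound on a proportion.

    Assumption: `sampled` relevant documents were drawn as a simple
    random sample, and `found` of them were in the production.
    """
    if found == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    # Bisect for the p at which observing >= `found` hits has probability alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_tail_ge(found, sampled, mid) < alpha:
            lo = mid  # p too small to explain the data; bound lies higher
        else:
            hi = mid
    return lo


# Hypothetical numbers: 70 of 100 sampled relevant documents were produced.
point_estimate = 70 / 100
lower_bound = recall_lower_bound(70, 100)
print(f"point estimate: {point_estimate:.3f}, 95% lower bound: {lower_bound:.3f}")
```

With these hypothetical numbers the point estimate is 0.70, while the 95% lower bound is noticeably lower; a criterion such as "95% confidence that recall is greater than 60%" is a test on the second figure, not the first.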
As it is becoming apparent that, without drastic immediate action, we are going to significantly overshoot greenhouse gas emission targets and warm the planet by an environmentally disastrous 4 to 5 degrees centigrade by the end of the century, I thought I should fulfil my long-standing promise to myself and calculate the carbon emissions generated [...] → Read More: The environmental consequences of SIGIR
Hi FXPAL blogosphere. Among the odds and ends I do at FXPAL is helping people present their work with video. It also falls to me to archive the videos themselves. As I periodically move the video to new storage servers, I tend to look over “the old family album.” Our family is in the business [...] → Read More: Mining the Video Past of Future Research: Is it worth a look?
My last post introduced the idea of the satisfiability of a post-production quality assurance protocol. We said that such a protocol is not satisfiable for a given size of the sample from the unretrieved (or null) set if the protocol would fail the production even when the sample found no relevant documents. The reason [...] → Read More: Statistical power of E-discovery validation
In my last post, I examined the live-blog e-discovery production being performed by Ralph Losey, and asked what lower limit we could place on the recall of highly relevant documents with 95% confidence based on the final, quality assurance sample. The QA sample drew 1065 documents from the null set (that is, the set of [...] → Read More: Meaningful QA sample size in e-discovery
Those who are following Ralph Losey’s live-blogged production of material on involuntary termination from the EDRM Enron collection will know that he has reached what was to be the quality assurance step (though he has decided to do at least one more iteration of production for the sake of scientific verification). Quality assurance here involves [...] → Read More: Quality assurance samples and prior beliefs
In my last post, I discussed an experiment in which we had two assessors re-assess TREC Legal documents with less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable. Another natural question to ask of these results, though not one the experiment was directly designed to [...] → Read More: Do document reviewers need legal training?