
Why 95% +/- 2% makes little sense for e-discovery certification


It is common in e-discovery protocols to see a requirement that the production be certified with a “95% +/- X%” sample (where “X%” takes on values such as “2%” or “5%”), leading to a required sample size being specified up front. (See, for instance, the ESI protocol that was recently debated in the ongoing Da Silva Moore case.) This approach, however, makes little sense, for two reasons. First, it specifies an accuracy in our measure, when what we want to specify is some minimal level of performance. And second, decisions about sample size and allocation should be delayed until after the (candidate) production is ready, when they can be made much more efficiently and effectively.

Before discussing those two assertions in more detail, let’s unpack what is going on with the “95% +/- 2%” specification. It says that we will draw a sample large enough that the 95% confidence interval is no wider than 4% end to end (+/- 2%, though in practice the interval is not always symmetric around the point estimate). For a simple random sample used to estimate the prevalence (or richness) of relevant documents in a document set, the maximum width of the interval can be determined from the sample size, and occurs when the point estimate is 50%. We are therefore saying that our widest interval will be [48%, 52%]. From this, we can figure out the sample size required: for +/- 2% and an exact binomial confidence interval, the sample size is 2,399.
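As a rough check on that arithmetic, here is a minimal sketch of the sample size calculation. It uses the textbook normal approximation rather than the exact binomial interval mentioned above, so it lands on 2,401 rather than 2,399; the function name `sample_size_for_margin` is my own, chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_margin(margin, confidence=0.95, prevalence=0.5):
    """Smallest n for which the normal-approximation interval is within +/- margin.

    Assumes a simple random sample and the normal approximation to the
    binomial; prevalence=0.5 is the worst case (widest interval).
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * prevalence * (1 - prevalence) / (margin * margin))


print(sample_size_for_margin(0.02))  # +/- 2%
print(sample_size_for_margin(0.05))  # +/- 5%
```

The two printed values are the sample sizes a protocol would implicitly be fixing by writing “95% +/- 2%” or “95% +/- 5%”, under the approximation stated.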

Prevalence is not what we’re really interested in with certification samples, however; rather, we care about recall. For estimating recall from a simple random sample, what matters is not the size of the full sample, but the number of relevant documents that happen to fall in the sample. We therefore need some basis for estimating corpus prevalence, such as the initial control sample (used internally by the producing party to guide their production). Or alternatively, we can keep sampling until we achieve the desired number of relevant documents. In the latter case, we end up with a slightly biased estimator; in the former, we can only offer a probabilistic guarantee on the maximum width (since our estimate of corpus prevalence might be off).
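To see how quickly this inflates the sample, the sketch below converts a target number of relevant documents into the total number of documents we would expect to have to sample, given an estimate of corpus prevalence. The function name and the prevalence values are illustrative assumptions, and the result is only an expectation: the count of relevant documents actually drawn will vary from sample to sample.

```python
import math


def total_sample_for_relevant(relevant_needed, prevalence):
    """Expected total sample size needed to draw `relevant_needed` relevant documents.

    Assumes a simple random sample and that `prevalence` (the fraction of
    relevant documents in the corpus) has been estimated elsewhere, for
    example from a control sample.
    """
    if not 0 < prevalence <= 1:
        raise ValueError("prevalence must be in (0, 1]")
    return math.ceil(relevant_needed / prevalence)


# Hypothetical prevalence values, using the 2,399 figure from above as the
# number of relevant documents wanted in the sample.
for prevalence in (0.5, 0.25, 0.05):
    print(prevalence, total_sample_for_relevant(2399, prevalence))
```

The lower the prevalence, the larger the total sample, which is why an unreliable prevalence estimate undermines any sample size fixed in advance.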

What is wrong with this approach? First of all, we are guaranteeing the wrong thing. The accuracy of our measure is only a means to the end: the end is certifying that the production meets some lower bound on recall (with some probabilistic degree of confidence). Undertaking to accurately measure production recall is of little value if we end up with a very accurate certification that we have a very lousy production. And even the accuracy that we are measuring is likely to be at the wrong point, since it is unlikely that 50% recall is going to be our minimum threshold on performance. Instead, we should be certifying something of the form of “95% confidence that production recall is at least 65%”, and designing our sampling strategy to maximize accuracy of measure at the specified recall threshold.
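A statement of that form can be checked with a one-sided lower confidence bound rather than a two-sided interval. The following is a minimal sketch, assuming recall is estimated from a simple random sample of relevant documents and using the exact (Clopper-Pearson) binomial bound found by bisection; the function names and the sample figures are hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def recall_lower_bound(found, relevant_in_sample, confidence=0.95):
    """Exact one-sided lower confidence bound on recall.

    `found` is the number of sampled relevant documents that are in the
    production; `relevant_in_sample` is the number of relevant documents
    in the sample. Direct summation is fine for samples of a few hundred.
    """
    if found == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    # The upper tail increases with p, so bisect for the p where it equals alpha.
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_upper_tail(found, relevant_in_sample, mid) < alpha:
            lo = mid  # seeing this many hits would be too unlikely; bound is higher
        else:
            hi = mid
    return lo


# Hypothetical sample: 200 relevant documents, 150 of them in the production.
bound = recall_lower_bound(150, 200)
print(round(bound, 3), bound >= 0.65)
```

The certification question then reduces to whether the bound clears the agreed threshold, not how narrow the interval around the point estimate happens to be.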

The second problem with the 95% +/- 2% approach is that it commits us to the details of a sample design before the production has been made, and even before review proper has commenced. We are designing blind, and therefore our design will be suboptimal. (In truth, even for the 95% +/- 2% case, we can’t make the decision completely blind, since as mentioned we need a reasonable estimate of corpus prevalence in order to select sample size. But even this information is likely to be unreliable before serious review has commenced.) Instead, we should wait until we know what the (candidate) production set will be, and also have a basis for estimating not just overall corpus prevalence, but prevalence in the production and the null sets. With this knowledge, we can design a much more efficient certification sampling protocol, one that requires far fewer documents to be sampled and annotated in order to achieve the same level of reliability. One approach to drawing such a sample is stratified sampling; and I hope shortly to give a worked example showing just how dramatic the savings can be.
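In the meantime, the basic arithmetic of a stratified recall estimate can be sketched as follows. This is not the worked example promised above, only an illustration of the idea: the production and null sets are sampled separately, each stratum's relevant count is scaled up to the stratum size, and recall is the production's share of the estimated relevant total. All names and figures are hypothetical, and the sketch gives a point estimate only, with no confidence bound.

```python
def stratified_recall(prod_size, prod_sample, prod_relevant,
                      null_size, null_sample, null_relevant):
    """Point estimate of recall from separate samples of production and null sets.

    Each *_size is the number of documents in the stratum, each *_sample the
    number sampled from it, and each *_relevant the number of sampled
    documents judged relevant.
    """
    # Scale each stratum's sample prevalence up to the stratum's size.
    est_relevant_prod = prod_size * prod_relevant / prod_sample
    est_relevant_null = null_size * null_relevant / null_sample
    total = est_relevant_prod + est_relevant_null
    if total == 0:
        raise ValueError("no relevant documents found in either sample")
    return est_relevant_prod / total


# Hypothetical figures: a 20,000-document production and a 180,000-document null set.
estimate = stratified_recall(20_000, 200, 160, 180_000, 600, 12)
print(round(estimate, 4))
```

Because the two strata are sampled independently, the sample can be allocated unevenly between them, which is where the efficiency gain over a single simple random sample comes from.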

What this means in practice is that a sample size cannot be specified in an ESI protocol, at least not one agreed between the parties prior to review (when of course it should be agreed). Instead, what should be specified (in combination with other considerations such as cost, proportionality, and quality of the production process itself) is that a certain level of recall will be achieved, and certified to a certain level of confidence, with the producing side undertaking to propose a certification sampling design to achieve this post-production. If the producing side baulks at this as too strong a commitment, then the next best step is that a fixed sample size be agreed to, but that the sample design itself wait to be agreed between the two parties based upon the evidence of the production. Selecting this sample size will depend on heuristics and experience; but then the 95% +/- 2% approach, for all its seeming exactness, involves committing the parties to an inefficient certification process, in exchange for the wrong guarantees.

