# Repeated testing does not necessarily invalidate stopping decision


Thinking recently about the question of sequential testing bias in e-discovery, I’ve realized an important qualification to my previous post on the topic. While repeatedly testing an iteratively trained classifier against a target threshold will lead to optimistic bias in the final estimate of effectiveness, it does not necessarily lead to an optimistic bias in the stopping decision.

First, let me explain what I mean by “optimistic bias in the stopping decision” (as this is not standard statistical terminology). A stopping decision is invoked when a sample-based estimator says that some threshold condition has been passed. For instance, our stopping condition may be (1) that the point estimate of recall passes 80%, or (2) that our 95% lower confidence bound on recall passes 75%. What I am calling “optimistic bias in the stopping decision” is a situation where the stopping condition is “on average” violated: where the mean true recall at stopping is lower than 80% for the former stopping condition, or where true recall at stopping is below 75% more than 5% of the time for the latter. (What we are “averaging” across here is a notional repeated performance of an identical training and testing setup, but with different random samples each time.)
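To make the two stopping conditions concrete, here is a minimal sketch in Python. The function names are hypothetical, and the choice of a one-sided Wilson score interval for the lower bound is an assumption made for illustration; the discussion above does not depend on any particular interval method. “Passes” is read here as “greater than or equal to”.

```python
import math

Z_ONE_SIDED_95 = 1.645  # normal quantile for a one-sided 95% bound


def point_estimate(found, sample_size):
    """Estimated recall: relevant sample documents found / relevant sample documents."""
    return found / sample_size


def lower_bound_95(found, sample_size):
    """One-sided 95% lower bound on recall (Wilson score interval; an assumed method)."""
    z = Z_ONE_SIDED_95
    p = found / sample_size
    denom = 1 + z * z / sample_size
    centre = p + z * z / (2 * sample_size)
    margin = z * math.sqrt(p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2))
    return (centre - margin) / denom


def stop_on_point_estimate(found, sample_size, threshold=0.80):
    """Condition (1): the point estimate of recall passes the threshold."""
    return point_estimate(found, sample_size) >= threshold


def stop_on_lower_bound(found, sample_size, threshold=0.75):
    """Condition (2): the 95% lower confidence bound passes the threshold."""
    return lower_bound_95(found, sample_size) >= threshold
```

With 80 of 100 relevant sample documents found, condition (1) is met but condition (2) is not; with 85 of 100, both are met. Either way, the decision is driven by the sample, not by the true recall, which is where the possibility of bias comes from.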

Now, whether there is an optimistic bias in the stopping decision depends upon how often and when we test (as well as whether the test set is fixed, growing, or being replaced at each test, and also the shape of the classifier’s true learning curve). To take an obvious extreme, we will never stop optimistically if we don’t start testing until after true classifier effectiveness is greater than the threshold. And, almost as obviously, we won’t stop optimistically on average if our tests are very far apart, with the first test made when the classifier is well below target effectiveness (in which case the probability of “accidentally” passing the threshold through a misleadingly favorable sample is very small), and the next made after true effectiveness has passed the target threshold.

It is also true (though it takes some thinking to see it) that we will not necessarily stop optimistically on average where tests are far apart, even if the first test is made when true effectiveness is just below the threshold. Say that this first test falls at classifier recall of 79%, and that there is a 49% chance of (erroneously) passing, whereas the second test falls at classifier recall of 90%, and (to keep things simple) will always pass. Then our expected recall at stopping is $0.79 \times 0.49 + 0.9 \times (1 - 0.49) \approx 0.846$, which is above our threshold.
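The same calculation extends to any number of tests: weight the true recall at each test by the probability that it is the first test to pass. The following is a small sketch of that idea; the function name and argument layout are illustrative, not taken from any library.

```python
def expected_recall_at_stop(recalls, pass_probs):
    """Expected true recall when we stop at the first test that passes.

    recalls[i] is the true recall at test i; pass_probs[i] is the chance
    that the sample-based estimate passes the threshold at that test.
    The final test is assumed to always pass (probability 1.0).
    """
    expected = 0.0
    still_running = 1.0  # probability that no earlier test has passed
    for recall, p_pass in zip(recalls, pass_probs):
        expected += still_running * p_pass * recall
        still_running *= 1.0 - p_pass
    return expected


# The two-test example from the text: 79% recall with a 49% chance of
# passing, then 90% recall with a certain pass.
example = expected_recall_at_stop([0.79, 0.90], [0.49, 1.0])
print(round(example, 4))
```

Changing the first pass probability or the recall at the second test shows how quickly the expected recall at stopping moves above or below an 80% threshold.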

We can get an intuition for this effect graphically. As previously, I assume an unrealistic linear learning curve, and that errors are normally distributed and independent between tests (the latter of which would occur only if a new sample were drawn for each test). This time, though, I do derive the standard error from the true recall Z, using the standard binomial formula of $\sqrt{Z (1 - Z) / (n - 1)}$, where the sample size of relevant documents $n = 100$. (I show the inter-quartile and 95% ranges of the sampling distribution.)

Train to estimated 72.5% recall, testing every 25 documents

Train to estimated 72.5% recall, testing every 100 documents

Train to estimated 72.5% recall, testing every 200 documents

Three testing scenarios are shown, one in which we test every 25 training documents (top); a second where we test every 100 training documents (middle); and a third where we test every 200 training documents (bottom). In each case, the stopping criterion is an estimated recall of 72.5%. The more frequently we test, the more likely we are to see an above-threshold estimated recall before actual recall has reached threshold. If we test infrequently enough, though, actual recall will on average be above threshold when we stop, as can be seen from the below table:

| Test frequency (training documents between tests) | Mean achieved recall |
|---|---|
| 25 | 69.4% |
| 50 | 71.5% |
| 100 | 73.9% |
| 200 | 77.2% |
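A simulation along these lines can be sketched in a few lines of Python. The stopping threshold, sample size, standard error formula, and independent normal errors follow the description above; the starting recall and the slope of the linear learning curve are assumptions chosen for illustration, so the output should not be expected to reproduce the table exactly.

```python
import math
import random

THRESHOLD = 0.725    # stop when estimated recall reaches this value
SAMPLE_SIZE = 100    # relevant documents in each (fresh) test sample
START_RECALL = 0.50  # ASSUMPTION: true recall before any training
SLOPE = 0.0005       # ASSUMPTION: true recall gained per training document


def true_recall(n_trained):
    """Hypothetical linear learning curve, capped at 1.0."""
    return min(1.0, START_RECALL + SLOPE * n_trained)


def recall_at_stop(test_interval, rng):
    """Train until a noisy estimate reaches THRESHOLD; return the true recall then."""
    n_trained = 0
    while True:
        n_trained += test_interval
        z = true_recall(n_trained)
        std_err = math.sqrt(z * (1 - z) / (SAMPLE_SIZE - 1))
        estimate = rng.gauss(z, std_err)  # independent error at every test
        if estimate >= THRESHOLD:
            return z


def mean_recall_at_stop(test_interval, trials=20000, seed=0):
    """Average true recall at stopping over many simulated runs."""
    rng = random.Random(seed)
    total = sum(recall_at_stop(test_interval, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for interval in (25, 50, 100, 200):
        print(interval, round(mean_recall_at_stop(interval), 3))
```

Under these assumptions the mean true recall at stopping rises as the test interval grows, falling below the 72.5% threshold for the most frequent testing and above it for the least frequent.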

It is far from straightforward, however, to say for any particular testing setup whether it has an optimistic decision bias. Sample size, design, and frequency matter, but so too does the shape of the true learning curve around the stopping threshold. If the learning curve is rising steeply (that is, the classifier is improving quickly with additional training data) around the recall threshold, then optimistic decision bias is less likely; if the learning curve is nearing its plateau, however, optimistic decision bias becomes more likely.

Nevertheless, we can make two general points.

1. Repeated testing does not necessarily invalidate an e-discovery testing regime.
2. The less frequently testing is performed, the less likely there is to be an optimistic decision bias.