
Non-authoritative relevance coding degrades classifier accuracy


There has been considerable attention paid to the high level of disagreement between assessors on the relevance of documents, not least on this blog. This level of disagreement has been cited to argue in favour of the use of automated text analytics (or predictive coding) in e-discovery: not only do humans make mistakes, but they may make as many mistakes as, or more than, automated systems do. But automated systems are only as good as the data used to train them, and production managers have an important choice to make in generating this training data. Should training annotations be performed by an expert, but expensive, senior attorney? Or can they be farmed out to the less expensive, but possibly less reliable, contract attorneys typically used for manual review? This choice comes down to a trade-off between cost and reliability—though ultimately reliability itself can be (at least partly) reduced to cost, too. The cost question still needs to be addressed; but Jeremy Pickens (of Catalyst) and I have made a start on the question of reliability in our recent SIGIR paper, Assessor Disagreement and Text Classifier Accuracy.

The basic question that Jeremy and I ask is the following. We have two assessors available, one whose conception of relevance is authoritative, and another whose conception is not (one could think of these as the senior and the contract attorney respectively, though we are addressing primarily the question of assessor disagreement, not expertise). The effectiveness of the final production is to be measured against the authoritative conception of relevance. The predictive coding system (or text classifier) can be trained on annotations either from the authoritative or from the non-authoritative assessor. What (if any) loss in classifier effectiveness results from training with the non-authoritative assessor, compared to training with the authoritative one?
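The experimental setup can be sketched in a few lines of scikit-learn. This is purely illustrative—the documents, labels, and classifier below are invented, not the paper's actual data or code—but it shows the key design: two sets of training labels, one evaluation standard.

```python
# Illustrative sketch (not the paper's code): train the same classifier on
# authoritative vs. non-authoritative labels, score both against the
# authoritative conception of relevance. All data here is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train_docs = ["contract dispute over payment", "weekly staff lunch menu",
              "payment schedule amendment", "office holiday party plans"]
test_docs  = ["amended payment terms draft", "cafeteria menu update"]

auth_train     = [1, 0, 1, 0]  # authoritative assessor's training labels
non_auth_train = [1, 0, 0, 0]  # non-authoritative assessor disagrees on doc 3
auth_test      = [1, 0]        # effectiveness is always judged against these

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

def trained_f1(labels):
    """F1 against authoritative test labels, for a given set of training labels."""
    clf = LogisticRegression().fit(X_train, labels)
    return f1_score(auth_test, clf.predict(X_test), zero_division=0)

print("self-classification F1: ", trained_f1(auth_train))
print("cross-classification F1:", trained_f1(non_auth_train))
```

In the paper's jargon, the first run is "self-classification" (authoritative training, authoritative evaluation) and the second is "cross-classification" (non-authoritative training, authoritative evaluation).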

Working with TREC (non-ediscovery) data that has been used in past research on assessor disagreement, we find that training with the non-authoritative assessor does indeed lead to lower effectiveness than training with the authoritative one. For the dataset examined, binary F1 score was 0.629 with authoritative training, against 0.456 with non-authoritative training. The losses of effectiveness from assessor disagreement, and from using a machine classifier rather than the assessor directly, are additive: annotation by the non-authoritative assessor followed by automated review leads to lower effectiveness than either non-authoritative human review, or automated review with authoritative training examples (though the additive degradation is not quite as great as the level of human disagreement would predict, based on randomization experiments).
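For readers unfamiliar with the metric: binary F1 here is the harmonic mean of precision and recall, computed by treating the authoritative assessor's labels as ground truth. A minimal implementation (with invented labels for illustration):

```python
# Minimal binary F1, treating the authoritative labels as ground truth.
# The label sequences below are invented for illustration.
def binary_f1(truth, predicted):
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

authoritative = [1, 1, 0, 1, 0, 0, 1, 0]
other         = [1, 0, 0, 1, 1, 0, 1, 0]  # e.g. a second assessor, or a classifier

print(binary_f1(authoritative, other))  # 0.75
```

The same function measures either inter-assessor agreement (second argument is another assessor's labels) or classifier effectiveness (second argument is the classifier's predictions).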

What does this mean in terms of overall production effort? A common pattern in e-discovery is to have two rounds of human involvement: one to train the classifier; another to review the classifier's positive predictions before production. Poor performance by the classifier can therefore (to some extent) be compensated for by extending the review further down the prediction ranking that the classifier produces, to bring recall up to the required level; in this way, poor reliability is converted into extra cost. Our experiments found that the increase in review depth needed to achieve a recall target of 75%, when using a non-authoritative trainer, was on average 24%; but for one in eight tasks, the required review depth doubled.
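The depth calculation itself is straightforward: walk down the classifier's ranking, accumulating relevant documents (as judged by the authoritative standard), until the recall target is met. A sketch, with made-up rankings:

```python
# Illustrative sketch (not from the paper): given documents in classifier
# ranking order, find the review depth needed to reach a recall target
# against the authoritative labels. Rankings below are invented.
def depth_for_recall(ranked_labels, target=0.75):
    """Smallest review depth whose cumulative recall meets the target."""
    total_relevant = sum(ranked_labels)
    found = 0
    for depth, label in enumerate(ranked_labels, start=1):
        found += label
        if found / total_relevant >= target:
            return depth
    return len(ranked_labels)

# 1 = relevant per the authoritative assessor; order = classifier's ranking.
# A better-trained classifier concentrates relevant documents near the top.
good_ranking = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
poor_ranking = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

print(depth_for_recall(good_ranking))  # 3: 75% recall after 3 documents
print(depth_for_recall(poor_ranking))  # 6: twice the review effort
```

The worse the ranking produced by a poorly trained classifier, the deeper the review must go, which is exactly how unreliable training labels turn into reviewer hours.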

These results do not directly answer the question of cost. One may come out ahead overall even with the deeper review, due to the savings from using a cheaper trainer—though this largely assumes that the reviewers are also cheaper contract attorneys (which rather raises the question of the reliability of the review itself). Also, the experimental data and setup are more than a little removed from those of an actual e-discovery production, so the precise numerical results may not be directly translatable. Nevertheless, we have demonstrated that who does the training can have a substantial effect on how reliable the (machine) trainee is. As other commentators have urged, consideration of these human factors needs to be a central part of ESI protocol design.

As an appendix, I provide a couple of figures that did not make it into the published paper, but which speak to the relative reliability of non-authoritative training and automated review.

F1 of inter-assessor agreement compared with non-authoritative training

The first of these figures, above, shows the F1 score of inter-assessor agreement (on the X axis), compared to the F1 score of the classifier when trained using the non-authoritative trainer ("cross-classification", in the jargon of the paper, on the Y axis; this is equivalent to the Y axis of Figure 2 of the paper). Each data point represents one of the 2 alternative trainers on each of the 39 topics included in our experiments. We can see that the classifier with non-authoritative training agrees with the authoritative assessor substantially less than the non-authoritative assessor alone does.

F1 of inter-assessor agreement compared to F1 of classifier trained by authoritative assessor.

The second of these figures, above, compares the agreement with the authoritative assessor of the non-authoritative assessor (on the Y axis), and of the classifier trained by the authoritative assessor (“self-classification” in our jargon, on the X axis). This figure is essentially asking the same question as Grossman and Cormack, namely whether predictive coding (with an authoritative annotator) is as reliable as (non-authoritative) human review; and, as with Grossman and Cormack, we find (as far as our dataset is able to answer this question) that the answer is, on average, yes, though with enormous variation between topics and assessors. Indeed, the above figure may understate the accuracy of the machine classifier, since training set size was limited to the relatively small amount available in the dataset, and substantially larger training data sets would be standard in e-discovery—though again many other aspects of the data and process would be different as well.
