Assessor error in legal retrieval evaluation

Another year, another CIKM. This marks my first post-PhD publication (I finally submitted!), and it also marks a new sub-genre of retrieval evaluation for me: that of legal retrieval, or more specifically e-discovery. Discovery is a process in which party A produces for party B all documents in party A's possession that are responsive to a request or series of requests lodged by party B. These requests most saliently take place during civil litigation, where party B is gathering evidence to sue party A, but they can also occur as part of governmental regulatory oversight, and I guess more broadly whenever one party has the legal right to access information in the possession of a second, potentially adversarial party. E-discovery is the extension of the discovery process to electronically stored information, which brings with it, on the one hand, a greatly increased volume of documents, but, in compensation, the potential to use automatic tools as part of the production process.

Retrieval in e-discovery is very different from retrieval through web search. Web searches are generally the work of a single person; they are frequently ephemeral, address an information need that is hardly ever explicated and is often inchoate in the user themselves, and end in the subjective satisfaction or frustration of the user. E-discovery, on the other hand, is a process involving dozens of experts working for several interested or supervisory parties, taking place over many weeks, frequently costing millions of dollars, and performed in an explicit, negotiated, and documented way. Additionally, while the web searcher is typically satisfied with one or a handful of documents providing the particular information they were after, e-discovery aims to find (within reason) all documents that are related to a case. In the traditional jargon of retrieval evaluation, web search is precision-centric, e-discovery recall-centric.

E-discovery, then, is like the mediated, formalized, “serious” information retrieval of the good old days writ large. And besides its attractions to the nostalgic, e-discovery is a very sympathetic field for investigations into retrieval evaluation. Whereas success in web search is subjective, neither formally measured by the user nor definitively observed by the search engine, in e-discovery, measuring the success of the retrieval process is an integral part of the process itself, one that is being increasingly stressed by case law. Therefore, the techniques developed in the experimental, collaborative, or laboratory evaluation of e-discovery should inform the quality assurance and certification methods of e-discovery practice.

The contribution we are offering at CIKM this year concerns measuring assessor error, particularly in the Legal Track of TREC. An e-discovery process is supervised by a senior attorney, whose conception of relevance is authoritative. The role of the senior attorney is played in the track’s interactive task by a topic authority (TA), themselves a practicing attorney in real life. Actual relevance assessment is performed by multiple volunteer assessors, but the assessors’ role is to apply the TA’s conception of relevance, not their own. Therefore, whereas other evaluation fields talk of assessor disagreement, in the Legal Track we can talk of assessor error. Moreover, this assessor error is directly measurable, by referring assessor results for adjudication by the TA, something which is currently done via a participant-instigated appeals process. And in an environment where we care about absolute, rather than merely relative, measures of system performance, and particularly of system recall, assessor error can be seriously distorting to our evaluation outcomes — much more so, in general, than the better-understood effects of sampling error.
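To see how distorting assessor error can be to absolute measures, consider a toy calculation (the numbers here are purely illustrative, not drawn from the track): even small per-document error rates inflate the apparent size of the relevant population, and hence corrupt any recall figure whose denominator depends on it.

```python
# Illustrative figures only: a large collection with a small truly
# relevant population, assessed by fallible (but not bad) assessors.
N = 1_000_000   # documents in the collection
R = 10_000      # documents truly relevant (by the TA's conception)
fn = 0.20       # false-negative rate: relevant docs marked irrelevant
fp = 0.02       # false-positive rate: irrelevant docs marked relevant

# Expected number of documents the assessors will mark relevant:
observed_R = R * (1 - fn) + (N - R) * fp
print(observed_R)  # 27800.0 -- nearly triple the true figure
```

The asymmetry is the point: because irrelevant documents vastly outnumber relevant ones, a mere 2% false-positive rate contributes twice as many spurious “relevant” documents as there are true ones, swamping the effect of the false negatives.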

The presence of the TA, the parallel assessment of randomly partitioned document bins by different assessors, and the outcomes of the appeals process, all produce what seems like a wealth of evidence for measuring and correcting assessor error. As we show in our paper, however, none of this evidence is in a form that is directly useful to us in correcting our measures of absolute performance (or at least we can’t see how it is; you are more than welcome to try your hand at it yourself). Instead, we propose that a double sampling approach be used, the theory of which we describe, and indicative results of which we produce — empirical work remains to be done on the application of double sampling to a complete retrieval evaluation.
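The flavour of the double sampling idea can be sketched in a toy simulation (this is my illustrative sketch, not the estimator or data from our paper: the error rates, sample sizes, and the simple ratio correction are all assumptions). Phase one has fallible assessors label every document; phase two has the TA adjudicate a small random subsample of each label class, from which we estimate how trustworthy each class of label is and correct the raw count of relevant documents.

```python
import random

random.seed(1)

N, R = 100_000, 5_000   # collection size; truly relevant documents
fn, fp = 0.20, 0.02     # assumed assessor false-negative / false-positive rates

# Simulate fallible first-phase labels against the (normally unknown) truth.
docs = []
for i in range(N):
    truth = i < R
    label = (random.random() > fn) if truth else (random.random() < fp)
    docs.append((label, truth))

pos = [d for d in docs if d[0]]       # labelled relevant
neg = [d for d in docs if not d[0]]   # labelled irrelevant

# Second phase: the TA adjudicates a random subsample of each label class.
k = 1_000
adj_pos = random.sample(pos, k)
adj_neg = random.sample(neg, k)
p_pos = sum(t for _, t in adj_pos) / k  # P(truly relevant | labelled relevant)
p_neg = sum(t for _, t in adj_neg) / k  # P(truly relevant | labelled irrelevant)

naive = len(pos)                                   # uncorrected estimate
corrected = len(pos) * p_pos + len(neg) * p_neg    # double-sampling correction
print(naive, round(corrected), R)
```

On a run like this the naive count lands near 5,900 (true relevant found plus false positives), while the corrected estimate falls close to the true 5,000, at the price of some second-phase sampling variance; trading that variance against TA adjudication effort is exactly where the remaining empirical work lies.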

On a more general note, it has become customary for me when describing conference papers to animadvert on the bizarreness of conferences as a mechanism for transacting academic business in the internet age. The death-march grind of CIKM does not begin till tomorrow, so I should keep my powder dry. I will only say at this point that while Toronto seems on brief inspection a very attractive city, no-one should have to go through Los Angeles airport in order to have their research ideas heard.

