The session track of TREC

This year’s TREC conference had several interesting sessions, and not the least interesting of them were the planning sessions for next year’s tracks. The design of a collaborative retrieval task, and of the methods and measures for evaluating such a task, can provoke a more wide-ranging, philosophical discussion than the presentation of retrieval results, or the description of a narrowly technical research outcome.

One such interesting design discussion occurred in the planning meeting for the Session Track, which is entering its second year at TREC, under the direction of Ben Carterette and Evangelos Kanoulas (EDIT 20/12/2010: and Paul Clough and Mark Sanderson). As Ben and Evangelos expressed it in their overview, evaluation in IR is bifurcated into two extremes: the once-off, contextless, test collection evaluation of the Cranfield tradition, narrow but repeatable; and the open, flexible, but expensive and non-repeatable evaluation of user studies. The goal of the Session Track is to bridge the system–user gap, starting in small steps from the Cranfield end of the spectrum. Specifically, the track aims to evaluate retrieval in response not simply to an ad-hoc, contextless user query, but to a query that takes place in the context of a session; and, what is more difficult, to design the evaluation and testset in a way that is repeatable and reusable.

The approach in the first year was to provide participants with not just a single query for each information need, but also with a follow-up reformulation of that query: a generalization, a specialization, or else a parallel or drifting reformulation. The track participant is then to produce three result sets: a response to the original query; a response to the reformulated query in isolation; and then a response to the reformulated query in light of the original. Evaluation similarly involves assessing the two runs independently, and then in combination. Combined evaluation uses a metric such as session nDCG, which assesses the second run in terms of what it adds to the results of the first one.
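To make the idea of a combined, session-level score concrete, here is a minimal sketch of a session-discounted DCG. It is not the track's official metric: the function name `session_dcg`, the log bases, the cutoff, and the omission of normalization against an ideal session are all assumptions made for illustration. Each query's ranking is scored with an ordinary rank discount, and later queries in the session are further discounted by their position.

```python
import math

def session_dcg(ranked_lists, relevance, cutoff=10, query_base=4.0, rank_base=2.0):
    """Illustrative session-discounted DCG (not the official track metric).

    ranked_lists: one ranking (list of doc ids) per query, in session order.
    relevance:    dict mapping doc id -> graded relevance (missing = 0).
    The log bases and cutoff are assumed values, chosen only for the sketch.
    """
    total = 0.0
    for q_index, ranking in enumerate(ranked_lists, start=1):
        # Later queries in the session count for less: 1 + log_b(q).
        query_discount = 1.0 + math.log(q_index, query_base)
        for rank, doc in enumerate(ranking[:cutoff], start=1):
            gain = relevance.get(doc, 0)
            # Standard DCG rank discount: log_b(rank + 1).
            rank_discount = math.log(rank + 1, rank_base)
            total += gain / (rank_discount * query_discount)
    return total

# Hypothetical two-query session with made-up judgments.
judgments = {"d1": 2, "d3": 1}
score = session_dcg([["d1", "d2"], ["d3", "d1"]], judgments)
```

Under this sketch the second ranking is treated as a continuation of the first: a relevant document in the second list still earns credit even if it already appeared in the first, only discounted for arriving later in the session.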

As was pointed out at the results summary session for the track, this evaluation methodology has some problems. One problem was observed in the run report of my colleague at the University of Melbourne, Vo Ngoc Anh. In developing his run (which I was not involved with), he hypothesized that the user reformulated their query because they were dissatisfied with the results of the original formulation. Therefore, in Anh’s submission, documents returned in the first result (including below the top ten) were actively deprecated in the second. As it turned out, Anh’s first run was at or near the strongest among the participants, but his second and combined runs demonstrated a sharp drop in effectiveness. In the event, not only the assessment of relevance, but also the construction of the session nDCG metric, ran precisely contrary to his assumptions: assessor and metric evaluated the second run not as an alternative to, but a continuation of the first. But if that were the case, why would the user reformulate?
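The "dissatisfied user" hypothesis can be illustrated with a toy reranker. This is not Anh's actual implementation, whose details are not given here; the function name `deprecate_seen` and the choice to push repeated documents to the end of the list (rather than, say, down-weighting their scores) are assumptions for the sake of the example.

```python
def deprecate_seen(second_ranking, first_ranking):
    """Toy illustration: demote documents already shown for the first query.

    Documents new to the session keep their relative order at the top;
    documents that appeared anywhere in the first ranking are moved to the
    bottom, also in their original relative order.
    """
    seen = set(first_ranking)
    fresh = [doc for doc in second_ranking if doc not in seen]
    repeated = [doc for doc in second_ranking if doc in seen]
    return fresh + repeated

# Hypothetical document ids.
reranked = deprecate_seen(["a", "b", "c"], ["b", "x"])
```

If the assessors and the metric still credit relevant documents that were already returned for the first query, a strong first run makes this demotion costly: the better the first ranking, the more relevant documents get pushed down in the second.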

Contradictions in the treatment of the second response are emblematic of a broader problem with the design of the track as it stands. This problem is that the reformulation of the second query is independent of the results retrieved to the first. The response to the first query could be gibberish, or it could be an excellent answer to that first query; the retrieval system could say “I’m a teapot, I’m a teapot, I’m a teapot”, over and over; it could even provide in advance an oracular response to the query’s reformulation; no matter: the reformulated query will be asked regardless.

Two solutions to the problem of an invariant reformulated query were proposed. The first was a rather ambitious plan to calculate a probability distribution over possible reformulations given the response to the original. The second solution was much more straightforward: capture original query, response, and reformulation, then have the test system respond to the reformulation in the context of the original query and its canned response. This method can readily be extended to requiring a response to the nth query given queries 1 through n – 1 and their responses.
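The second proposal amounts to a simple data layout, sketched below. The names `Interaction` and `task_for_nth_query` are hypothetical, not part of any track specification: a recorded session is a list of queries with their canned responses, and the task for the nth query hands the system the first n – 1 interactions plus the nth query text, withholding the recorded response to that query.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Interaction:
    """One recorded step of a session: a query and its canned response."""
    query: str
    results: Tuple[str, ...]  # doc ids actually shown to the user

def task_for_nth_query(log: List[Interaction], n: int) -> Tuple[List[Interaction], str]:
    """Return (history, current_query) for the nth query (1-indexed).

    history holds queries 1..n-1 with their recorded responses; the recorded
    response to query n is deliberately left out, since producing a response
    to it is the system's job.
    """
    if not 1 <= n <= len(log):
        raise ValueError(f"n must be between 1 and {len(log)}")
    return log[: n - 1], log[n - 1].query

# Hypothetical session log with made-up queries and doc ids.
log = [
    Interaction("jaguar", ("d1", "d2")),
    Interaction("jaguar car", ("d3", "d4")),
    Interaction("jaguar xk price", ("d5",)),
]
history, current = task_for_nth_query(log, 3)
```

Because the history is fixed in advance, every participating system sees the same context, which is what keeps the setup repeatable.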

More interesting than the particular proposals for the next iteration of the track, though, was the debate over the value of the track itself. Two main objections were raised by the audience. The first was that even with the addition of the context of original query and response, the setup failed to capture enough of the richness of a true user’s context; and, by extension, that richer retrieval environments simply couldn’t be captured by a test collection. The second objection was that the Session Track was trying to solve, with very limited resources and data, a problem that search engines were tackling with far more resources and data, and presumably with some success.

Despite the force of these objections, the Session Track still seems to me an interesting and worthwhile project. Evaluation by test collection is deeply ingrained in the community, for both good and not so good reasons, and as a result it is frequently the case that the test set that frames a problem comes first, while serious consideration of the problem follows later. An example of this is the recent, and overdue, interest in result diversification. The idea that the best response to an ambiguous query is a diversity of results was pointed out long ago, and is obvious enough once raised; but it took the introduction of the Diversity Task of the Web Track and its data set to really concentrate the community’s attention on the problem. Even so seemingly small a change in evaluation setup has led to a fruitful re-evaluation of a range of existing questions in IR, from evaluation metrics to the use of pseudo-relevance feedback, from query difficulty prediction to topic and sense identification. (Indeed, a good way of thinking of new research ideas is to consider, “what existing results need to be reconsidered in the context of query diversity?”) In addition, history has shown that well-formed test collections are employed for an enormous variety of tasks beyond those for which they were originally designed. The Session Track, however limited its aims may seem in anticipation, and however difficult its task of capturing user context, has the potential for a similarly profound impact on the field.

