Rating

    No results.

Reverted indexes and true relevance feedback

GD Star Rating
loading...

I was fortunate enough at CIKM not only to meet the two bloggers cited in my thesis, namely Gene Golovchinsky and Jeremy Pickens of FXPAL (the latter now of Catalyst), but also to hear Jeremy present one of the more interesting papers of the conference, “Reverted Indexing for Feedback and Expansion” (co-authored with Gene and FXPAL’s Matthew Cooper; the author’s version is available for free download).

Gene gives an extended summary of the paper on the FXPAL blog. In brief, a reverted index maps from documents to the queries which retrieve those documents, whereas an inverted index maps from terms to the documents those terms occur in. Construction re-uses the infrastructure of the inverted index: each query is issued to the index, and result converted into a pseudo-document containing the docids of the top n = 1000 documents retrieved, with each docid repeated between 1 to 10 times, depending on its similarity score with the query. The resulting pseudo-corpus is then inverted, giving the document to query mapping.

Among the uses of a reverted index, and the use the authors put it to in their paper, is relevance feedback, where it identifies query concepts cognate to the feedback documents. The queries the authors use to construct the reverted index tested in their paper are simply the individual terms of the index vocabulary. Using this single-term reverted index, the authors achieve significantly improved effectiveness in true relevance feedback over strong baselines. In this post, I want to explore how this improved effectiveness is achieved.

The authors surmise that the reverted index is more effective because it suggests more selective expansion terms, and they reproduce example term sets as evidence. This explanation is convincing enough as far as it goes; but what is not explained is why the reverted index’s expansion terms are more selective. The reason is not obvious. A single-term reverted index is not much more than a weighted direct index, mapping from documents to the terms that occur in them. Nor do the resulting term weights favour more selectivity. The (1, 10) normalization applied during reversion effectively removes their original IDF factor; and the maximum DF for terms coming out of the reverted index is capped at the n = 1000 result cutoff. If anything, reverted weights seem likely to reduce selectivity.

The reason for the enhanced selectivity of the reverted index is not, I think, the way expansion term weights are derived. Rather, the key is the n=1000 cutoff for the input retrieval results. The cutoff excludes from the reverted index altogether the great majority of occurrences of highly frequent, underly selective terms. It is not that selective terms are given more weight in the expansion; it is that unselective terms are not considered for expansion at all. This is suggested by a count of the number of top-15 expansion terms with a DF above 1000 across a sample of queries for each retrieval method, given in the table below and derived from Table 1 of the paper. The reverted index (RevPL2) suggests almost no such terms, and most of the ones that it does suggest were in the original query.

  Method
Topic Id Summary KL Bo1 RevPL2
172 “quit smoking” 11 10 2
341 “airport security” 11 10 2
407 “poaching wildlife” 7 7 3
419 “recycle tires” 10 10 1

The above discussion is in no way a criticism of the paper; the reverted index is a very clever and interesting idea, and it certainly enables a highly effective form of true relevance feedback. I would, though, be interested to see a follow-up study (or hear of existing studies) that directly addressed the question of excluding high-frequency expansion terms based on a fixed cutoff, and compared it with the seemingly more general approach of enhancing the IDF component of expansion term weighting.

LinkedInShare

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>