Reverted indexes considered further

Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first is that (1,10) normalization strips term occurrences of their IDF factor. The second is that the cutoff depth used in creating the reverted corpus sets an upper bound on reverted term DF.

Let me say immediately that my second assertion is wrong, and was due to confusion on my part: cutoff depth constrains reverted document length, not reverted DF; the latter depends (as Gene points out) on how many terms a document is highly retrievable by.

On the first point, I only concede partially. Gene observes (and Jeremy makes a similar point):

In general, basis queries that have a high probability/score for retrieving a particular docid are preferred. But if those basis queries retrieve a lot of other documents as well, the relative contribution of that score is lowered (due to a higher reverted document length).

The number of documents returned by a basis query, though, is limited to the cutoff. The IDF factor of very rare terms is retained, but there is no distinction in IDF between moderately rare terms (say, 1000 < DF < 10000) and frequent ones.

If we put together both the cutoff selection effect I mention in my post and the DF effect Gene and Jeremy describe above, we get a bipartite retention of the original DF factor, one whose behaviour depends on the cutoff depth n. Terms with DF less than n have their DF factor carried through in the reverted document length, making the term seem less exclusively “about” any given document it retrieves; terms with DF greater than n have their DF factor retained through a decreasing probability that they will be included at all in the reverted index entries for the documents they occur in.
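A toy sketch may make the two regimes concrete. Everything here is hypothetical — the data, names, and the simplification of each basis query's scoring to a raw per-posting score — and is not the actual reverted-indexing implementation:

```python
def build_reverted_corpus(inverted, cutoff):
    """Run each term as a one-term basis query and keep its top-`cutoff`
    results; each term then yields one 'reverted document' whose tokens
    are the docids it retrieves."""
    reverted = {}
    for term, postings in inverted.items():
        top = sorted(postings.items(), key=lambda p: p[1], reverse=True)
        reverted[term] = dict(top[:cutoff])
    return reverted

# "rare" has DF = 2 < cutoff: its full DF survives as reverted document
# length, diluting the weight of each docid it retrieves.
# "frequent" has DF = 4 > cutoff: its length is capped at the cutoff,
# and its lowest-scoring document (d4) is excluded altogether.
inverted = {
    "rare":     {"d1": 5, "d2": 3},
    "frequent": {"d1": 2, "d2": 4, "d3": 6, "d4": 1},
}
rev = build_reverted_corpus(inverted, cutoff=3)
rev_len = {term: len(docs) for term, docs in rev.items()}

# Reverted DF of a docid: how many basis queries retrieve it.
def reverted_df(docid):
    return sum(docid in docs for docs in rev.values())
```

Here `rev_len` comes out as `{"rare": 2, "frequent": 3}`, while `reverted_df("d1")` is 2 and `reverted_df("d4")` is 0 — the two DF regimes described above.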

Gene also elaborates on what IDF means in the reverted index. Reverted IDF, roughly speaking, penalizes longer documents in the selection of expansion terms; and, as Gene observes, long documents are likely to be less focused. This effect does indeed seem calculated to improve the effectiveness of the expansion, although whether it would lead directly to the use of more selective terms is less clear.
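As a rough illustration of that reverted-IDF effect (toy data and names of my own invention, not from the original papers): a long, unfocused document is highly retrievable by many basis queries, so its docid token occurs in many reverted documents, giving it a high reverted DF and hence a low reverted IDF.

```python
import math

def reverted_idf(reverted, docid):
    """IDF of a docid token over the reverted corpus: log of the number
    of reverted documents divided by the number containing the docid."""
    df = sum(docid in docs for docs in reverted.values())
    return math.log(len(reverted) / df)

# d_long (a long, unfocused document) is retrieved by every basis
# query; d_short is retrieved by only one.
reverted = {
    "t1": {"d_long": 1, "d_short": 2},
    "t2": {"d_long": 1},
    "t3": {"d_long": 1},
    "t4": {"d_long": 1},
}
print(reverted_idf(reverted, "d_long"))   # log(4/4) = 0.0
print(reverted_idf(reverted, "d_short"))  # log(4/1) ≈ 1.386
```

Expansion terms drawn from d_long are accordingly down-weighted relative to those drawn from d_short.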

I’m going to start tying myself in knots if I go too much further with this. Clearly, feedback using a reverted index does favour more selective expansion terms, and empirical results show that this leads to more effective retrieval (as measured on the TREC AdHoc collections — whether result diversity would be harmed is another question). I would, though, like to see a more mathematical analysis of what happens to term occurrence weights as they go through the reversion process. The goal should be to produce an expansion term selection and weighting scheme that could be applied (however inefficiently) without having to perform the manipulations of reversion. Perhaps a starting point would be to apply no cutoff to the retrieval used to create the reverted index — that is, convert each term’s full inverted list into a reverted document — and consider what reversion does to scores in this case.
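That no-cutoff thought experiment can itself be sketched in toy form (again with hypothetical data and a raw per-posting score in place of a real retrieval function): with no cutoff, each reverted document is exactly the term's full inverted list, so reverted document length equals the term's original DF, and length normalization divides each occurrence's weight by that DF.

```python
def revert_without_cutoff(inverted):
    """No cutoff: each term's entire inverted list becomes one reverted
    document, whose length is the term's original DF."""
    return {term: dict(postings) for term, postings in inverted.items()}

def length_normalize(reverted):
    """Divide each docid's score by its reverted document's length
    (= the basis query's DF): the 1/DF factor carried through reversion."""
    return {term: {doc: s / len(docs) for doc, s in docs.items()}
            for term, docs in reverted.items()}

inverted = {
    "rare":     {"d1": 5, "d2": 3},
    "frequent": {"d1": 2, "d2": 4, "d3": 6, "d4": 1},
}
norm = length_normalize(revert_without_cutoff(inverted))
# d1's weight under "rare" is 5/2 = 2.5; under "frequent" it is
# 2/4 = 0.5: the frequent basis query's contribution is diluted
# in proportion to its DF.
```

In this no-cutoff limit the 1/DF dilution applies uniformly to every term, which is what makes it a clean starting point for the analysis suggested above.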

