
Query logs and information retrieval research


About one year ago, Bruce Croft asked the IR community for help getting access to query logs for academia:

The goal of this project is to create a database of web search activity that will be provided to the information retrieval research community to use on current and future information retrieval research projects.

To accomplish this, the Lemur Project developed a toolbar to be voluntarily installed by users. After a year of data collection, the project has been aborted:

Given that we have gathered the equivalent of less than 6 seconds of Google traffic (assuming 500 million queries per day) in one year, we have decided to terminate the project.
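To put that comparison in perspective, here is the arithmetic behind it (500 million queries per day is the quote's stated assumption, not a measured figure):

```python
queries_per_day = 500_000_000  # the quote's assumed Google query volume
queries_per_second = queries_per_day / 86_400  # seconds in a day

# "Six seconds of Google traffic" at that rate:
six_seconds_of_traffic = 6 * queries_per_second

print(round(queries_per_second))      # ≈ 5,787 queries/second
print(round(six_seconds_of_traffic))  # ≈ 34,722 queries
```

In other words, a year of volunteer toolbar data amounted to only tens of thousands of queries.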

This is pretty depressing news. Admittedly, part of this depression originates from my guilt over not having contributed to the project myself. However, a more substantial part stems from the potential this data set had to be groundbreaking, perhaps similar to the release of the first Tipster collections. Although this was way before my time, I imagine the sudden release of a large, public corpus resulted in a tremendous amount of activity and excitement.

Information retrieval research has had large collections of documents for a few decades now. We evaluate on a few hundred queries and publish results. With some exceptions, the majority of interest in the field has focused on scaling up corpora. As a result, we have a rich set of tools to analyze and retrieve documents from large corpora.

There are two things missing from this model: a rich stream of queries coming into the system and a rich stream of interactions between users and documents. Our friends in the CHI and information science communities have been doing a great job of understanding the important factors involved in user behavior at laboratory scale. However, I’m going to draw an analogy here between small-scale user studies for IR and document-level NLP analysis for IR that may raise a few eyebrows. I believe that many IR researchers would argue that, given the choice between corpus-driven approaches and NLP approaches to IR, they would opt for more data. This is despite the rich analysis NLP can provide. Similarly, I believe that the fine-grained analysis provided by laboratory studies may be less important than very large scale analysis of user behavior. Of course, both the results about NLP for IR and the claim about laboratory experiments are based on relatively limited experiments (e.g. small sets of queries). We should, as a community, continue research in all of these directions.

Having said this, let’s consider some motivations for web query logs in IR research.

Claim 1. Web query logs will help academia contribute to web search research.

There is no doubt that query logs are important for any search engine, web or otherwise. However, query logs are only one of the many sources of interaction data available in production. There are many, many other signals which can be effectively exploited for query understanding and document ranking. In my opinion, short of starting its own web search engine, academia will always be scrambling to catch up with industry’s data sources.

I convinced myself a few years ago that the resources required to build and maintain a web search engine may never exist in academia. This is not to say that academic IR researchers should give up on having impact on web search engines. IR research several decades old continues to impact modern search engine design. What needs to be determined is how the current academic IR researchers can more directly address the problems confronted by web search companies. I personally believe that a tight coupling between academic and industrial research labs needs to exist. This could be accomplished in a number of ways.

  1. add value to an existing search engine’s interface. If search engines provide ranker APIs, academics can develop new interfaces which may attract users and, as a result, interaction data.
  2. teach IR fundamentals during the academic year and collaborate intensively over the summer through internships or other programs. I am most familiar with Yahoo’s Key Scientific Challenges Fellowships and Faculty Engagement Grants. Similar programs exist at other web search engines.
  3. develop high-quality, public web search engine simulators which give students and researchers the ability to test algorithms in silico. Our SIGIR 2009 paper made extensive use of simulation whose parameters were grounded in real-world data. Systems research in computer architecture and computer networking has taken this approach for some time. SIGIR 2010 will host a workshop on simulated interaction.
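To make the simulator idea concrete, here is a toy sketch of the kind of component such a system might contain: a position-based click model that generates simulated user clicks over a ranked list. All names and parameter values below are illustrative assumptions, not taken from the SIGIR paper; a real simulator would fit these parameters to logged interaction data.

```python
import random

def simulate_session(ranked_docs, relevance, examine_decay=0.7, seed=0):
    """Simulate one user session under a simple position-based click model.

    The (assumed) model: the user examines the document at rank i with
    probability examine_decay**i, and clicks an examined document with
    probability equal to its relevance in [0, 1]. Returns clicked doc ids.
    """
    rng = random.Random(seed)
    clicks = []
    for i, doc in enumerate(ranked_docs):
        examined = rng.random() < examine_decay ** i
        if examined and rng.random() < relevance[doc]:
            clicks.append(doc)
    return clicks

# Example: estimate click-through rate per rank over many simulated sessions.
docs = ["d1", "d2", "d3"]
rel = {"d1": 0.9, "d2": 0.5, "d3": 0.2}
sessions = [simulate_session(docs, rel, seed=s) for s in range(10_000)]
```

A ranking algorithm could then be evaluated in silico by comparing simulated click counts across alternative orderings of `docs`, before any real user ever sees it.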

No doubt there are many, many other alternatives.

Claim 2. Web query logs will help academia contribute to production search research.

As stated earlier, IR research has looked at the document side for many, many problems. This research has benefited web search as well as search in other domains such as legal, news, and enterprise search.

User behavior data improved production web search engines; user behavior data will no doubt improve production non-web search engines. Just as with web search though, this data does not exist in academia.

I believe, though, that the barrier to entry for non-web/vertical search engines is somewhat lower. The collections are smaller and more manageable. At the same time, document representations can be richer for verticals, interaction is less constrained, and, as a result, the potential for attracting users may be higher than with portal web search engines.

If an academic institution maintained a domain-specific production search engine, academic research could become more relevant to industrial search engines. For example, academic institutions would easily be able to publish about query logs, interaction, large-scale adaptation, and online learning with large-scale, real-world data. One important, unresolved question is how to come to terms with experimental reproducibility when production data is often closed for privacy reasons.

Academic IR research will continue to contribute to general IR research. Students trained in IR fundamentals will continue to be strong candidates for research and development in production search companies. I believe that there is room for greater impact. How that happens remains to be seen.

