Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets

Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke
University of Waterloo

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general Web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam — pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset.

We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot” queries the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering — from among the worst to among the best.
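The paper does not spell out its classifier in this abstract, but the "simple content-based classifier" family it refers to is typically an online logistic-regression model over hashed byte n-gram features, which is cheap enough to score a billion pages on one machine. The sketch below is illustrative only, not the authors' implementation; the 4-gram feature choice, bucket count, and learning rate are all assumptions:

```python
import math

def features(text, num_buckets=1 << 20):
    """Hash overlapping byte 4-grams of the page into feature buckets."""
    data = text.encode("utf-8", errors="ignore")
    return {hash(data[i:i + 4]) % num_buckets for i in range(len(data) - 3)}

class OnlineSpamClassifier:
    """Sparse online logistic regression: one weight per hashed 4-gram."""

    def __init__(self, learning_rate=0.002):
        self.weights = {}          # bucket -> weight, implicitly 0.0
        self.learning_rate = learning_rate

    def score(self, feats):
        """Raw log-odds that the page is spam (higher = spammier)."""
        return sum(self.weights.get(f, 0.0) for f in feats)

    def train(self, text, is_spam):
        """Single stochastic-gradient step on one labelled page."""
        feats = features(text)
        p = 1.0 / (1.0 + math.exp(-self.score(feats)))
        gradient = (1.0 if is_spam else 0.0) - p
        for f in feats:
            self.weights[f] = self.weights.get(f, 0.0) + self.learning_rate * gradient
```

Because both feature extraction and the gradient update are linear in page length, a single pass over the corpus suffices, which is what makes whole-collection scoring on a personal computer plausible.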

Get the paper at arxiv.org

(Commentary from Mounia Lalmas)

The ClueWeb09 dataset was created to support research in information retrieval and related technologies. The dataset was crawled from the general web in early 2009 and consists of 1 billion web pages in ten languages. It is used by several tracks of the TREC conference, e.g. the Web track and the Session track. Because the dataset is a direct crawl of the web, it is likely to contain a good proportion of spam, both self-promotion (e.g. word stuffing) and mutual promotion (e.g. link farms). The aim of this paper is to examine the effect of spam on retrieval effectiveness, and to see what can be done about it.

This examination is very important, as many researchers around the world participating in TREC are concerned with the development of effective retrieval strategies, not with the (albeit important) issue of spam. It is nevertheless very likely, as demonstrated in the paper, that their approaches will not perform well: not necessarily because their approaches, models, or strategies are ineffective, although there is always room for improvement, but because they did not account, at all or properly, for the amount of spam in the dataset.

Previous TREC evaluation tracks using web-based datasets have used corpora containing little spam. When spam was identified as an issue, e.g. in the Blog track, its impact was not thoroughly examined. The authors claim, I quote, that "the use of the ClueWeb09 dataset places the spam issue front and center at TREC for the first time." I fully agree with this, and I know that we are going to be seriously confronted with it, as we will be using the dataset in our own work.

The paper provides concrete answers to the spam issue, as encountered with the ClueWeb09 dataset, which I believe will be of great use, help, and interest to the IR/TREC research community:

  • It provides a complete methodology for labeling a large dataset, here ClueWeb09, with minimal computation and training. Each generated label is a percentile score that can be used as input to classify a page as "spam" or "not spam", or for other tasks (re-ranking).
  • It provides several complete sets of spam labels, available for download at durum0.uwaterloo.ca/clueweb09spam.
  • It provides extensive experimental results showing a significant and substantive positive impact on effectiveness when the labels are used to remove or act upon "spammy" documents. This was demonstrated using the runs officially submitted by participants to the TREC Web ad hoc and relevance feedback tasks.
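As a rough illustration of how such percentile labels might be applied to a submitted run, the sketch below drops documents below a spam-percentile cutoff. It assumes the convention that lower percentiles indicate spammier pages; the `filter_run` helper, the threshold of 70, and the choice to keep unscored documents are all illustrative assumptions, not the paper's exact procedure:

```python
def filter_run(ranked_docids, percentile_score, threshold=70):
    """Drop documents whose spam percentile falls below the threshold.

    percentile_score: dict mapping docid -> percentile in [0, 99],
    where LOWER percentiles mean spammier pages (assumed convention).
    Documents without a score are kept, i.e. given the benefit of the doubt.
    """
    return [d for d in ranked_docids
            if percentile_score.get(d, 100) >= threshold]
```

For example, `filter_run(["a", "b", "c"], {"a": 10, "b": 85})` keeps `"b"` (scored above the cutoff) and `"c"` (unscored), while removing the spammy `"a"`. Raising or lowering the threshold trades recall against spam contamination, which matches the paper's observation that a single score supports "whatever the desired level of cleanness".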

This paper contains an extensive and systematic study of spam in a large real-world dataset, and the first quantitative results on the impact of spam filtering on retrieval effectiveness. The methodology and the produced sets of spam labels can be used by others (1) to 'clean' ClueWeb09, or a similar dataset, to whatever level of cleanness is desired, so that they can concentrate on their main research aims; and (2) as benchmarks for comparable studies.
