If I had a ten-thousand-node cluster…

Another aspect of this year’s SIGIR reviewing, though hardly a new one, was the number of papers from industry. As a fellow reviewer observed, these often come across as nicely polished internal tech reports; the kind of thing you should probably read once your globally-distributed search infrastructure has reached the million-query-a-day mark; a sort of post-coital pillow talk between the Yahoo and Bing engineering teams, upon which the rest of us are eavesdropping (and Google presumably doesn’t care to).

My own concern is not so much with the general applicability of this work, as it is with its reproducibility. In general, industry research is evaluated over datasets to which other researchers have no access. There is, therefore, no hope for the direct validation of the reported results; and given the specificity of the environments and datasets described, little hope for indirect reproduction, either.

Anyone working in experimental data analysis who has not already done so needs to read Keith A. Baggerly and Kevin R. Coombes, “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”, The Annals of Applied Statistics, 3(4) (2009). In it, the authors reverse-engineer microarray studies on responsiveness to cancer treatment, studies which omit the data and details needed for direct reproduction. Baggerly and Coombes find a catalogue of errors: “sensitive” and “resistant” labels for subjects switched; data columns offset by one; faulty duplication of test data; incorrect and inconsistent formulae for basic probability calculations; and so forth. They comment that “most common errors are simple … [and] most simple errors are common”. And these are not obscure papers, but large-team studies, which have led to patent grants and clinical trials — trials in which (often enough) errors in the original papers meant that patients were being given contra-indicated treatments.

Fortunately, in information retrieval, no lives are at risk; what industrial research groups do in the privacy of their data centers is between them and their multiply-redundant servers. But scientifically, we are in the same situation, or worse. We are accepting for publication research that we have no way of checking or validating, and that could be riddled with the most basic of errors. As readers and reviewers, we have only our sense of whether something feels right — and that in a domain in which we have little or no direct experience. Is this really desirable?

