In my last post, we saw that randomly swapping training labels, in a (simplistic) simulation of the effect of assessor error, leads as expected to a decline in classifier accuracy, with the decline being greater for lower prevalence topics (in part, we surmised, because of the primitive way we were simulating assessor errors). In this post, I thought it would be interesting to look inside the machine learner, and try to understand in more detail what effect the erroneous training data has. As we’ll see, we learn something about how the classifier works by doing so, but end up with some initially surprising findings about the effect of assessor error on the classifier’s model.
Most statistical machine classifiers, and certainly the Vowpal Wabbit classifier that we’ve used to date, work in two stages. First, you feed the classifier labelled training examples (in our case, documents marked by a human as either relevant or irrelevant), and the classifier builds a statistical model. Then, the classifier applies the model to unlabelled documents, and tries to infer their relevance.
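This two-stage workflow can be sketched in a few lines. The sketch below uses scikit-learn's logistic regression in place of Vowpal Wabbit, and the toy documents and labels are invented for illustration:

```python
# Stage 1: learn a model from labelled examples; Stage 2: apply it to
# unlabelled documents. (scikit-learn stands in for Vowpal Wabbit here.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["company issues new share capital", "quarterly eps results loom"]
train_labels = [1, 0]  # 1 = relevant, 0 = irrelevant

# Stage 1: reduce documents to term-count features and fit the model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
model = LogisticRegression().fit(X_train, train_labels)

# Stage 2: apply the learnt model to an unlabelled document.
X_new = vectorizer.transform(["rights issue to raise share capital"])
prediction = model.predict(X_new)
```

The real experiments of course involve thousands of training documents, not two; the sketch only shows the shape of the train-then-predict pipeline.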
How the classifier actually sees the document is not as a cogent piece of text, but reduced to a list of scored features: most basically, in text classification, a list of words and how many times each occurs in the document. The classifier then tries to figure out an equation, with a weight for each term, that is applied to each document and best fits the labelled training data. That equation looks something like so:

$$ r = \sum_{i=1}^{V} w_i x_i $$

where $r$ is the relevance (for training), or relevance score (for predicting); $x_i$ is the number of times the $i$'th word in the collection's vocabulary (of size $V$) occurs in the document in question; and $w_i$ is the weight, learnt by the classifier from the training data, that attaches to this $i$'th term. At least for simpler models (again, such as those used in Vowpal Wabbit), these term weights are the model, and we can understand something about how the classifier is making its decisions by examining these term weights; though, as we will see, this understanding has its limitations.
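As a concrete sketch of that weighted sum, here is the scoring equation in code; the terms and weights are made up for illustration and are not taken from any actual model:

```python
# Hypothetical term weights, as a classifier might learn them.
weights = {"share": 0.9, "capital": 0.7, "dividend": 0.4, "loom": -0.5}

def relevance_score(term_counts, weights):
    """Return sum of w_i * x_i over the terms in the document.

    Terms absent from the model's vocabulary contribute zero.
    """
    return sum(weights.get(term, 0.0) * count
               for term, count in term_counts.items())

# A document reduced to its term counts.
doc = {"share": 2, "capital": 1, "loom": 1}
score = relevance_score(doc, weights)  # 2*0.9 + 1*0.7 + 1*(-0.5)
```

A document scoring above the decision threshold (typically zero, once an intercept is folded in) is predicted relevant.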
Let’s start with the term weights from an error-free model, the model for the C171 topic (“share capital”), learnt from the 20k training examples described in the last post. The top ten most positive term weights (along with counts of the number of positive and negative training examples they occur in) are as follows:
These terms all make sense for a topic that is about share capital. (The terms with the most negative weights, beginning “qtly”, “loom”, “stimul”, and “eps”, are less immediately interpretable.)
So what happens if we start introducing (random) assessor errors? With a 10% error rate, the top ten most positive weighted terms become:
Five of these terms were in the error-free top ten, and a sixth (“float”) was at rank 12; but the remaining four terms (“punish”, “cheer”, “unload”, and “exploit”) all had negative weight in the error-free model; appear in none or only one relevant training document; and are evidently poor indicators that a document is about share capital.
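The error-injection procedure itself is simple: each training label is flipped independently with some fixed probability. A minimal sketch (the function name and seed are my own, for illustration):

```python
import random

def corrupt_labels(labels, error_rate, seed=0):
    """Flip each 0/1 label independently with probability error_rate."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < error_rate else y for y in labels]

true_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
noisy_labels = corrupt_labels(true_labels, 0.1)
```

Note that because flips are applied uniformly, a low-prevalence topic loses a far larger share of its scarce positive labels to noise than it gains in spurious positives, relative to the size of the positive class.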
We can visualize the change (and, presumably, derangement) in the model with assessor error by charting each term's weight in the error-free model against its weight in the error-prone one, for increasing error rates. Doing so gives us the following result:
Each point in this figure represents a term—only those appearing in 100 or more training documents are included, for clarity; the point's $x$ value is the term's weight in the error-free model; and the point's $y$ value is the term's weight in the error-prone one. The value in the top left is the Pearson's correlation coefficient, a summary statistic giving the strength of relationship between the two sets of weights; a correlation of $0$ means there is no relationship, while $1$ means a perfect linear relationship.
With increasing error, the error-prone weights diverge further and further from the error-free ones; with an error rate of 30%, the relationship between them is little better than random.
So far, so understandable. However, when we examine the model derangement for the other two topics, we find that the correlation declines with error rate in much the same way, despite there being a substantial (prevalence-dependent) difference in classifier effectiveness. The relationship for topic M14 with a 30% error rate:
is also little better than random, yet depth for 80% recall with this model is the far-from-random 26.0%. The following table summarizes the error model correlations and depth for recall values for the three topics:
| Topic | Prevalence | Correlation by error rate | DFR@80 by error rate |
This apparent disconnect between model derangement and classifier effectiveness in the presence of assessor error is surprising at first blush, but (I strongly suspect) is due to the simple way in which we are examining the model. Document relevance scores under statistical classification derive not from a few, high-weighted terms, but from the weighted sum of all terms in the document. Therefore, patterns of co-occurrence are very important in determining document predictions.
Purely random error generates considerable noise in the first-level weights of terms, but (as the error is random) much less noise in term co-occurrence statistics. Where prevalence is low, and hence positive examples are few, random error is still sufficient to drown out the signal from co-occurrence in correctly labelled documents. But where prevalence is higher, meaningful co-occurrence still outweighs random co-occurrence, and the model remains robust (though still, of course, partially degraded) in its effectiveness.
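A toy simulation can illustrate the prevalence side of this argument. Suppose a signal term appears in 80% of relevant documents and 5% of irrelevant ones (both rates invented for the sketch), and labels are flipped at random 30% of the time. We can then ask how strongly the term remains associated with the *labelled*-positive class at high versus low prevalence:

```python
import random

def term_rate_in_labelled_positives(n_docs, prevalence, error_rate, seed=1):
    """Fraction of labelled-positive docs containing the signal term."""
    rng = random.Random(seed)
    signal_in_pos = 0.8   # assumed rate of the term in relevant docs
    signal_in_neg = 0.05  # assumed rate in irrelevant docs
    pos_with_term = pos_total = 0
    for _ in range(n_docs):
        relevant = rng.random() < prevalence
        has_term = rng.random() < (signal_in_pos if relevant else signal_in_neg)
        # Simulated assessor error: flip the label at the given rate.
        label = (not relevant) if rng.random() < error_rate else relevant
        if label:
            pos_total += 1
            pos_with_term += has_term
    return pos_with_term / max(pos_total, 1)

high = term_rate_in_labelled_positives(20000, 0.30, 0.3)
low = term_rate_in_labelled_positives(20000, 0.01, 0.3)
```

At 30% prevalence the term still occurs in a clear majority-signal fraction of labelled positives, well above its background rate; at 1% prevalence the flipped negatives swamp the labelled-positive pool, and the term's rate there falls nearly to background, which is the drowning-out effect conjectured above.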