# Assessor error and term model weights


In my last post, we saw that randomly swapping training labels, in a (simplistic) simulation of the effect of assessor error, leads as expected to a decline in classifier accuracy, with the decline being greater for lower prevalence topics (in part, we surmised, because of the primitive way we were simulating assessor errors). In this post, I thought it would be interesting to look inside the machine learner, and try to understand in more detail what effect the erroneous training data has. As we’ll see, we learn something about how the classifier works by doing so, but end up with some initially surprising findings about the effect of assessor error on the classifier’s model.
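That label-swapping simulation can be sketched in a few lines (a minimal sketch of the idea only; the function and parameter names here are mine, not those of the actual experimental code):

```python
import random

def flip_labels(labels, error_rate, seed=0):
    """Simulate assessor error by independently flipping each binary
    relevance label with probability error_rate (the simplistic,
    symmetric error model used in the last post's experiments)."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < error_rate else y for y in labels]

# A toy collection with 5% prevalence:
labels = [1] * 50 + [0] * 950
noisy = flip_labels(labels, error_rate=0.10)
```

Note that, because flips are symmetric, a low-prevalence topic loses a far larger share of its (few) positive labels than it gains spurious ones, which is part of why low-prevalence topics suffered more.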

Most statistical machine classifiers, and certainly the Vowpal Wabbit classifier that we’ve used to date, work in two stages. First, you feed the classifier labelled training examples (in our case, documents marked by a human as either relevant or irrelevant), and the classifier builds a statistical model. Then, the classifier applies the model to unlabelled documents, and tries to infer their relevance.

The classifier does not actually see the document as a cogent piece of text; rather, the document is reduced to a list of scored features: most basically, in text classification, a list of the words in the document and the number of times each occurs. The classifier then tries to find an equation, with a weight for each term, that, when applied to the documents, best fits the labelled training data. That equation looks something like this:

$r = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$

where $r$ is the relevance (for training) or relevance score (for prediction); $x_i$ is the number of times the $i$th word in the collection’s vocabulary occurs in the document in question; and $\beta_i$ is the weight, learnt by the classifier from the training data, that attaches to that $i$th term. At least for simpler models (again, such as those used in Vowpal Wabbit), these term weights are the model, and we can understand something about how the classifier makes its decisions by examining them; though, as we will see, this understanding has its limitations.
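As a concrete illustration, scoring a document under such a linear model is just this weighted sum over term counts (a toy sketch, not Vowpal Wabbit's actual implementation; the example weights are a few of those from the C171 model discussed below, but the scoring function and document are invented):

```python
from collections import Counter

def relevance_score(text, weights):
    """r = sum_i beta_i * x_i: weight each term by its learnt weight
    and its count in the document; unseen terms contribute zero."""
    counts = Counter(text.lower().split())
    return sum(weights.get(term, 0.0) * n for term, n in counts.items())

# A few (stemmed) term weights, as in the error-free C171 model:
weights = {"ipo": 0.858, "buyback": 0.790, "stock": 0.615}
score = relevance_score("the ipo boosted the stock buyback", weights)
```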

Let’s start with the term weights from an error-free model, the model for the C171 topic (“share capital”), learnt from the 20k training examples described in the last post. The top ten most positive term weights (along with counts of the number of positive and negative training examples they occur in) are as follows:

| Feature name | Weight | Pos. DF | Neg. DF |
|--------------|-------:|--------:|--------:|
| ipo          | 0.858  | 78      | 23      |
| buyback      | 0.790  | 33      | 23      |
| underwrit    | 0.758  | 77      | 107     |
| split        | 0.739  | 35      | 273     |
| offer        | 0.669  | 159     | 1086    |
| repurchas    | 0.652  | 49      | 168     |
| stock        | 0.615  | 256     | 2970    |
| back         | 0.603  | 55      | 2286    |
| shar         | 0.580  | 411     | 4476    |
| issu         | 0.517  | 161     | 2999    |

These terms all make sense for a topic that is about share capital. (The terms with the most negative weights, a list that begins “qtly”, “loom”, “stimul”, “eps”, are less immediately interpretable.)
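Given the model as a mapping from feature name to weight (for Vowpal Wabbit, such a mapping can be recovered from its human-readable model output), rankings like the one above can be produced with a simple sort. A sketch, with invented values for the negatively weighted terms:

```python
def top_terms(weights, k=10, positive=True):
    """Return the k most positively (or, with positive=False, most
    negatively) weighted terms in a linear model's weight vector."""
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=positive)
    return ordered[:k]

# "ipo" and "buyback" weights are from the C171 model above; the
# negative weights are invented for illustration.
model = {"ipo": 0.858, "buyback": 0.790, "qtly": -0.41, "loom": -0.32}
top_terms(model, k=2)                   # most positive terms
top_terms(model, k=2, positive=False)   # most negative terms
```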

So what happens if we start introducing (random) assessor errors? With a 10% error rate, the top ten most positive weighted terms become:

| Feature name | Weight | Pos. DF | Neg. DF |
|--------------|-------:|--------:|--------:|
| split        | 0.609  | 66      | 242     |
| ipo          | 0.589  | 71      | 30      |
| buyback      | 0.587  | 31      | 25      |
| unload       | 0.543  | 16      | 64      |
| repurchas    | 0.432  | 60      | 157     |
| underwrit    | 0.422  | 80      | 104     |
| cheer        | 0.417  | 14      | 67      |
| punish       | 0.406  | 21      | 100     |
| exploit      | 0.400  | 19      | 70      |
| float        | 0.393  | 63      | 195     |
Five of these terms were in the error-free top ten, and a sixth (“float”) was at rank 12; but the remaining four (“punish”, “cheer”, “unload”, and “exploit”) all had negative weights in the error-free model, appeared in at most one relevant training document, and are evidently poor indicators that a document is about share capital.

We can visualize the change (and, presumably, derangement) in the model with assessor error by charting term weights in the error-free model against those in the error-prone one, for increasing error rates. Doing so gives us the following result:

Relationship between error-free and error-prone model weights for different assessor error levels, RCV1v2 topic C171, 20k training documents.

Each point in this figure represents a term (for clarity, only terms appearing in 100 or more training documents are included). The point’s $x$ value is the term’s weight in the error-free model, and its $y$ value is the term’s weight in the error-prone one. The value in the top left is Pearson’s correlation coefficient, a summary statistic giving the strength of the linear relationship between the two sets of weights: a correlation of $0$ means there is no relationship, while $1$ means a perfect linear relationship.
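For reference, the correlation statistic shown in the figure can be computed as follows (a standard Pearson implementation over the two weight vectors; the plotting itself is omitted):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length
    sequences, e.g. error-free vs error-prone term weights."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

pearson([1.0, 2.0, 3.0], [2.1, 3.9, 6.0])  # close to 1: near-linear
```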

With increasing error, the error-prone weights diverge further and further from the error-free ones; with an error rate of 30%, the relationship between them is little better than random.

So far, so understandable. However, when we examine the model derangement for the other two topics, we find much the same decline in correlation with error rate, despite there being a substantial (prevalence-dependent) difference in classifier effectiveness. The relationship for topic M14 at a 30% error rate:

Term weights with 30% assessor error, compared to error-free weights, RCV1v2 topic M14, 20k training documents

is also little better than random, yet depth for 80% recall with this model is the far-from-random 26.0%. The following tables summarize the model-weight correlations and depth-for-recall values for the three topics, at each error rate.

Correlation between error-free and error-prone term weights, by error rate:

| Topic | Prevalence | 1%   | 5%   | 10%  | 20%  | 30%  |
|-------|-----------:|-----:|-----:|-----:|-----:|-----:|
| M14   | 10.6%      | 0.80 | 0.56 | 0.42 | 0.26 | 0.16 |
| C171  | 2.3%       | 0.73 | 0.49 | 0.37 | 0.24 | 0.13 |
| G158  | 0.5%       | 0.67 | 0.41 | 0.29 | 0.17 | 0.12 |

Depth for 80% recall (DFR@80), by error rate:

| Topic | 0%   | 1%    | 5%    | 10%   | 20%   | 30%   |
|-------|-----:|------:|------:|------:|------:|------:|
| M14   | 8.7% | 8.8%  | 9.1%  | 9.9%  | 15.0% | 26.0% |
| C171  | 4.2% | 6.4%  | 12.0% | 19.6% | 32.7% | 44.5% |
| G158  | 8.5% | 22.6% | 37.4% | 46.1% | 52.3% | 72.8% |

This apparent disconnect between model derangement and classifier effectiveness in the presence of assessor error is surprising at first blush, but (I strongly suspect) is due to the simple way in which we are examining the model. Document relevance scores under statistical classification derive not from a few, high-weighted terms, but from the weighted sum of all terms in the document. Therefore, patterns of co-occurrence are very important in determining document predictions.

Purely random error generates considerable noise in the first-level weights of terms, but (as the error is random) much less noise in term co-occurrence statistics. Where prevalence is low, and hence positive examples are few, random error is still sufficient to drown out the signal from co-occurrence in correctly labelled documents. But where prevalence is higher, meaningful co-occurrence still outweighs random co-occurrence, and the model remains robust (though still, of course, partially degraded) in its effectiveness.