Errors in Machine Learning Benchmark Datasets

Introduction: The importance of dataset quality

It’s interesting that, while the importance of dataset quality is almost axiomatic in the physical sciences, machine learning research has overwhelmingly focused on model improvements rather than dataset improvements. Researchers primarily use performance on existing benchmark datasets as a proxy for model improvements. This is likely because creating a high-quality dataset is a much harder task than training a model; whole companies, such as Scale AI, exist solely to build high-fidelity machine learning datasets. In fact, many of these benchmarks, which are intended to be “gold standard” datasets, contain labeling errors (e.g., ImageNet, Amazon Reviews), as shown by Northcutt et al. Importantly, if such issues can plague meticulously curated benchmarks, you can be sure that label errors are even more pervasive in the real-world datasets used to inform high-stakes decisions in the financial, legal, and healthcare domains.

Understanding and correcting noisy labels in these datasets is key (🔐) to mitigating risk and improving decision making. In this blog, I explore the confident learning approach developed by Northcutt et al., and apply it to the MultiNLI (Multi-Genre Natural Language Inference) dataset. This dataset forms the basis for one of the tasks in the canonical “GLUE” natural language processing benchmark.

Textual entailment and the MultiNLI dataset

Textual entailment is a natural language processing task that involves understanding the logical relationship between two pieces of text called the “premise” and the “hypothesis”. Entailment can be framed as a three-class classification task in which a model attempts to determine if the “hypothesis” can be logically inferred from the “premise,” assigning the two pieces of text to one of three possible labels: contradiction, neutral, and entailment. Examples of each of these labels in the MultiNLI dataset are shown below:

Contradiction:

Premise: Your contribution helped make it possible for us to provide our students with a quality education.

Hypothesis: Your contributions were of no help with our students’ education.

Neutral:

Premise: yeah well you’re a student right

Hypothesis: Well you’re a mechanics student right?

Entailment:

Premise: The other name, native well is, as a later explorer David Carnegie, author of Spinifex and Sand (1898), points out, a misnomer.

Hypothesis: The alternative name, resulting from a translation, was a misnomer according to the explorer David Carnegie.

The MultiNLI dataset (which was created at NYU by Williams et al.) contains about 433k such sentence pairs annotated with textual entailment labels. The crowd-labeled pairs are taken from a variety of genres of both spoken and written text.
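For reference, the dataset is easy to pull down and inspect. The snippet below is a minimal sketch using the Hugging Face datasets library; the `multi_nli` hub identifier and its integer label mapping are details of that hosting rather than anything specified in the original paper, so treat them as assumptions.

```python
from datasets import load_dataset

# Pull MultiNLI from the Hugging Face hub: a train split plus "matched" and
# "mismatched" validation splits (matched shares genres with the training data).
mnli = load_dataset("multi_nli")

example = mnli["validation_matched"][0]
print(example["premise"])
print(example["hypothesis"])

# Map the integer label back to its name (entailment / neutral / contradiction).
label_names = mnli["train"].features["label"].names
print(label_names[example["label"]])
```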

Confident learning: a method for identifying noisy labels

Confident learning is a data-centric approach for identifying which data in a dataset have noisy labels (i.e., which data points are mislabeled or confusing). The approach was developed at MIT by Northcutt et al. The intuition behind it is that a model’s confidence in its predictions on a held-out set can be used to identify and correct mislabeled data within that held-out set (hence the name confident learning). From a practical standpoint, given a fixed classification ontology, data points identified as having noisy labels can either be relabeled or removed.

Confident learning can also reveal when a classification ontology is not well-structured. Take, for instance, an image classification task that differentiates between two breeds of cows: Ayrshire and Guernsey.

Which breed of dairy cow are YOU?

Due to the visual similarities between these breeds, classifying them based on images alone may result in low prediction confidence. Simply discarding these data is likely sub-optimal, however, as the classifier is accurately recognizing the subjects as cows. Here, the issue may lie in the design of the classification ontology itself, and it might be more practical to combine these two categories into a single, broader class, such as ‘brown cow’.

Feel free to skip this next portion if you are not interested in the math; it’s not strictly necessary for developing an intuition around the approach. But I like math, so here is a little: Confident learning assumes that all datapoints in a labeled dataset, \(X := (x, \tilde{y})\) (where \(\tilde{y}\) is the potentially noisy assigned label), have a latent true label \(y^*\). Confident learning aims to estimate \(p(\tilde{y}, y^*)\), the joint distribution between the noisy and true labels. This is done by creating a matrix called the confident joint, which is given by:

\[C_{\tilde{y}, y^*}[i][j] := |\hat{X}_{\tilde{y}=i,y^*=j}| \text{ where } \hat{X}_{\tilde{y}=i,y^*=j} := \{ x \in X_{\tilde{y}=i} : \hat{p}(y = j; x, \theta) \geq t_j \}\]

where the threshold \(t_j\) is given by

\[t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y} = j; x, \theta)\]

Each entry in the confident joint is a count of the items in the dataset that are labeled \(\tilde{y}=i\) and for which the model is confident that the true label is \(y^*=j\). Items that fall on the diagonal are items that appear to be labeled correctly.
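To make these definitions concrete, here is a small NumPy sketch of the confident joint exactly as written above. The function and variable names are mine, and note that the full published algorithm additionally assigns each example to only its single most likely confident class, rather than to every class that clears its threshold.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Confident joint C[i, j]: count of examples with given label i whose
    predicted probability for class j meets the per-class threshold t_j.

    labels:     (n,) integer array of given (possibly noisy) labels
    pred_probs: (n, K) array of out-of-sample predicted probabilities
    """
    n, K = pred_probs.shape

    # t_j: average self-confidence over examples whose given label is j.
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(K)]
    )

    C = np.zeros((K, K), dtype=int)
    for i in range(K):
        probs_i = pred_probs[labels == i]   # examples with given label i
        confident = probs_i >= thresholds   # (n_i, K) boolean mask
        # Simplification: count every class that clears its threshold;
        # the published method breaks ties by argmax over confident classes.
        C[i] = confident.sum(axis=0)
    return C
```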

Finetuning a BERT model on the MultiNLI dataset

I link a Colab notebook for running the model training, as well as for performing confident learning to identify noisy labels. I performed hyperparameter tuning over the training time in epochs, weight decay, batch size, and learning rate using the Ray Tune hyperparameter tuning package. By default, Ray Tune uses a tree-structured Parzen estimator approach, which is a Bayesian optimization method introduced by Bergstra et al. I identified the optimal training hyperparameters to be training for 1 epoch with a batch size of 64 and a learning rate of 2.88e-05. I still performed training on an A100 to investigate the impact of larger batch sizes, but it is possible to train this model on a GPU with less memory, like a T4 or V100.
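The training loop itself is standard. The sketch below is not the notebook’s code, but it shows roughly what fine-tuning with the hyperparameters above looks like using the Hugging Face transformers Trainer; the choice of bert-base-uncased and the 128-token truncation length are my assumptions.

```python
from datasets import load_dataset
from scipy.special import softmax
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # assumption; the post only says "a BERT model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

mnli = load_dataset("multi_nli")

def tokenize(batch):
    # Premise/hypothesis pairs are packed into a single sequence with [SEP].
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

encoded = mnli.map(tokenize, batched=True)

# Hyperparameters from the tuning run described above.
args = TrainingArguments(
    output_dir="mnli-bert",
    num_train_epochs=1,
    per_device_train_batch_size=64,
    learning_rate=2.88e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation_matched"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()

# Out-of-sample predicted probabilities for the matched validation set;
# these feed the confident-learning step in the next section.
logits = trainer.predict(encoded["validation_matched"]).predictions
pred_probs = softmax(logits, axis=-1)
```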

Looking for label errors

After training a model on the MultiNLI task, I used the confident learning approach to look for label errors in the matched validation set. The confident joint matrix for the matched validation set is shown below.

Confident joint for the MultiNLI matched validation set. The x-axis corresponds to the latent true label \(y^*\), and the y-axis corresponds to the given (noisy) label \(\tilde{y}\).

Of the 9,815 validation examples, 595 fell off the diagonal of the confident joint, corresponding to ~6% of the set. The data suggest that most label noise arises in relation to the neutral class. This is to be expected, as the neutral class bridges the other two and is generally the most nebulous of the three. The data also suggest some labeler bias away from the contradiction label and towards the entailment label, with the cells below the diagonal of the confident joint containing more examples than those above it.
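Surfacing the suspect examples takes only a few lines once you have out-of-sample predicted probabilities. The sketch below uses the open-source cleanlab package, which implements confident learning, to rank the matched validation examples most likely to be mislabeled; it assumes the `pred_probs` and `mnli` objects from the training sketch above.

```python
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array(mnli["validation_matched"]["label"])

# Indices of likely label issues, most suspicious first.
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

label_names = ["entailment", "neutral", "contradiction"]
for i in issue_idx[:5]:
    ex = mnli["validation_matched"][int(i)]
    print(f"given: {label_names[ex['label']]}, "
          f"model: {label_names[int(pred_probs[i].argmax())]}")
    print("premise:   ", ex["premise"])
    print("hypothesis:", ex["hypothesis"], "\n")
```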

Below, I present a selection of cherry-picked off-diagonal entries. Upon a vibes-based manual inspection (extremely rigorous), these entries generally fall into three categories:

  1. There are indeed many mislabeled data points. This is somewhat expected given the crowd-sourced nature of the dataset, even though there were likely multiple annotators per data point.
  2. Many off-diagonal entries contain vague or unrelated premises or hypotheses. These are not only more difficult to label but also likely irrelevant to the task.
  3. Lastly, a significant number of the off-diagonal examples are simply hard! As a result, they are either frequently mislabeled by human annotators or confidently mislabeled by the model.

\(\tilde{y}=\text{entailment},\ y^*=\text{neutral}\)

\(\tilde{y}=\text{entailment},\ y^*=\text{contradiction}\)

\(\tilde{y}=\text{neutral},\ y^*=\text{entailment}\)

\(\tilde{y}=\text{neutral},\ y^*=\text{contradiction}\)

\(\tilde{y}=\text{contradiction},\ y^*=\text{entailment}\)

\(\tilde{y}=\text{contradiction},\ y^*=\text{neutral}\)

Final thoughts

I like confident learning! My assessment is that the primary benefits of confident learning are that it’s model-agnostic and it works. I wonder if there is a way to incorporate uncertainty estimates in situations where they are available (e.g., neural network dropout, deep evidential classification). There is now a whole company built around algorithmic data-cleaning methods, called Cleanlab, with confident learning seeming to be one of the core offerings. Cool stuff!