5 Key Findings from MIT's Research Into Data Annotation & Labeling

Researchers at the MIT Computer Science & Artificial Intelligence Lab recently published a paper examining common datasets used for machine learning (ML). What they found was surprising and has significant implications for AI research. In this article, we review the five most relevant findings of their research and analyze the potential impact on the field of AI.

The Average Error Rate Was Over 3%

Examining a sample of 10 major datasets frequently cited by AI/ML experts, the MIT researchers found an average label error rate of 3.4%. ImageNet, one of the best-known open-source AI datasets, had an error rate of approximately 6%.

For AI researchers, this is an alarming finding. An error rate in excess of 3% can meaningfully distort both how models learn and how reliably they can be evaluated.
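To build some intuition for why a few percentage points of label error matter, here is a minimal simulation of our own (not taken from the MIT paper): a model that is genuinely 95% accurate, scored against a test set in which 6% of the labels are wrong. The specific numbers (95% model accuracy, 6% label noise, 50,000 test examples) are assumptions chosen purely for illustration.

```python
# Toy simulation (illustrative assumptions, not the MIT paper's methodology):
# how test-set label noise distorts the accuracy we *measure* for a model
# whose true accuracy is 95%.
import numpy as np

rng = np.random.default_rng(0)
n_test = 50_000          # hypothetical test-set size
true_accuracy = 0.95     # the model's accuracy against *correct* labels
label_error_rate = 0.06  # roughly the ImageNet error rate reported by MIT

# True = the model's prediction matches the true label for that example.
correct_vs_truth = rng.random(n_test) < true_accuracy

# Corrupt a fraction of the test labels. For simplicity, assume a corrupted
# label never happens to match the model's prediction.
corrupted = rng.random(n_test) < label_error_rate
correct_vs_benchmark = np.where(corrupted, ~correct_vs_truth, correct_vs_truth)

print(f"True accuracy:     {correct_vs_truth.mean():.3f}")
print(f"Measured accuracy: {correct_vs_benchmark.mean():.3f}")
```

With these numbers, the measured accuracy comes out to roughly 90% even though the model is actually 95% accurate; the benchmark understates the model's quality purely because of label noise.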

These Datasets Are Widely Cited

The authors of the paper noted that the 10 datasets chosen for the study have been cited by the academic community over 100,000 times. The popularity of these datasets, combined with the error rate, should raise questions about how AI and machine learning research studies are conducted. If AI researchers are testing their models against erroneously labeled datasets, then those errors may explain some discrepancies in AI model results.

Some Training Images Were Completely Mislabeled

When it comes to data labeling errors, the degree of the error matters. For example, if a bucket of baseballs is labeled as a single object rather than as multiple objects, a machine learning model may not classify those objects correctly in the test set, but at least there is some truth to the underlying label. What machine learning researchers should be careful to avoid are completely wrong labels, such as labeling that same bucket of baseballs as a "hat." In the ImageNet dataset, the MIT researchers pointed out several examples of labels that were entirely incorrect.

Faulty Training Data Can Corrupt Benchmarks

One of the major problems with faulty data occurs when incorrectly labeled images are used as benchmarks. Whenever you're doing machine learning, you want to be able to assess the accuracy and validity of your model. However, the MIT researchers point out that if mislabeled examples make their way into the test set, you can't trust the results of your benchmarks.

We consider improperly labeled data one of the biggest pitfalls of working with real-world datasets. At a minimum, AI/ML researchers should ensure that their test set is 100% properly labeled. Then, researchers should examine their training set to minimize its "label noise." Taking practical steps to ensure proper data labeling will benefit your AI/ML model in the long run.
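As a sketch of what "examining the training set" can look like in practice, the snippet below flags examples whose given label an out-of-sample model strongly disagrees with. This is a deliberately simplified stand-in for more rigorous approaches such as confident learning; the dataset (scikit-learn's digits), the logistic-regression model, and the 0.1 probability threshold are all placeholder assumptions.

```python
# Sketch: surface likely label errors by checking how much probability an
# out-of-sample model assigns to each example's given label.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)  # placeholder dataset

# Out-of-sample class probabilities for every example via 5-fold CV.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000),  # placeholder model
    X, y, cv=5, method="predict_proba",
)

# Probability the model assigns to each example's *given* label.
given_label_prob = pred_probs[np.arange(len(y)), y]

# Flag examples where the model strongly prefers some other class.
suspect = np.where(given_label_prob < 0.1)[0]
print(f"{len(suspect)} of {len(y)} examples look suspicious; review these first:")
print(suspect[:20])
```

Examples flagged this way are candidates for human review, not automatic removal: the model can be wrong, but a short ranked list is far cheaper to audit than an entire dataset.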

You Can See Particularly Egregious Examples for Yourself

The MIT researchers even created a website highlighting some of the most egregious errors in ImageNet. On the site, the researchers show how ImageNet routinely mixes up different animal species and sometimes does not contain a proper label for the central focus of the image. For example, ImageNet identifies a meerkat as a red panda, and an Irish Water Spaniel as an Irish Wolfhound. These may not seem like major errors, but incorrect labels can negatively affect the fidelity of a machine learning model.

However, there is a solution: synthetic data.

Fortunately, there is an alternative to real-world data collection and labeling: synthetic data. This type of data is computer-generated and automatically labeled for machine learning. At Simerse, we specialize in training AI and machine learning models using synthetic data. Because labels are generated alongside the data itself, you can ensure that 100% of the objects in an image are properly annotated. Synthetic data also offers significant flexibility when it comes to data annotation. All in all, we highly recommend synthetic data generation.
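To make the "correct by construction" idea concrete, here is a toy illustration (our own sketch, not Simerse's actual pipeline): render a simple synthetic scene procedurally and emit its annotations in the same step, so every object's label and bounding box come straight from the generator. The shapes, file names, and annotation format are placeholder assumptions.

```python
# Toy synthetic-data generator: draw objects and record their annotations
# in the same step, so labels can never disagree with the image.
import json
import random
from PIL import Image, ImageDraw

random.seed(0)
img = Image.new("RGB", (256, 256), "white")
draw = ImageDraw.Draw(img)
annotations = []

for _ in range(5):
    # Place a random circle ("ball") and record its bounding box as we draw it.
    x, y, r = random.randint(40, 216), random.randint(40, 216), random.randint(10, 30)
    box = [x - r, y - r, x + r, y + r]
    draw.ellipse(box, fill="red")
    annotations.append({"label": "ball", "bbox": box})

img.save("synthetic_scene.png")
with open("synthetic_scene.json", "w") as f:
    json.dump(annotations, f, indent=2)

print(f"Rendered 1 image with {len(annotations)} perfectly labeled objects.")
```

Because the generator knows exactly what it drew, the annotations cannot drift out of sync with the image, which is precisely the property that manual labeling struggles to guarantee at scale.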