5 Key Findings from MIT's Research Into Data Annotation & Labeling
The Average Error Rate Was Over 3%
When examining the sample data, which consisted of 10 major datasets frequently cited by AI/ML experts, the MIT researchers found an average label error rate of 3.4%. ImageNet, one of the most well-known open-source AI datasets, had an error rate of approximately 6%. Wow!
For AI researchers, this is an alarming finding. An observed error rate in excess of 3% could significantly impact the measured performance of machine learning models.
These Datasets Are Widely Cited.
The authors of the paper noted that the 10 datasets chosen for the study have been cited by the academic community over 100,000 times. The popularity of these datasets, in combination with the error rate, should raise questions about how AI and machine learning research studies are conducted. If AI researchers are testing their models against erroneously labeled datasets, then those errors may explain some discrepancies in AI model results.
Some Training Images Were Completely Mislabeled.
Faulty Training Data Can Corrupt Benchmarks
One of the major problems with faulty data occurs when incorrectly labeled images end up in the test set used for benchmarking. Whenever you're doing machine learning, you want to be able to assess the accuracy and validity of your model. However, the MIT researchers point out that if mislabeled data is part of the test set, then you can't trust the results of your benchmarks.
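To make the benchmark problem concrete, here is a minimal toy sketch in Python. The labels and the two models are invented for illustration and are not taken from the MIT study. It shows how two annotation errors in a 10-image test set can flip which model appears to be better.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)

# Hypothetical ground truth for 10 test images (illustrative values only).
true_labels = ["cat", "dog"] * 5

# The same test set as published, with two annotation errors (indices 2 and 7).
noisy_labels = list(true_labels)
noisy_labels[2] = "dog"
noisy_labels[7] = "cat"

# Model A matches the ground truth except for one genuine mistake at index 0.
model_a = list(true_labels)
model_a[0] = "dog"

# Model B reproduces the annotation errors and also misses index 0.
model_b = list(noisy_labels)
model_b[0] = "dog"

# Scored against the correct labels, model A is clearly better.
print(accuracy(model_a, true_labels), accuracy(model_b, true_labels))    # 0.9 0.7

# Scored against the mislabeled test set, the ranking is reversed.
print(accuracy(model_a, noisy_labels), accuracy(model_b, noisy_labels))  # 0.7 0.9
```

In this toy case the better model looks worse on the published benchmark, simply because it disagrees with the wrong labels.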
We consider improperly labeled data one of the biggest pitfalls with real-world data labeling. At a minimum, AI/ML researchers should ensure that their test set is 100% properly labeled. Then, researchers should examine their training set to minimize the "label noise" of the dataset. Taking practical steps to ensure proper data labeling will benefit your AI/ML model in the long run.
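One practical way to start examining a dataset for label noise is to flag examples where a trained model confidently disagrees with the given label, and then send those examples for human review. The sketch below is a simplified heuristic of our own for illustration, not the method used by the MIT researchers. The function name `flag_suspect_labels`, the `threshold` parameter, and the sample probabilities are all hypothetical, and we assume the probabilities come from a model that did not see these examples during training.

```python
def flag_suspect_labels(labels, pred_probs, threshold=0.9):
    """Return indices where the model confidently predicts a different class
    than the given label. Flagged items are candidates for review, not
    confirmed errors."""
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        predicted = max(probs, key=probs.get)  # class with the highest probability
        if predicted != label and probs[predicted] >= threshold:
            suspects.append(i)
    return suspects

# Hypothetical given labels and model probabilities for four images.
labels = ["cat", "dog", "cat", "cat"]
pred_probs = [
    {"cat": 0.95, "dog": 0.05},  # agrees with the label
    {"cat": 0.10, "dog": 0.90},  # agrees with the label
    {"cat": 0.08, "dog": 0.92},  # labeled "cat", model strongly says "dog"
    {"cat": 0.45, "dog": 0.55},  # disagrees, but not confidently
]

suspects = flag_suspect_labels(labels, pred_probs, threshold=0.9)
print(suspects)  # [2]
```

A higher threshold flags fewer examples with more certainty, while a lower one casts a wider net and creates more review work.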
You Can See Particularly Egregious Examples for Yourself.
However, there is a solution: synthetic data.
Fortunately, there is an alternative to real-world data collection and labeling. It's called synthetic data. This type of data is computer-generated and automatically labeled for machine learning. At Simerse, we specialize in training AI and machine learning models using synthetic data. By automatically labeling data, you can ensure that 100% of objects in an image are properly annotated. Moreover, synthetic data offers significant flexibility when it comes to data annotation. All in all, we highly recommend synthetic data generation.
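To illustrate why synthetic labels can be correct by construction, here is a deliberately tiny Python sketch. It is not our production pipeline, and the function `make_synthetic_sample` and its parameters are hypothetical. It draws one rectangle on a blank grid and emits the bounding box from the same numbers used to draw it, so the annotation cannot drift from the image.

```python
import random

def make_synthetic_sample(width=16, height=16, rng=None):
    """Draw one filled rectangle on a blank grid and return (image, label).

    The label is built from the same coordinates used for drawing,
    so no separate annotation step is needed.
    """
    rng = rng or random.Random()
    box_w = rng.randint(2, width // 2)
    box_h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - box_w)   # keep the box inside the image
    y0 = rng.randint(0, height - box_h)

    image = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 1

    label = {"class": "rectangle", "bbox": (x0, y0, box_w, box_h)}
    return image, label

# A fixed seed makes the generated sample reproducible.
image, label = make_synthetic_sample(rng=random.Random(0))
print(label["class"], label["bbox"])
```

Real synthetic data generators render far richer scenes, but the principle is the same: the generator already knows where every object is.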