5 Things to Know About Real-World Data Collection for Machine Learning

Recent advancements in machine learning are absolutely phenomenal. There is no other way to put it. Research scientists from top-tier universities and data scientists from leading technology firms have created incredibly sophisticated models and frameworks to teach computers how to learn. Today, deep convolutional neural networks (neural networks are so called because their design is loosely inspired by the human brain) are considered to be at the leading edge of machine learning for computer vision.

However, these deep convolutional neural networks do not learn on their own. The knowledge a model ends up with must come from somewhere, and we call that source training data. By feeding large volumes of training data into a convolutional neural network, we can teach a computer how to detect and classify objects, measure size and relative depth, and understand the attributes of an image.

Often, the challenge of machine learning lies not only in designing the right model but also in collecting and labeling the training data to teach it. That brings us to our first thing to know about real-world data collection:

How many images do you need to train a machine learning model?

In order to properly teach a machine learning model, you need to feed huge datasets into the algorithm. For a computer vision model, 100 images will not do. As a rough rule of thumb, you typically need at least 10,000 images for even the simplest object detection model. A typical, run-of-the-mill object detection model will likely require between 50,000 and 100,000 images. Even more astounding, an advanced computer vision model will probably require one million or more training images. That’s a lot of images!

Real-world training data can be challenging to collect.

As you may expect, gathering over a million images or frames of video for a computer vision model can be tough work. For certain applications, merely gathering the raw data is problematic. This problem is compounded if you are teaching a computer to detect rare events, called edge cases. By definition, edge cases rarely occur in the real world, yet they are often some of the most valuable and critical situations where you want the computer vision system to work properly. 

Since edge cases are, by their very nature, rarely occurring, it can be difficult to collect enough examples to teach a computer vision model. As a result, machine learning researchers often have to over-collect to capture a sufficient number of edge cases, which can increase the overall time and costs necessary to collect training data.
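
To get a feel for how quickly over-collection adds up, here is a minimal Python sketch of the arithmetic. The function name images_needed and the example numbers (1,000 edge-case examples wanted, one edge case per 500 frames) are assumptions made up for illustration, not figures from a real project. The sketch also assumes edge cases show up independently at a fixed rate, and it returns only the expected collection size.

```python
import math
from fractions import Fraction


def images_needed(target_edge_cases: int, edge_case_rate: Fraction) -> int:
    """Expected number of raw images to collect so that, on average,
    `target_edge_cases` of them contain the edge case.

    `edge_case_rate` is the (assumed) share of collected images that
    contain the edge case, e.g. Fraction(1, 500) for 1 in 500.
    """
    if not 0 < edge_case_rate <= 1:
        raise ValueError("edge_case_rate must be in (0, 1]")
    # Exact rational arithmetic avoids floating-point rounding surprises.
    return math.ceil(target_edge_cases / edge_case_rate)


# Hypothetical scenario: we want 1,000 edge-case examples and only
# 1 in 500 collected frames contains one.
print(images_needed(1_000, Fraction(1, 500)))  # 500000
```

Because this is only an expected value, a real collection run could come in under the target, so in practice you would likely budget a margin on top.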

Real data is expensive to label.

If you’ve ever solved a CAPTCHA, you understand the tedium of data labeling. Instead of clicking the car or bicycle to verify that you’re a human, imagine having to draw a box around every bicycle or car. Except instead of doing that once or twice, you have to do that 100,000 or more times. It is an absolutely tedious and exhausting process, and you are not going to want to do it yourself. Many companies outsource data labeling, but even that is an expensive proposition. As we’ll explain below, there are even more problems with data labeling.

Manual data labeling has its challenges.

Humans are creative, gifted, talented, social, and incredible in almost every way. But for all of our strengths, there are tasks that computers are better suited to perform than people. Data labeling is absolutely one of those tasks. Manual data labeling is tedious, slow, and generally prone to errors. Anytime you ask someone to draw a box around an object in a large image dataset, you are bound to get mislabeled images. It is not anyone’s fault. It is just that humans are not as good at performing robot-like tasks as, well, a robot.

For a clear example of this, look to a recent study by MIT researchers. The researchers found that roughly 3% of images were incorrectly labeled in some of the most popular public image datasets. For a dataset with millions of images, 3% is an incredibly high figure: in a dataset of one million images, that is about 30,000 mislabeled examples. If you are collecting real-world data for a commercial computer vision system, you will want to make sure that your data is correctly labeled. Otherwise, your training dataset may not be able to teach a machine learning model as well as you might hope.

There is an alternative to real-world data collection.

We call it synthetic data. Why? Because synthetic data is computer-generated with the aim of imitating real-world datasets. In our opinion, synthetic data offers a solution to the ailments of real-world data collection. Let’s go through them. First, it is possible to create a practically unlimited amount of synthetic data. Need 100,000 training images? No problem. Need ten million? No problem! Since synthetic data is computer-generated, we are only limited by the storage capacity of hard drives, which is rising every year.

What about real-world data that is challenging to collect? Can synthetic data help solve that, too? Yes! In a synthetic data simulation, the environment is completely controlled. That means we can overrepresent edge cases or other phenomena with relative ease. Even better, we can create situations in simulated environments that are challenging or even impossible to capture in the real world.
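
As a rough illustration of what "overrepresenting" edge cases can look like, the Python sketch below draws simulated scenarios from a weight table that the person running the simulation sets. The scenario names, the weights, and the sample_scenarios helper are all hypothetical and chosen only for this example. A real simulator would render an image for each draw, which this sketch does not attempt.

```python
import random
from collections import Counter

# Hypothetical scenario mix. In the real world the last scenario might be
# extremely rare; in a simulation we can simply give it 30% of the draws.
SCENARIO_WEIGHTS = {
    "clear_day": 0.4,
    "night_rain": 0.3,
    "pedestrian_in_road": 0.3,
}


def sample_scenarios(n, weights, seed=0):
    """Draw n scenario names according to the chosen weights."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


counts = Counter(sample_scenarios(10_000, SCENARIO_WEIGHTS))
print(counts)
```

The point of the design is that the share of each scenario is a configuration choice rather than something dictated by how often it happens to occur in the field.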

Synthetic data is also perfectly labeled for machine learning. Because the generator has complete ground-truth knowledge of what is happening in a simulated environment, a synthetic data generation system can apply pixel-perfect labels to each and every object in an image. This level of precision also allows for advanced labeling such as object segmentation, depth, and 3D bounding boxes.
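
To make the idea of labels that come "for free" concrete, here is a toy Python sketch. It is not how any particular synthetic data product works. It draws one bright rectangle on a blank grayscale image and records the bounding box at the same moment, so the label is exact by construction. The function name make_synthetic_sample, the image size, and the label format are assumptions for illustration.

```python
import random


def make_synthetic_sample(width=64, height=48, seed=0):
    """Generate a toy grayscale image containing one rectangle,
    together with its exact bounding-box label."""
    rng = random.Random(seed)

    # Choose the object's size and position; the generator knows these
    # values exactly, which is what makes the label pixel-perfect.
    box_w = rng.randint(5, width // 2)
    box_h = rng.randint(5, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    # Render: background pixels are 0, object pixels are 255.
    image = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 255

    # Bounding box as (x_min, y_min, x_max, y_max), max values exclusive.
    label = {"class": "rectangle", "bbox": (x0, y0, x0 + box_w, y0 + box_h)}
    return image, label


image, label = make_synthetic_sample(seed=1)
print(label)
```

In this toy case the rendered image also doubles as a segmentation mask, since every object pixel is known. No human ever draws a box, so there is no step where a labeling mistake could creep in.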

In summary, we highly recommend that machine learning researchers check out the capabilities and utility of synthetic data generation. It holds several advantages over real-world data collection, and in many cases can enhance the quality of machine learning training datasets. If you are interested in working with a synthetic data provider, please reach out to us here.