Synthetic Data in an Era of Data Privacy

Artificial Intelligence (AI) is being hailed as the breakthrough technology of this century. AI relies on training data to learn and recognize patterns, but collecting that data raises privacy concerns. It is hard to anonymize faces in image datasets. It is also challenging to limit a human annotator’s access to personal data. And in an era of data breaches, real datasets must be stored securely. Synthetic data offers a solution: training data that never contains personal information in the first place.

AI is a scary thing for many people. Its lack of transparency raises concerns, especially with regulators. From collecting massive training sets to analyzing photos, it can feel a bit ‘Big Brother’.

Fortunately, a new technology called synthetic data may ease these concerns. Synthetic data is artificially generated data that mimics real data, and it can train AI models while exceeding data privacy expectations.

Example: collecting data on neighborhood streets

Picture of neighborhood

Let’s say you want to identify cars in your neighborhood. You take photographs of your neighborhood, label the cars, and train your AI model. But think about it: there are some unexpected privacy concerns here. Do your neighbors feel comfortable with you taking pictures of their property? Can you securely store this data? How do you mask personally identifiable information?


These are questions that may be challenging to answer. Privacy laws are constantly changing, and it can be difficult to stay up to date on the latest regulations. But what if there were a way to stay ahead of privacy laws while still training an accurate AI/ML model? There is: it’s called synthetic data.


Instead of collecting real photographs, you build a virtual 3D neighborhood and use it to generate training data. Because every image is rendered rather than photographed, no real people or property ever appear in the dataset, which sidesteps the privacy concerns above. That is how synthetic data can solve privacy concerns.
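The workflow above can be sketched in miniature. The snippet below is a toy stand-in for a real 3D renderer: it “renders” a top-down street scene as a grid of pixels and emits the bounding-box label for free, because the generator placed the car itself. `render_scene` and its pixel layout are illustrative assumptions, not a real rendering API.

```python
import random

def render_scene(width=64, height=64, car_w=12, car_h=6, seed=None):
    """Render a toy top-down 'street' and auto-generate its label.

    The image is a 2D grid of pixel values: 0 = road, 1 = car.
    Because the generator places the car itself, the bounding box is
    known exactly -- no human annotator ever touches the image, and no
    real person or property appears in it. (Toy stand-in for a 3D engine.)
    """
    rng = random.Random(seed)
    x = rng.randrange(0, width - car_w)
    y = rng.randrange(0, height - car_h)
    image = [[0] * width for _ in range(height)]
    for row in range(y, y + car_h):
        for col in range(x, x + car_w):
            image[row][col] = 1
    label = {"class": "car", "bbox": [x, y, x + car_w, y + car_h]}
    return image, label

image, label = render_scene(seed=42)
print(label["class"], label["bbox"])
```

A real pipeline would swap the grid for a game-engine render, but the principle is the same: the scene description is the ground truth, so labels come out perfect by construction.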


Real data has privacy concerns


Consumers are increasingly concerned about the privacy of their data. A 2019 study by Tealium found that half of respondents do not feel informed about how their data is used, and 91% want governments to impose stricter privacy regulations. People are demanding privacy, and governments will regulate the collection of personal data. This will make training data for AI much harder to collect. In some cases, government regulation may even invalidate existing datasets.


Real training sets would lose much of their value under such regulations. A government mandate to blur pedestrians would compromise self-driving car training sets, and a requirement to blur faces would cripple facial recognition software.


With real image training sets, there is no way to mask people or property without altering the data itself. Self-driving companies cannot blur pedestrians in their training sets, and drone delivery companies cannot blur front porches, without degrading them. Any alteration of the core training set may hinder the AI algorithm, so companies must balance data privacy against data accuracy. As data regulations increase, this trade-off will only grow in importance.


GDPR privacy regulations

In 2018, the European Union introduced its privacy law, the General Data Protection Regulation (GDPR). It was the first major step towards data privacy regulation by a large governmental body. The law gives people the right to object to the processing of personal data. Article 21 of the GDPR states: “The data subject shall have the right to object, on grounds relating to his or her particular situation, at any time to processing of personal data concerning him or her which is based on point (e) or (f) of Article 6(1), including profiling based on those provisions.”

GDPR logo

What this means for AI is that people in the EU have a right to object to being in a training set. For companies operating in Europe, this presents a significant challenge. It can be nearly impossible to identify and remove a single individual from a training set. If John Doe objects to being in a training set, it can be challenging to locate and remove his single photograph. Training sets can be massive, and finding a single training image is like finding a needle in a haystack.


Autonomous vehicle companies are not the only ones facing greater privacy regulation; all kinds of computer vision applications may be affected. As the use cases of computer vision expand, regulators will raise more questions. Many observers see GDPR as a trailblazer for privacy regulations, and other governments and jurisdictions are likely to announce similar rules. As these regulations spread, collecting real training data becomes less attractive, and companies will likely turn to generating synthetic data to avoid collecting personal information.


Cleansing real data can be challenging


Many companies think they can comply with GDPR by cleansing or anonymizing their data. In reality, this is challenging, especially for computer vision. For one, manually cleansing data takes many labor hours: human checkers must examine each training image for personally identifying information. And what if a checker misses an object? A single noncompliant training image could expose you to GDPR penalties in Europe, and with similar regulations beginning to take shape worldwide, this is a real risk.


Anonymizing data has negative implications for an AI model as well. Blurring, masking, or altering an image may make it less effective for AI training. With this much hassle due to privacy regulations, it is worth exploring alternatives.
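To see why anonymization and model accuracy are in tension, consider the naive fix of overwriting a sensitive region. The sketch below uses a hypothetical `mask_region` helper (not a real library call) on a toy grayscale image: the masked region becomes compliant, but every visual feature a model could have learned from it is erased.

```python
def mask_region(image, bbox, fill=0):
    """Anonymize a region by overwriting it -- the naive GDPR 'fix'.

    image: 2D list of grayscale pixel values; bbox: (x0, y0, x1, y1).
    Returns a copy with the region replaced by a constant fill value,
    which also destroys any detail a model could have learned there.
    """
    x0, y0, x1, y1 = bbox
    out = [row[:] for row in image]  # copy so the original is untouched
    for r in range(y0, y1):
        for c in range(x0, x1):
            out[r][c] = fill
    return out

# A 4x4 'image' with a distinctive 2x2 pattern in the top-left corner.
img = [[9, 7, 0, 0],
       [5, 3, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
anon = mask_region(img, (0, 0, 2, 2))
print(anon[0][:2], anon[1][:2])  # prints [0, 0] [0, 0] -- the pattern is gone
```

Blurring instead of filling softens the loss but does not eliminate it; either way, the anonymized pixels carry less signal than the originals.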


Synthetic data can solve this issue

One of the major advantages of synthetic data is that it can be GDPR-compliant by design: because the training data is generated rather than collected, there are no human “data subjects” at all.

For computer vision, this is a gamechanger. Computer-generated content is already realistic: consider AAA video games or animated movies, which are fictional but look real. Synthetic data for AI applies those same computer graphics techniques to generate training sets. With this, we can create training sets that are realistic and free of data privacy concerns.

Photorealistic training image of a truck.

Synthetic generation offers a GDPR-friendly way to obtain image training data, and future regulations limiting real training data should not affect synthetically generated data.


Should you invest in Synthetic or Real data?


Companies can choose whether to collect real data or generate synthetic data. Training data is the lifeblood of AI, so this is a critical decision. If your company is spending more than $10,000 on training data, we recommend synthetic data.


Synthetic data has clear advantages over real data: it can be automatically annotated, perfectly labeled, and generated in unlimited quantities. Those three reasons alone are compelling, and the legal benefits are the cherry on top: without personal information, regulators have little basis to restrict synthetic data. We recommend synthetic data to train your AI algorithms.
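The “unlimited quantities” point follows directly from the fact that samples are generated rather than collected. As a minimal sketch, the hypothetical generator below yields an endless stream of perfectly labeled (features, label) pairs; the single `length_m` feature and the car/bicycle classes are placeholder assumptions, not a real schema.

```python
import itertools
import random

def synthetic_samples(seed=0):
    """Yield an endless stream of (features, label) training pairs.

    Each sample is generated, not collected, so the stream never runs
    out and every label is exact by construction. The feature layout
    here is a placeholder for whatever a real renderer would produce.
    """
    rng = random.Random(seed)
    while True:
        is_car = rng.random() < 0.5
        # Cars are drawn longer than bicycles, so the label is known
        # with certainty at generation time -- no annotation step.
        size = rng.uniform(3.5, 5.5) if is_car else rng.uniform(0.5, 2.0)
        yield {"length_m": round(size, 2)}, "car" if is_car else "bicycle"

batch = list(itertools.islice(synthetic_samples(), 4))
print(len(batch))  # prints 4 -- and we could just as easily ask for 4 million
```

Because the generator is seeded, the same dataset can also be reproduced exactly on demand, something real-world collection can never guarantee.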


How do you use synthetic data?

We get it: synthetic data is great, but how do you actually use it? We have the answer: work with Simerse. We believe everyone should be able to access synthetic data, and at Simerse we strive to help your company benefit from it.


Conclusion


Synthetic data is a way to future-proof your training sets. As privacy regulations become stricter, training sets will receive more scrutiny. Remember, we’re not lawyers or legal counsel, but we think these are important things to consider. We’re excited to help you adopt, implement, and benefit from synthetic data!


If you’re interested in working with Simerse for your synthetic data, get in touch here!