A Review of the MIT Synthetic Data Vault

The Synthetic Data Vault (SDV) is an MIT-licensed, open source project for generating tabular, relational, and time-series synthetic data. Ultimately, the project is aimed at helping researchers acquire non-visual training data.

This project is particularly useful for structured, record-style data. For example, in the context of medical records, the Synthetic Data Vault could be used to generate age, height, and weight data. However, it would not help with visual data, such as the X-rays or cancer screening images that computer vision systems analyze. Nevertheless, the Synthetic Data Vault is very promising.

To give a little background, the project is spearheaded by Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems. It is clear that Veeramachaneni is a believer in synthetic data: he is quoted in an MIT News article as saying “We’re just touching the tip of the iceberg.” Here at Simerse, we agree.

The underlying premise of the Synthetic Data Vault is that it trains a machine learning model on real data and then generates new synthetic data by sampling from that model. It is quite an ingenious way to produce synthetic data that is statistically representative of the real data while keeping the individual real records masked.
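The fit-then-sample idea can be illustrated with a deliberately tiny sketch. It uses only the Python standard library rather than SDV itself, the ages and heights are made-up values, and the helper names `fit` and `sample` are hypothetical. Each column is modeled as an independent normal distribution, which is far simpler than anything the Synthetic Data Vault does, but the workflow has the same shape: learn from real rows, then draw new ones.

```python
import random
from statistics import NormalDist

# Made-up "real" records, for illustration only.
real_rows = [
    {"age": 34, "height_cm": 170.0},
    {"age": 51, "height_cm": 165.5},
    {"age": 28, "height_cm": 181.2},
    {"age": 45, "height_cm": 158.9},
    {"age": 62, "height_cm": 174.3},
]


def fit(rows):
    """Learn one normal distribution per column from the real rows."""
    columns = rows[0].keys()
    return {c: NormalDist.from_samples([r[c] for r in rows]) for c in columns}


def sample(model, n, seed=0):
    """Draw n synthetic rows from the fitted per-column distributions."""
    rng = random.Random(seed)
    return [
        {c: rng.gauss(dist.mean, dist.stdev) for c, dist in model.items()}
        for _ in range(n)
    ]


model = fit(real_rows)
synthetic_rows = sample(model, 3)
```

Because the synthetic rows are drawn from the fitted distributions rather than copied, none of them needs to correspond to a real person. The obvious weakness of this toy version is that it ignores relationships between columns, which is the gap that richer models are meant to close.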

The Synthetic Data Vault is divided into several components:

Relational data. This tool helps a machine learning (ML) model learn how fields from different tables are related. To do this, the software uses a Hierarchical Modeling Algorithm that recursively walks through a relational database and applies tabular models to each table. This tool relies heavily on the DataFrame, the core table structure of pandas, a Python library.

Tabular data. With this tool, you can teach an ML model to synthesize tabular data, using models based on Copulas and GANs. One of these is a Gaussian Copula model (a copula is a multivariate distribution over the unit cube whose marginals are uniform), which captures how columns vary together. This component is also used to anonymize personally identifiable information.

Time series. The SDV uses a Probabilistic Autoregressive model to enable training on multivariate time-series data. It is worth noting that this component is still under active development.

Benchmarking. The vault also provides a component for assessing the different synthesis techniques against one another. Specifically, this component scores them with machine learning efficacy and Bayesian likelihood metrics.

Metrics. This component of the vault can be used to evaluate synthetic data. Essentially, this tool allows you to compare real data against synthetic data.
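To make the tabular and metrics components above more concrete, here is a minimal sketch of a two-column Gaussian copula followed by a simple real-versus-synthetic comparison. It is not SDV's implementation or API: it uses only the Python standard library, the age and weight values are made up, and the helper names (`pearson`, `normal_scores`, `quantile`) are hypothetical. The assumption is that each column's marginal is approximated by its empirical distribution with linear interpolation.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

# Made-up "real" columns, for illustration only.
real_age = [23, 31, 38, 44, 52, 59, 67, 74]
real_weight = [61.0, 58.5, 70.2, 68.9, 77.4, 74.1, 82.3, 80.6]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def normal_scores(values):
    """Map each value to (rank + 0.5) / n in the unit interval, then to a z-score."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def quantile(sorted_values, u):
    """Inverse empirical CDF with linear interpolation between sorted values."""
    pos = u * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


# 1. Fit: the copula parameter is the correlation of the normal scores.
rho = pearson(normal_scores(real_age), normal_scores(real_weight))

# 2. Sample: correlated normals -> uniforms -> values on each column's scale.
rng = random.Random(0)
age_sorted, weight_sorted = sorted(real_age), sorted(real_weight)
synth_age, synth_weight = [], []
for _ in range(1000):
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + sqrt(1 - rho**2) * rng.gauss(0, 1)
    synth_age.append(quantile(age_sorted, STD_NORMAL.cdf(z1)))
    synth_weight.append(quantile(weight_sorted, STD_NORMAL.cdf(z2)))

# 3. Evaluate: compare simple statistics of the real and synthetic data.
report = {
    "real_corr": pearson(real_age, real_weight),
    "synth_corr": pearson(synth_age, synth_weight),
    "real_mean_age": mean(real_age),
    "synth_mean_age": mean(synth_age),
}
```

Printing `report` shows whether the synthetic columns keep roughly the same average and the same age-to-weight relationship as the real ones. A dedicated metrics tool would go well beyond these two statistics, but the principle of placing real and synthetic data side by side is the same.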

The Synthetic Data Vault also has a lot of helpful resources for learning how to use these tools. The website offers tutorials hosted on GitHub, has a Slack community, and provides thorough documentation. There are also links to academic papers which reference the Synthetic Data Vault for people who enjoy reading research papers. Ultimately, the Synthetic Data Vault offers a compelling way to acquire tabular data for machine learning.

At Simerse, we like the Synthetic Data Vault because it solves a problem that we do not address: tabular data. As you may know, Simerse is focused on synthetic data for computer vision. This type of image and video-based synthetic data is different from tabular synthetic data. And while we obviously believe strongly in our solution for computer vision researchers, there are many researchers and AI developers who will also benefit from the Synthetic Data Vault’s focus on tabular data.

Tabular data is widely used in many industries. Moreover, personally identifiable information such as name, age, height, and weight is also tabular. This type of information can be challenging to work with for AI researchers: for example, collecting this data may be prohibited, or handling this data may be subject to additional precautions. Synthetic data offers a way to obscure and protect the underlying information while still adequately training AI systems.

In a world where AI systems continue to advance in both the quality of the technology and the number of applications, synthetic tabular data will be increasingly valuable to researchers and developers. Most AI models require large, well-curated datasets, and therefore the ability to train AI while overcoming thorny PII issues is highly useful.

One of the major areas where synthetic, tabular data may help researchers is in the healthcare industry. Many AI practitioners are looking to bring artificial intelligence and machine learning to healthcare, but often run into PII issues around healthcare data. Data generation tools like MIT’s Synthetic Data Vault may help AI researchers avoid those issues.

Likewise, Simerse can help computer vision researchers acquire the visual training data they need to power healthcare AI and avoid PII issues entirely, since Simerse generates 100% synthetic data for AI and machine learning.

We hope you found this short summary of MIT’s Synthetic Data Vault helpful, and we encourage you to head over to the website here to check it out for yourself. If you’re interested in training data for computer vision, send us a message here.