Synthetic Data Research | Data Mining Research

Synthetic data, data augmentation and data de-identification:

Large datasets that are diverse and representative (of the heterogeneity of phenotypes in the gender, ethnicity and geography of the individuals or patients, and in the healthcare systems, workflows and equipment used) are necessary to develop and refine best practices in evidence-based medicine involving artificial intelligence. To overcome the paucity of annotated medical data in real-world settings, synthetic data is increasingly used. Synthetic data can be created from perturbations using accurate forward models (that is, models that simulate outcomes given specific inputs), physical simulations or AI-driven generative models. Data augmentation is a separate but closely related field. It creates modified copies of existing training examples and is an essential part of training deep learning models. The motivation is that robust training of deep learning models depends on large annotated datasets, which are expensive to acquire, store and process.
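As an illustration of data augmentation, the following is a minimal sketch that perturbs a single image with a random flip, a random 90-degree rotation and Gaussian noise. The function name `augment`, the parameter `noise_std` and the random stand-in image are assumptions made for this example, not part of any OSRC pipeline. Whether a given transform is valid depends on the data; a horizontal flip, for instance, may be inappropriate where left and right carry clinical meaning.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly perturbed copy of a 2-D image with values in [0, 1]."""
    out = image.copy()
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        out = np.fliplr(out)
    # Rotate by 0, 90, 180 or 270 degrees, chosen uniformly.
    out = np.rot90(out, int(rng.integers(0, 4)))
    # Add zero-mean Gaussian noise, then clip back into the valid range.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical stand-in for one annotated image
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)
```

Each call produces a different variant of the same annotated example, so the label can be reused without additional annotation cost.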

Synthetic data is artificial data created by algorithms that mirror the statistical properties of the original data without revealing information about real people. It is generated to preserve privacy, to test systems or to create training data for machine learning algorithms. The generation method is critical because it largely determines the quality of the synthetic data; for example, synthetic data that can be reverse engineered to identify real records would not be useful for privacy enhancement.
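One very simple way to mirror statistical properties is to fit a multivariate Gaussian to the numeric columns of a table and sample new rows from it. The sketch below assumes this approach purely for illustration; the helper names `fit_gaussian` and `sample_synthetic` and the two-column stand-in table are hypothetical. Matching means and covariances does not by itself guarantee privacy, and practical generators are usually more elaborate.

```python
import numpy as np

def fit_gaussian(real):
    """Estimate the column means and covariance matrix of a numeric table."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n, rng):
    """Draw n synthetic rows from the fitted Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(42)
# Hypothetical stand-in for a real table with two correlated numeric columns.
real = rng.multivariate_normal(
    [50.0, 120.0], [[100.0, 60.0], [60.0, 225.0]], size=5000
)

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, 5000, rng)

# Compare a summary statistic between the real and synthetic tables.
corr_real = np.corrcoef(real, rowvar=False)[0, 1]
corr_synth = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {corr_real:.3f}, synthetic correlation: {corr_synth:.3f}")
```

No synthetic row is a copy of a real row, yet the correlation between the columns is approximately preserved.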

OSRC's synthetic data research matters for three reasons: privacy, product testing and training machine learning algorithms. Open Source Research has established a partnership with replicaAnalytics® to work on this subject.

Methods and applications:

Businesses face a trade-off between data privacy and data utility when selecting a privacy-enhancing technology. Therefore, they need to determine the priorities of their use case before investing. Synthetic data does not contain any personal information; it is sample data with a distribution similar to that of the original data. Though the utility of synthetic data can be lower than that of real data in some cases, there are also cases where synthetic data is almost as valuable as real data. Our research in synthetic data will:

Allow better and earlier use of datasets.
Experiment with datasets without exposing sensitive information.
Train algorithms on the datasets.
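
A common way to check the utility side of the trade-off is to train a model on synthetic data and test it on real data, then compare against the same model trained on real data. The sketch below assumes a toy linear dataset and a Gaussian synthesizer; all names (`make_real`, `synthesize`, `fit_linear`, `mse`) are hypothetical. Because the toy data is itself Gaussian, the synthetic data performs almost as well as the real data here; with real-world data the gap can be larger.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_real(n):
    """Hypothetical real table: two features and a noisy linear target."""
    x = rng.normal(0.0, 1.0, size=(n, 2))
    y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(0.0, 0.5, size=n)
    return np.column_stack([x, y])

def synthesize(table, n):
    """Sample synthetic rows from a Gaussian fitted to the table."""
    mean = table.mean(axis=0)
    cov = np.cov(table, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)

def fit_linear(table):
    """Least-squares fit of the last column on the others plus an intercept."""
    X = np.column_stack([table[:, :-1], np.ones(len(table))])
    coef, *_ = np.linalg.lstsq(X, table[:, -1], rcond=None)
    return coef

def mse(coef, table):
    """Mean squared prediction error of a fitted model on a table."""
    X = np.column_stack([table[:, :-1], np.ones(len(table))])
    return float(np.mean((X @ coef - table[:, -1]) ** 2))

train_real = make_real(2000)
test_real = make_real(1000)
train_synth = synthesize(train_real, 2000)

# Both models are evaluated on the same held-out real data.
mse_real = mse(fit_linear(train_real), test_real)
mse_synth = mse(fit_linear(train_synth), test_real)
print(f"trained on real: {mse_real:.3f}, trained on synthetic: {mse_synth:.3f}")
```

The difference between the two errors gives a rough, task-specific measure of how much utility is lost by replacing real data with synthetic data.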
