Large datasets that are diverse and representative (of the heterogeneity of phenotypes in the gender, ethnicity and geography of the individuals or patients, and in the healthcare systems, workflows and equipment used) are necessary to develop and refine best practices in evidence-based medicine involving artificial intelligence. To overcome the paucity of annotated medical data in real-world settings, synthetic data are being increasingly used. Synthetic data can be created from perturbations using accurate forward models (that is, models that simulate outcomes given specific inputs), physical simulations or AI-driven generative models. Data augmentation is a separate but closely related field, and it is an essential part of training deep learning models. The motivation is that robust training of deep learning models depends on large annotated datasets, which are expensive to acquire, store and process.
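As an illustration of the perturbation idea mentioned above, the following is a minimal Python sketch of noise-based data augmentation. It is not the method used in our projects; the function name augment_with_noise, the toy samples and the noise scale are all hypothetical, and the sketch assumes that small perturbations of the features do not change the label.

```python
import random


def augment_with_noise(samples, copies, noise_scale, rng):
    """Return the original (features, label) pairs plus noisy copies.

    Each copy adds zero-mean Gaussian noise to every feature and keeps the
    label unchanged. This assumes small perturbations do not alter the label.
    """
    augmented = list(samples)
    for features, label in samples:
        for _ in range(copies):
            noisy = [value + rng.gauss(0.0, noise_scale) for value in features]
            augmented.append((noisy, label))
    return augmented


# Hypothetical toy dataset: two samples with two numeric features each.
rng = random.Random(42)
toy = [([0.2, 1.5], 0), ([0.9, 0.3], 1)]
bigger = augment_with_noise(toy, copies=3, noise_scale=0.05, rng=rng)
print(len(bigger))  # 2 originals + 2 * 3 noisy copies = 8
```

In practice the appropriate perturbation depends on the data type (for example rotations or intensity shifts for images), and the noise scale would need to be chosen so that augmented samples remain clinically plausible.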
Synthetic data are artificial data created by algorithms that mirror the statistical properties of the original data without revealing any information about real people. Synthetic data are generated to preserve privacy, to test systems or to create training data for machine learning algorithms. The generation method is critical because it largely determines the quality of the synthetic data; for example, synthetic data that can be reverse engineered to identify real records would not be useful for privacy enhancement.
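To make the idea of "mirroring statistical properties" concrete, here is a minimal Python sketch that fits a two-variable Gaussian model to a dataset and samples new records from it. This is a deliberately simple stand-in, not the generator used in our projects or by our partners. The function names and the toy "real" data (which are themselves simulated here) are hypothetical, and fitting summary statistics in this way does not by itself provide any formal privacy guarantee.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    rho = max(-1.0, min(1.0, cov / (sd_x * sd_y)))
    return mean_x, mean_y, sd_x, sd_y, rho


def sample_synthetic(params, n, rng):
    """Draw n synthetic (x, y) records from the fitted Gaussian model."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = mean_x + sd_x * z1
        # Mixing z1 and z2 reproduces the fitted correlation between x and y.
        y = mean_y + sd_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        records.append((x, y))
    return records


# Hypothetical stand-in for a real dataset: (age, measurement) pairs.
rng = random.Random(0)
ages = [rng.gauss(60.0, 10.0) for _ in range(2000)]
real = [(age, 0.5 * age + rng.gauss(0.0, 5.0)) for age in ages]

params = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(params, 5000, random.Random(1))
print("fitted correlation:", round(params[4], 3))
print("synthetic correlation:", round(fit_bivariate_gaussian(synthetic)[4], 3))
```

No synthetic record corresponds to a specific original record, yet the means, spreads and correlation of the two columns are approximately preserved. Real clinical data usually need richer models that handle categorical variables, skewed distributions and many columns.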
Our work with synthetic data is important for three reasons: privacy, product testing and training machine learning algorithms. We have established a partnership with ReplicaAnalytics to work on this subject.
Methods and applications:
Businesses face a trade-off between data privacy and data utility when selecting a privacy-enhancing technology, so they need to determine the priorities of their use case before investing. Synthetic data do not contain any personal information; they are sample data with a distribution similar to that of the original data. Although the utility of synthetic data can be lower than that of real data in some cases, there are also cases where synthetic data are almost as valuable as real data. Our research in synthetic data includes the following projects:
Using augmented data to explore risk factors for the local regional recurrence in colon cancer
This project is carried out in collaboration with ReplicaAnalytics. The first article based on it has been submitted.
Using data de-identification to increase data sharing among researchers
A systematic review has been conducted by the OSRC team and submitted for publication.
Contact us if you have knowledge of or experience with this subject and wish to get involved.
Where do we stand now and where are we heading?
We have built up the background knowledge. We are now working to establish partnerships and to raise funds for researchers and research equipment.