OSRC Research

Synthetic data, data augmentation and data de-identification:

Large datasets that are diverse and representative (of the heterogeneity of phenotypes across the gender, ethnicity and geography of the individuals or patients, and of the healthcare systems, workflows and equipment used) are necessary to develop and refine best practices in evidence-based medicine involving artificial intelligence. To overcome the paucity of annotated medical data in real-world settings, synthetic data are increasingly being used. Synthetic data can be created from perturbations using accurate forward models (that is, models that simulate outcomes given specific inputs), physical simulations or AI-driven generative models. Data augmentation is a distinct but closely related field. It is an essential part of the training process for deep learning models, because robust training depends on large annotated datasets, which are expensive to acquire, store and process.
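
As an illustration of augmentation by perturbation, the following minimal Python sketch adds small Gaussian noise to numeric records to enlarge a training set. The function name `augment`, its parameters and the sample values are hypothetical and chosen only for this example; real medical data usually needs augmentation that respects clinical constraints, which this sketch does not attempt.

```python
import random


def augment(records, noise_scale=0.05, copies=3, seed=0):
    """Return the original records plus noisy copies of each one.

    records: list of lists of floats (hypothetical numeric features).
    noise_scale: noise standard deviation as a fraction of each value's magnitude.
    copies: number of perturbed copies generated per original record.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    augmented = [list(r) for r in records]  # keep the originals first
    for record in records:
        for _ in range(copies):
            augmented.append(
                [x + rng.gauss(0.0, noise_scale * abs(x)) for x in record]
            )
    return augmented


# Made-up measurements (for example age and systolic pressure); not real patient data.
originals = [[54.0, 120.0], [61.0, 135.0]]
augmented = augment(originals)
print(len(augmented), "records after augmentation")
```

Scaling the noise to each value keeps the perturbed copies close to the originals; the appropriate noise level is a modelling choice that depends on the data.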

Aims:

Synthetic data is artificial data created by algorithms so that it mirrors the statistical properties of the original data but does not reveal any information about real people. It is generated to preserve privacy, to test systems or to create training data for machine learning algorithms. The generation method is critical because it determines the quality of the synthetic data; for example, synthetic data that can be reverse engineered to identify real data would not be useful for privacy enhancement.
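
To make the idea of "mirroring statistical properties" concrete, here is a deliberately simple Python sketch: it estimates the mean and standard deviation of each column of a small table and then samples new rows from independent normal distributions. This is an assumption made for illustration only, not a description of the method used by OSRC or ReplicaAnalytics. It ignores correlations between columns and offers no formal privacy guarantee; the names `fit_marginals` and `sample_synthetic` and the sample values are hypothetical.

```python
import random
import statistics


def fit_marginals(rows):
    """Estimate (mean, standard deviation) for every column of the table."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


# Made-up values standing in for an original dataset; not real patient data.
real_rows = [[54.0, 120.0], [61.0, 135.0], [47.0, 118.0], [70.0, 142.0]]
params = fit_marginals(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)

# The synthetic column means should be close to the original ones.
synthetic_means = [statistics.mean(c) for c in zip(*synthetic_rows)]
print("original means: ", [mu for mu, _ in params])
print("synthetic means:", synthetic_means)
```

No synthetic row is copied from the original table, yet the column-level statistics are preserved approximately, which is the property the paragraph above describes.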

Our work with synthetic data is important for three reasons: privacy, product testing and training machine learning algorithms. We have established a partnership with ReplicaAnalytics to work on this subject.

Methods and applications:

Businesses face a trade-off between data privacy and data utility when selecting a privacy-enhancing technology, so they need to determine the priorities of their use case before investing. Synthetic data does not contain any personal information; it is sample data with a distribution similar to that of the original data. Although the utility of synthetic data can be lower than that of real data in some cases, there are also cases where synthetic data is almost as valuable as real data. Our research in synthetic data will:

  • Allow better and earlier use of datasets.
  • Allow experimentation with datasets without exposing sensitive information.
  • Allow algorithms to be trained on the datasets.
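
The trade-off between privacy and utility described above can be checked with simple measurements. The Python sketch below uses two crude, hypothetical indicators chosen for this example: the largest gap between column means as a stand-in for utility, and the fraction of synthetic rows that are verbatim copies of real rows as a stand-in for privacy risk. Real evaluations use richer metrics; all names and values here are assumptions.

```python
import statistics


def mean_gap(real, synthetic):
    """Largest absolute difference between column means (smaller suggests higher utility)."""
    real_means = [statistics.mean(c) for c in zip(*real)]
    synth_means = [statistics.mean(c) for c in zip(*synthetic)]
    return max(abs(r - s) for r, s in zip(real_means, synth_means))


def exact_match_rate(real, synthetic):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    real_set = {tuple(r) for r in real}
    return sum(tuple(s) in real_set for s in synthetic) / len(synthetic)


# Made-up values; not real patient data.
real = [[54.0, 120.0], [61.0, 135.0], [47.0, 118.0]]
copied = [list(r) for r in real]                        # perfect utility, no privacy
shifted = [[55.0, 121.0], [60.0, 133.0], [48.0, 119.0]]  # slightly perturbed rows

for name, candidate in [("copied", copied), ("shifted", shifted)]:
    print(name, mean_gap(real, candidate), exact_match_rate(real, candidate))
```

A dataset that simply copies the original scores perfectly on utility but leaks every record, while the perturbed dataset gives up a little utility in exchange for containing no verbatim copies.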

Using augmented data to explore risk factors for local regional recurrence in colon cancer

The project is carried out in collaboration with ReplicaAnalytics. The first article based on this project has been submitted.

Using data de-identification to increase data sharing among researchers

A systematic review has been conducted by the OSRC team and submitted for publication.

Contact us if you have knowledge of or experience with the subject and wish to get involved.

Using synthetic data to train data mining algorithms on healthcare data

The project is in the design phase.

Contact us if you have knowledge of or experience with the subject and wish to get involved.

Open Source Research is a sandbox

Synthetic data research is in its infancy. OSRC would like to explore this area of research to develop new tools for medical research.

There are hundreds of open-source datasets to be explored in open source medical research. These datasets can be used to train algorithms and create research opportunities for medical as well as IT students.

Developing skills in data mining research requires long training and a sandbox of data to experiment with.

ReplicaAnalytics is one of OSRC’s partners. This firm works mainly with synthetic data. 

ReplicaAnalytics

Where do we stand now and where are we heading?

We have built up the background knowledge. We are working to establish partnerships and raise funds for researchers and research equipment.