Synthetic Tabular Data Generation in the Medical Domain

Title: Synthetic Tabular Data Generation in the Medical Domain

Duration: 01.01.2023 – 31.12.2025

Research Area: Medical Informatics, Synthetic data, Generative AI

In clinical research, acquiring large-scale, high-quality patient data is challenging due to cost, privacy, and regulatory constraints. The project “Synthetic Tabular Data Generation in the Medical Domain” addresses these challenges by using current generative AI methods, including CTAB-GAN+ and Normalizing Flows (NFlow), to create synthetic datasets from clinical trial patient data. We aim to provide data that replicates key characteristics of the patient data and maintains inter-variable relationships, so that it can support meaningful exploratory analysis. To this end, the synthetic cohort needs to closely mirror the original survival curves while still protecting privacy by preventing patient re-identification. This work will provide models for synthetic data generation in medicine and will offer access to the generated datasets for research.
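To illustrate what "mirroring the original survival curves" can mean in practice, the following is a minimal sketch that estimates Kaplan-Meier curves for a real and a synthetic cohort and reports the largest gap between them. It is not the project's evaluation code: the function names, the toy follow-up times, and the use of the maximum absolute gap as a summary are assumptions made purely for illustration.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if the event (e.g. death) was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def survival_at(curve, t):
    """Step-function lookup of the survival probability at time t."""
    prob_at_t = 1.0
    for time, prob in curve:
        if time > t:
            break
        prob_at_t = prob
    return prob_at_t


def max_curve_gap(curve_a, curve_b):
    """Largest absolute difference between two step curves (hypothetical summary)."""
    times = {t for t, _ in curve_a} | {t for t, _ in curve_b}
    return max(
        (abs(survival_at(curve_a, t) - survival_at(curve_b, t)) for t in times),
        default=0.0,
    )


# Toy follow-up data in months (invented for illustration, not project data).
real_months, real_event = [5, 8, 12, 12, 20, 24], [1, 1, 0, 1, 1, 0]
synth_months, synth_event = [4, 9, 12, 15, 20, 24], [1, 1, 1, 0, 1, 0]

real_curve = kaplan_meier(real_months, real_event)
synth_curve = kaplan_meier(synth_months, synth_event)
gap = max_curve_gap(real_curve, synth_curve)
print(f"largest gap between survival curves: {gap:.3f}")
```

A small gap indicates that the synthetic cohort follows the real survival curve closely. In practice, a formal comparison such as a log-rank test would likely complement a simple summary like this one.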

Problem and Aims

The project addresses a central research problem in Medical Informatics: how to use patient data safely and effectively for clinical research. This includes overcoming the high costs, privacy concerns, and regulatory constraints associated with accessing real patient data. We develop methodologies for synthetic data generation as one solution to this problem and demonstrate them in a medical application scenario involving acute myeloid leukemia patients.

Practical example created during the project

Our Zenodo dataset currently comprises synthetic data for 1606 acute myeloid leukemia patients per model, generated using two generative AI methods: CTAB-GAN+ and Normalizing Flows (NFlow). The data is based on patients treated in four multi-center clinical trials. It offers a resource for medical research, particularly for exploring the application of AI in healthcare data analysis.

Technology

We use generative AI methods, including CTAB-GAN+ and Normalizing Flows (NFlow). Both are designed to create realistic synthetic patient data that replicates essential characteristics of the original data while preserving relationships between variables.
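One simple way to check whether relationships between variables are preserved is to compare pairwise correlations in the real and the synthetic table. The sketch below assumes purely numeric columns and uses Pearson correlation. The column names and values are hypothetical, and this is an illustration rather than the evaluation used in the project.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_gaps(real, synthetic):
    """Absolute difference in correlation for every pair of columns.

    real, synthetic: dicts mapping the same column names to lists of numbers.
    """
    columns = sorted(real)
    gaps = {}
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            gaps[(a, b)] = abs(
                pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b])
            )
    return gaps


# Hypothetical columns and values, invented for illustration only.
real = {
    "age": [34, 51, 62, 70, 45],
    "wbc": [12.0, 30.5, 8.2, 55.1, 20.3],
    "hemoglobin": [9.8, 8.1, 10.4, 7.2, 9.0],
}
synthetic = {
    "age": [36, 49, 65, 68, 47],
    "wbc": [14.1, 28.0, 9.5, 50.2, 22.7],
    "hemoglobin": [9.5, 8.4, 10.1, 7.6, 8.8],
}

gaps = correlation_gaps(real, synthetic)
for pair, gap in gaps.items():
    print(pair, round(gap, 3))
```

Gaps close to zero suggest that the synthetic data reproduces the linear dependencies of the original. Categorical variables would need a different association measure.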

Outlook

By successfully generating synthetic patient data, the project “Synthetic Tabular Data Generation in the Medical Domain” will help pave the way for overcoming privacy and data accessibility issues in medical studies. This could lead to more robust and diverse clinical research, especially for rare diseases, without compromising patient privacy. The approach could be extended to other areas of medical research, broadening the scope and depth of studies and cohorts while ensuring data security and regulatory compliance.

Publications

ConvGeN: A convex space learning approach for deep-generative oversampling and imbalanced classification of small tabular datasets. Kristian Schultz, Saptarshi Bej, Waldemar Hahn, Markus Wolfien, Prashant Srivastava, Olaf Wolkenhauer (Pattern Recognition 2024) https://doi.org/10.1016/j.patcog.2023.110138
Mimicking Clinical Trials with Synthetic Acute Myeloid Leukemia Patients Using Generative Artificial Intelligence. Jan-Niklas Eckardt, Waldemar Hahn, Christoph Röllig, Sebastian Stasik, Uwe Platzbecker, Carsten Müller-Tidow, Hubert Serve, Claudia D. Baldus, Christoph Schliemann, Kerstin Schäfer-Eckart, Maher Hanoun, Martin Kaufmann, Andreas Burchert, Christian Thiede, Johannes Schetelig, Martin Sedlmayr, Martin Bornhäuser, Markus Wolfien, Jan Moritz Middeke (Pre-print 2023) https://www.medrxiv.org/content/10.1101/2023.11.08.23298247v1
Word2Vec embeddings for categorical values in synthetic tabular generation. Waldemar Hahn, Martin Sedlmayr, Markus Wolfien (Conference: The 2022 International Conference on Computational Science and Computational Intelligence (CSCI)) http://dx.doi.org/10.1109/CSCI58124.2022.00201
Contribution of Synthetic Data Generation towards an Improved Patient Stratification in Palliative Care. Waldemar Hahn, Katharina Schütte, Kristian Schultz, Olaf Wolkenhauer, Martin Sedlmayr, Ulrich Schuler, Martin Eichler, Saptarshi Bej, Markus Wolfien (Journal of Personalized Medicine 2022) https://www.mdpi.com/2075-4426/12/8/1278

Team

Lead

Dr. Markus Wolfien

Team Members

Waldemar Hahn

Funded by the German Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung).

ScaDS.AI Dresden/Leipzig (Center for Scalable Data Analytics and Artificial Intelligence) is a center for Data Science, Artificial Intelligence and Big Data with locations in Dresden and Leipzig.