Title: Synthetic Tabular Data Generation in the Medical Domain
Duration: 01.01.2023 – 31.12.2025
Research Area: Medical Informatics, Synthetic Data, Generative AI
In clinical research, acquiring large-scale, high-quality patient data is challenging due to cost, privacy, and regulatory issues. The project “Synthetic Tabular Data Generation in the Medical Domain” addresses these obstacles by using current generative AI methods, including CTAB-GAN+ and Normalizing Flows (NFlow), to create synthetic datasets from clinical trial patient data. We aim to provide data that replicates key characteristics of the patient data and preserves inter-variable relationships, so that it supports genuine exploratory analysis. To that end, the synthetic cohort must closely mirror the original survival curves while still protecting privacy by preventing patient re-identification. This work will provide models for synthetic data generation in medicine and will offer access to the generated datasets for research.
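The requirement that synthetic survival curves mirror the original ones can be checked in several ways. The following is a minimal sketch, not the project's evaluation code: it computes Kaplan-Meier curves for two cohorts and reports the largest vertical gap between them. The function names and the toy follow-up values are assumptions made up for illustration.

```python
from typing import List, Sequence, Tuple

Curve = List[Tuple[float, float]]


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> Curve:
    """Return the Kaplan-Meier curve as (event_time, survival_probability) steps.

    times  -- follow-up time per patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve: Curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - observed / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve: Curve, t: float) -> float:
    """Step-function lookup: estimated survival probability at time t."""
    prob = 1.0
    for step_time, step_prob in curve:
        if step_time > t:
            break
        prob = step_prob
    return prob


def max_curve_gap(curve_a: Curve, curve_b: Curve) -> float:
    """Largest absolute difference between two curves, checked at every step time."""
    grid = sorted({t for t, _ in curve_a} | {t for t, _ in curve_b})
    return max(
        (abs(survival_at(curve_a, t) - survival_at(curve_b, t)) for t in grid),
        default=0.0,
    )


# Toy values invented for illustration (months of follow-up, event flags).
real_times, real_events = [5, 8, 12, 12, 20, 24], [1, 0, 1, 1, 0, 1]
synth_times, synth_events = [4, 9, 12, 15, 21, 24], [1, 1, 0, 1, 0, 1]

real_curve = kaplan_meier(real_times, real_events)
synth_curve = kaplan_meier(synth_times, synth_events)
print(f"max gap between curves: {max_curve_gap(real_curve, synth_curve):.3f}")
```

A small gap suggests the synthetic cohort tracks the original curve closely. In practice a formal comparison such as a log-rank test would typically complement this kind of descriptive check.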
The research problem we address in the Medical Informatics domain is how to use patient data safely and effectively for clinical research. This includes overcoming the high costs, privacy concerns, and regulatory constraints associated with accessing real patient data. Our methodology for synthetic data generation, applied in a medical scenario involving acute myeloid leukemia patients, exemplifies one solution to this problem.
Our currently available Zenodo dataset comprises synthetic patient data for acute myeloid leukemia, generated with two generative AI methods: CTAB-GAN+ and Normalizing Flows (NFlow). The data is based on patients treated in four multi-center clinical trials and includes 1606 synthetic patients for each model. It offers a valuable resource for medical research, particularly for exploring the application of AI to healthcare data analysis.
The generative AI methods we use include CTAB-GAN+ and Normalizing Flows (NFlow). Both are designed to create realistic synthetic patient data that replicates essential characteristics of the original data while preserving relationships between variables.
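One simple way to inspect whether relationships between variables are preserved is to compare pairwise correlations in the real and synthetic tables. The sketch below is an illustration under stated assumptions, not the project's evaluation method: it handles numeric columns only, uses Pearson correlation, and the column names (age, wbc, os_months) and all values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long numeric columns.

    Assumes both columns have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_gaps(
    real: Dict[str, List[float]], synth: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Absolute difference in Pearson r for every pair of shared columns."""
    shared = sorted(set(real) & set(synth))
    return {
        (a, b): abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
        for a, b in combinations(shared, 2)
    }


# Hypothetical columns and values, invented for illustration only.
real = {
    "age": [34, 51, 62, 70, 45],
    "wbc": [12.0, 30.5, 8.2, 45.1, 20.3],
    "os_months": [60, 24, 18, 6, 36],
}
synth = {
    "age": [36, 49, 65, 68, 47],
    "wbc": [14.1, 27.9, 10.0, 41.7, 22.5],
    "os_months": [55, 28, 15, 9, 33],
}

gaps = correlation_gaps(real, synth)
for pair, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(pair, round(gap, 3))
```

Pairs with large gaps point to relationships the generator did not reproduce well. Categorical variables, which are common in clinical tables, would need a different association measure.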
By successfully generating synthetic patient data, the project “Synthetic Tabular Data Generation in the Medical Domain” will help pave the way for overcoming privacy and data accessibility issues in medical studies. This could lead to more robust and diverse clinical research, especially for rare diseases, without compromising patient privacy. The approach could be extended to other areas of medical research, broadening the scope and depth of studies and cohorts while ensuring data security and regulatory compliance.