Probabilistic causal graphs as categorical data synthesizers: Do they do better than Gaussian Copulas and Conditional Tabular GANs?
Olha Shaposhnyk, Noor Abid, Mouri Zakir, Svetlana Yanushkevich · April 15, 2025
Summary
This study evaluates synthetic categorical data generation with probabilistic causal graphs (Bayesian Networks), using accessibility services data for people with disabilities. It compares Bayesian Networks against Gaussian copulas and CTGAN, and finds that Bayesian Networks perform best on both statistical metrics and privacy. The approach strengthens data privacy, which is valuable for research on sensitive populations. The resulting synthetic data reflects real-world patterns, so it can support analysis while keeping individual records confidential.
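To make the idea of a probabilistic causal graph as a categorical synthesizer concrete, here is a minimal sketch of ancestral sampling from a two-node Bayesian network. The variable names (`age_group`, `service_type`), the graph structure, and every probability are hypothetical and chosen only for illustration; none of them come from the study or its dataset.

```python
import random

# Hypothetical two-variable network: age_group -> service_type.
# All names and probabilities are illustrative assumptions, not study values.
PRIOR = {"under_40": 0.4, "40_plus": 0.6}
CPT = {
    "under_40": {"transport": 0.7, "home_care": 0.3},
    "40_plus": {"transport": 0.2, "home_care": 0.8},
}


def sample_categorical(dist, rng):
    """Draw one category from a {category: probability} mapping."""
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def sample_rows(n, seed=0):
    """Ancestral sampling: draw the parent first, then the child given the parent."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = sample_categorical(PRIOR, rng)
        service = sample_categorical(CPT[age], rng)
        rows.append({"age_group": age, "service_type": service})
    return rows


synthetic = sample_rows(1000)
```

In practice the structure and conditional probability tables would be learned from the real data rather than written by hand; the sketch only shows how sampling proceeds once they are known.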
Introduction
Background
Overview of accessibility services for people with disabilities
Importance of data in understanding and improving accessibility services
Objective
To evaluate synthetic categorical data generation methods using probabilistic causal graphs
To compare Gaussian copulas, CTGAN, and Bayesian Networks for data generation
To assess the methods' performance in terms of statistical metrics and privacy protection
Method
Data Collection
Description of the dataset used for the study
Criteria for selecting the accessibility services dataset
Data Preprocessing
Techniques for preparing the data for synthetic generation
Handling missing values, categorical variables, and ensuring data integrity
Model Selection
Overview of Gaussian copulas, CTGAN, and Bayesian Networks
Rationale for focusing on Bayesian Networks, which the study finds perform best on statistical metrics and privacy
Evaluation Metrics
Statistical metrics used to compare the generated data
Privacy assessment methods to evaluate data protection
Results
Performance Comparison
Statistical metrics analysis of Gaussian copulas, CTGAN, and Bayesian Networks
Privacy evaluation of the generated data
Real-World Pattern Reflection
Analysis of how well the synthetic data reflects real-world patterns in accessibility services
Privacy and Confidentiality
Detailed examination of the privacy protection offered by Bayesian Networks
Discussion on the implications for sensitive research involving people with disabilities
Discussion
Methodological Insights
Strengths and limitations of using Bayesian Networks for synthetic data generation
Comparison with Gaussian copulas and CTGAN in terms of efficiency and effectiveness
Practical Implications
Importance of the study for accessibility services research and development
Potential applications in enhancing data privacy and confidentiality in sensitive research areas
Future Directions
Suggestions for further research on synthetic data generation methods
Exploration of integrating other techniques or improving existing methods for better performance
Conclusion
Summary of Findings
Recap of the study's main results and their significance
Implications for Practice
Recommendations for practitioners in the field of accessibility services
Call for Further Research
Areas for future investigation to advance synthetic data generation techniques
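The outline's evaluation step compares real and synthetic data with statistical metrics but does not name them here. As one possible illustration, the sketch below computes the total variation distance between the category frequencies of a real column and a synthetic column. Choosing this metric is an assumption made for the example, not a statement of what the study used, and the sample columns are made up.

```python
from collections import Counter


def category_frequencies(values):
    """Return the relative frequency of each category in a column."""
    if not values:
        raise ValueError("column must contain at least one value")
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    real = category_frequencies(real_values)
    synth = category_frequencies(synthetic_values)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(c, 0.0) - synth.get(c, 0.0)) for c in categories)


# Made-up example columns, purely for illustration.
real_column = ["a", "a", "b", "c"]
synthetic_column = ["a", "b", "b", "c"]
tvd = total_variation_distance(real_column, synthetic_column)
```

A lower distance means the synthetic column's category proportions are closer to the real ones. It says nothing about privacy, which needs a separate assessment.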