Probabilistic causal graphs as categorical data synthesizers: Do they do better than Gaussian Copulas and Conditional Tabular GANs?

Olha Shaposhnyk, Noor Abid, Mouri Zakir, Svetlana Yanushkevich · April 15, 2025

Summary

The study evaluates synthetic categorical data generation with probabilistic causal graphs, applied to accessibility services data about people with disabilities. It compares Bayesian Networks against Gaussian copulas and CTGAN and finds Bayesian Networks superior on both statistical metrics and privacy. Because the synthetic data reflects real-world patterns while keeping records confidential, the approach supports analysis in sensitive research settings.

Introduction

Background

Overview of accessibility services for people with disabilities

Importance of data in understanding and improving accessibility services

Objective

To evaluate synthetic categorical data generation methods using probabilistic causal graphs

To compare Gaussian copulas, CTGAN, and Bayesian Networks for data generation

To assess the methods' performance in terms of statistical metrics and privacy protection

Method

Data Collection

Description of the dataset used for the study

Criteria for selecting the accessibility services dataset

Data Preprocessing

Techniques for preparing the data for synthetic generation

Handling missing values, categorical variables, and ensuring data integrity
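As an illustration of the preprocessing step above, here is a minimal sketch that normalizes categorical values and turns missing entries into an explicit category. The summary does not describe the study's actual pipeline, so the column names, the `MISSING` label, and the `clean_categorical` helper are all hypothetical.

```python
from collections import Counter

MISSING = "Missing"  # hypothetical placeholder label, not taken from the study


def clean_categorical(rows, columns):
    """Strip whitespace from values and replace None/empty entries with MISSING."""
    cleaned = []
    for row in rows:
        out = {}
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                out[col] = MISSING
            else:
                out[col] = str(value).strip()
        cleaned.append(out)
    return cleaned


# Made-up example records; the real dataset's fields are not listed in the summary.
rows = [
    {"service": "Transport ", "region": "North"},
    {"service": None, "region": "South"},
    {"service": "Housing", "region": ""},
]
cleaned = clean_categorical(rows, ["service", "region"])
counts = Counter(r["service"] for r in cleaned)
```

Keeping missingness as its own category, rather than dropping rows, lets a categorical generator reproduce the missing-data pattern. Whether the study made that choice is not stated here.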

Model Selection

Overview of Gaussian copulas, CTGAN, and Bayesian Networks

Rationale for choosing Bayesian Networks, based on their superior performance in statistical metrics and privacy
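To make the Bayesian Network option concrete, the sketch below generates categorical records by ancestral sampling: each variable is drawn from its conditional probability table once its parents are known. This is a generic illustration, not the authors' implementation. The two-node graph, the variable names, and every probability are invented for the example; in practice the structure and tables would be learned from the real data.

```python
import random

# Hypothetical two-node network: service_type -> outcome.
# All names and probabilities are made up for illustration.
P_SERVICE = {"transport": 0.5, "housing": 0.3, "employment": 0.2}
P_OUTCOME_GIVEN_SERVICE = {
    "transport": {"resolved": 0.7, "pending": 0.3},
    "housing": {"resolved": 0.4, "pending": 0.6},
    "employment": {"resolved": 0.5, "pending": 0.5},
}


def draw(dist, rng):
    """Draw one category from a {category: probability} table."""
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def sample_records(n, seed=0):
    """Ancestral sampling: sample the parent first, then the child given the parent."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        service = draw(P_SERVICE, rng)
        outcome = draw(P_OUTCOME_GIVEN_SERVICE[service], rng)
        records.append({"service_type": service, "outcome": outcome})
    return records


synthetic = sample_records(1000)
```

Because every synthetic record is drawn from the fitted distributions rather than copied from a real row, the dependency between variables is preserved without reproducing individual records directly.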

Evaluation Metrics

Statistical metrics used to compare the generated data

Privacy assessment methods to evaluate data protection
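The summary does not name the specific metrics used, so the following sketch shows two common choices purely as assumptions: total variation distance between the real and synthetic distributions of one categorical column (a statistical fidelity check), and the share of synthetic rows that exactly duplicate a real row (a simple privacy proxy). The function names and sample rows are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two empirical categorical distributions (0 = identical)."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_values), len(synthetic_values)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth) for c in categories
    )


def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that duplicate some real row exactly."""
    real_set = {tuple(row) for row in real_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in real_set)
    return matches / len(synthetic_rows)


# Made-up rows of (service_type, outcome) for illustration only.
real = [("transport", "resolved"), ("housing", "pending"),
        ("transport", "pending"), ("employment", "resolved")]
synth = [("transport", "resolved"), ("housing", "resolved"),
         ("transport", "resolved"), ("housing", "pending")]

tvd_service = total_variation_distance([r[0] for r in real], [s[0] for s in synth])
match_rate = exact_match_rate(real, synth)
print(tvd_service, match_rate)  # 0.25 0.75
```

With only a few low-cardinality columns, exact matches are expected even from a well-behaved generator, so a match rate is best read alongside the number of distinct value combinations rather than on its own.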

Results

Performance Comparison

Statistical metrics analysis of Gaussian copulas, CTGAN, and Bayesian Networks

Privacy evaluation of the generated data

Real-World Pattern Reflection

Analysis of how well the synthetic data reflects real-world patterns in accessibility services

Privacy and Confidentiality

Detailed examination of the privacy protection offered by Bayesian Networks

Discussion on the implications for sensitive research involving people with disabilities

Discussion

Methodological Insights

Strengths and limitations of using Bayesian Networks for synthetic data generation

Comparison with Gaussian copulas and CTGAN in terms of efficiency and effectiveness

Practical Implications

Importance of the study for accessibility services research and development

Potential applications in enhancing data privacy and confidentiality in sensitive research areas

Future Directions

Suggestions for further research on synthetic data generation methods

Exploration of integrating other techniques or improving existing methods for better performance

Conclusion

Summary of Findings

Recap of the study's main results and their significance

Implications for Practice

Recommendations for practitioners in the field of accessibility services

Call for Further Research

Areas for future investigation to advance synthetic data generation techniques
