Impact of Data Perturbation for Statistical Disclosure Control on the Predictive Performance of Machine Learning Techniques

Thomas Johnson, Sayed A. Mostafa

Research output: Contribution to journal › Article › peer-review

Abstract

The rapid accumulation and release of data have fueled research across various fields. While numerous methods exist for data collection and storage, data distribution presents challenges, as some datasets are restricted, and certain subsets may compromise privacy if released unaltered. Statistical disclosure control (SDC) aims to maximize data utility while minimizing the disclosure risk, i.e., the risk of individual identification. A key SDC method is data perturbation, with General Additive Data Perturbation (GADP) and Copula General Additive Data Perturbation (CGADP) being two prominent approaches. Both leverage multivariate normal distributions to generate synthetic data while preserving statistical properties of the original dataset. Given the increasing use of machine learning for data modeling, this study compares the performance of various machine learning models on GADP- and CGADP-perturbed data. Using Monte Carlo simulations with three data-generating models and a real dataset, we evaluate the predictive performance and robustness of ten machine learning techniques under data perturbation. Our findings provide insights into the machine learning techniques that perform robustly on GADP- and CGADP-perturbed datasets, extending previous research that primarily focused on simple statistics such as means, variances, and correlations.
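The idea of perturbing data with a fitted normal model can be illustrated with a minimal sketch. This is not the authors' GADP or CGADP implementation; it is a simplified, assumed variant with one confidential variable `x` and one non-confidential variable `s`. Each confidential value is replaced by a draw from the estimated normal distribution of `x` given `s`, so the mean and variance of `x` and its covariance with `s` are approximately preserved. The function name `perturb_conditional_normal` and the toy data are hypothetical.

```python
import random
import statistics


def perturb_conditional_normal(x, s, seed=0):
    """Replace each confidential value x[i] with a draw from the normal
    distribution of X given S = s[i], with parameters estimated from the sample.

    Assumption: (X, S) is treated as bivariate normal. This is an
    illustrative simplification, not the method evaluated in the article.
    """
    rng = random.Random(seed)
    n = len(x)
    mean_x, mean_s = statistics.fmean(x), statistics.fmean(s)
    var_x, var_s = statistics.variance(x), statistics.variance(s)
    # Sample covariance between the confidential and non-confidential variable.
    cov_xs = sum((a - mean_x) * (b - mean_s) for a, b in zip(x, s)) / (n - 1)

    slope = cov_xs / var_s                            # regression slope of X on S
    resid_sd = max(var_x - slope * cov_xs, 0.0) ** 0.5  # conditional std. deviation
    return [mean_x + slope * (b - mean_s) + rng.gauss(0.0, resid_sd) for b in s]


# Toy usage: s is released as-is, x is released only in perturbed form.
rng = random.Random(1)
s = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x = [2.0 + 1.5 * v + rng.gauss(0.0, 1.0) for v in s]
y = perturb_conditional_normal(x, s)
print(statistics.fmean(x), statistics.fmean(y))
print(statistics.variance(x), statistics.variance(y))
```

Because the perturbed values depend on the original `x` only through the estimated moments, summary statistics stay close to the originals while individual records change. Whether predictive models trained on such data remain accurate is the kind of question the study examines.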

Original language: English
Pages (from-to): 312-331
Number of pages: 20
Journal: Journal of Data Science
Volume: 23
Issue number: 2
State: Published - Apr 2025

Keywords

  • data confidentiality
  • data perturbation
  • machine learning
  • predictive modeling
  • statistical disclosure control
