Singapore’s PDPC Issues Guide on Synthetic Data Generation

On 15 July 2024, Singapore’s Personal Data Protection Commission (PDPC) published the “Privacy Enhancing Technology (PET): Proposed Guide on Synthetic Data Generation”. This guide, developed with industry collaboration, aims to provide organizations with comprehensive recommendations on creating and using synthetic data while ensuring privacy and compliance with data protection regulations.

Synthetic data, created through mathematical models or AI/ML algorithms, mimics the characteristics and structure of real data without directly reproducing the personal information in the source records. The guide emphasizes the potential of synthetic data to drive innovation, enhance AI model training, and facilitate data sharing and software testing. However, it also warns that synthetic data still carries re-identification risks and stresses the need for robust data protection measures.

Key Applications

The guide identifies several key applications for synthetic data:

  • AI Model Training: Synthetic data can augment training datasets, especially when real data is sparse or expensive to obtain. For example, J.P. Morgan successfully used synthetic data to improve fraud detection models.
  • Data Sharing and Analysis: Synthetic data enables data sharing in sensitive sectors like healthcare without compromising privacy. Johnson & Johnson utilized synthetic data to improve healthcare data analysis for external research.
  • Software Testing: Using synthetic data in development environments helps prevent data breaches and protects sensitive production data.

Recommendations

The PDPC recommends a structured approach to generating synthetic data:

  1. Know Your Data: Understand the source data and identify the necessary insights and attributes to preserve.
  2. Prepare Your Data: Apply data minimization, remove direct identifiers, and document data attributes in a data dictionary.
  3. Generate Synthetic Data: Use appropriate methods like Bayesian networks or generative adversarial networks (GANs), ensuring data integrity, fidelity, and utility.
  4. Assess Re-identification Risks: Evaluate the risk of re-identification using methods like attribution disclosure and membership disclosure.
  5. Manage Residual Risks: Implement governance, technical, and contractual controls to mitigate any remaining risks.
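
To make steps 2 to 4 more concrete, here is a minimal Python sketch. It is not taken from the PDPC guide: the column names, the sample records, and the helper functions are all hypothetical. The generator samples each attribute independently from its observed values, which is a deliberately simple stand-in for the Bayesian networks or GANs the guide mentions, and it does not preserve correlations between attributes. The exact-match rate is only a crude proxy for re-identification risk, not a full attribution or membership disclosure assessment.

```python
import random

# Hypothetical direct identifiers to strip before generation (step 2).
DIRECT_IDENTIFIERS = {"name", "id_number"}


def prepare(records):
    """Step 2: remove direct identifiers from every record."""
    return [
        {k: v for k, v in r.items() if k not in DIRECT_IDENTIFIERS}
        for r in records
    ]


def generate(records, n, seed=0):
    """Step 3 (simplified): sample each attribute independently from the
    values observed in the source data. Assumption: this ignores
    correlations, unlike a Bayesian network or GAN."""
    rng = random.Random(seed)
    columns = list(records[0].keys())
    pools = {c: [r[c] for r in records] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]


def exact_match_rate(source, synthetic):
    """Step 4 (crude proxy): share of synthetic rows that are identical
    to some row in the prepared source data."""
    source_rows = {tuple(sorted(r.items())) for r in source}
    hits = sum(tuple(sorted(r.items())) in source_rows for r in synthetic)
    return hits / len(synthetic)


# Made-up sample records, for illustration only.
source = [
    {"name": "A", "id_number": "X001", "age_band": "30-39", "region": "North", "diagnosis": "flu"},
    {"name": "B", "id_number": "X002", "age_band": "40-49", "region": "East", "diagnosis": "asthma"},
    {"name": "C", "id_number": "X003", "age_band": "30-39", "region": "West", "diagnosis": "flu"},
    {"name": "D", "id_number": "X004", "age_band": "50-59", "region": "North", "diagnosis": "diabetes"},
]

prepared = prepare(source)
synthetic = generate(prepared, n=10, seed=42)
rate = exact_match_rate(prepared, synthetic)
print(f"Synthetic rows identical to a source row: {rate:.0%}")
```

A high exact-match rate would suggest the synthetic set is leaking source records and that the generation settings or residual-risk controls (step 5) need another look. A low rate on its own does not show the data is safe.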

The guide is meant as a living document, set to evolve with advancements in synthetic data technologies and methodologies. It is available on the PDPC website.
