By: Anisha Sagi
What is Synthetic data?
Synthetic data is artificially generated data crafted to emulate the statistical attributes of an original dataset without exposing its specific data points. The result is an anonymous dataset that preserves high data utility while maximizing privacy.
What is the need for Synthetic data generation?
During the software development lifecycle, developers and testers often have limited or no access to usable production data for reasons such as restricted access, security, and compliance. Yet production-like data is critical for building data pipelines and applications, testing realistic scenarios, and training models. Synthetic data helps organizations mitigate these compliance and security challenges, providing the data needed to support research and analysis and to build and test applications while protecting data and customer privacy. Gartner’s research suggests that synthetic data may play a significant role in AI models by 2030.
Key Challenges:
Several challenges must be addressed to ensure the quality and usefulness of generated synthetic data:
- Maintaining Data Diversity: Generating synthetic data that accurately represents real-world data while preserving diversity is complex; overfitting or generating unrealistic values limits the utility of synthetic data for research and AI model training.
- Preserving Privacy: Maintaining data and customer privacy and confidentiality through data masking and anonymization, and handling Personally Identifiable Information (PII) and other sensitive data, while ensuring the resulting data remains meaningful and retains specific production-like scenarios.
- Generating Large Volumes of Scenario-Based Data for Testing: Provisioning large volumes of data in lower environments for performance testing, data science and modeling use cases, quality testing, and similar needs.
- Managing Data Relationships: Maintaining interdependencies and complex patterns across data elements (e.g., preserving relationships between the columns of a structured table).
- Costs: Accessing and provisioning real-world data requires significant resources, effort, and time due to the need for data anonymization, consent management, and compliance with legal and ethical requirements.
What are the types of Synthetic data?
- Structured Synthetic Data: Structured synthetic data is numeric or non-numeric data in a tabular format, stored in database tables and used for analytical reporting. Examples of structured synthetic data include financial accounts and transactions, healthcare data, and customer data.
- Unstructured Synthetic Data: Unstructured synthetic data can be generated from files, images, videos, or other unstructured sources and used for training and testing models. Examples of unstructured synthetic data include sample files created from claims records, tax files, credit memos, insurance policies, health reports, etc.
What is the approach to generating Synthetic data?
- Sample-based data generation: Sample-based data generation creates data from production-like sample values. The generated data reflects the behavior of individual fields; however, it might not reflect the interdependence or complex relationships between columns, tables, or unstructured files.
- Schema-based data generation: Schema-based data generation creates synthetic data from the underlying data model, drawing on the modeling structure, restrictions, and underlying relationships, and ensures the generated data respects physical constraints. However, because it lacks data context and an in-depth understanding of the data, it might not reflect the underlying complexities of each column.
- Hybrid data generation: Hybrid data generation combines the two approaches above and can bring the best of both: data is generated within the boundaries of the data model, while the production-like samples supplied ensure the generated data depicts complex, realistic scenarios (a minimal sketch follows below).
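To make the hybrid approach concrete, below is a minimal Python sketch (not a production tool) that generates rows within an assumed schema's boundaries while drawing categorical values from a small set of production-like samples. The column names, constraints, and sample values are illustrative assumptions, not taken from any real data model.

```python
import random

# Illustrative schema: column names, types, and constraints are assumptions.
schema = {
    "account_id":   {"type": "int",   "min": 100000, "max": 999999},
    "balance":      {"type": "float", "min": 0.0,    "max": 50000.0},
    "account_type": {"type": "category"},  # values come from production-like samples
}

# A few production-like sample values supplied by the team (hypothetical).
samples = {"account_type": ["CHECKING", "SAVINGS", "MONEY_MARKET"]}

def generate_row(schema, samples):
    """Generate one synthetic row within schema boundaries, reusing sample values
    where the schema alone cannot convey realistic content."""
    row = {}
    for column, rules in schema.items():
        if rules["type"] == "int":
            row[column] = random.randint(rules["min"], rules["max"])
        elif rules["type"] == "float":
            row[column] = round(random.uniform(rules["min"], rules["max"]), 2)
        elif rules["type"] == "category":
            row[column] = random.choice(samples[column])
    return row

synthetic_table = [generate_row(schema, samples) for _ in range(1000)]
print(synthetic_table[0])
```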

Generating Synthetic Data using Generative AI:
Generative AI employs artificial intelligence algorithms to produce data mimicking human-created content, including text, images, and virtual environments. As generative AI models evolve, improvements in synthetic data quality, diversity, and fidelity can be expected, enabling more accurate, versatile applications in research, AI model training, and other areas.
- Large Language Models (LLMs): LLMs, combined with suitable prompts and few-shot examples, represent a more recent approach: they can a) generate sample data directly or b) generate programs that in turn generate data (a prompt-based sketch appears after this list).
- Generative Adversarial Networks (GANs): GANs are a popular class of generative models used for synthetic data generation. They pit two neural networks against each other: a generator that produces synthetic data and a discriminator that evaluates its authenticity. Trained together, the two networks learn to capture complex data distributions, column dependencies, and correlations, improving the quality of the generated data over time (a small GAN sketch also appears after this list).
- Variational Autoencoders (VAEs): VAEs are another generative model commonly used in synthetic data generation. They utilize an encoder and a decoder network: the encoder transforms the real data into a simplified representation, and the decoder reconstructs data from this representation. VAEs can generate new synthetic data samples by sampling from this simplified space.
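As a rough illustration of the LLM option (a), the sketch below builds a prompt from a few fabricated example rows and asks an LLM to return new rows as JSON. The call_llm function is a hypothetical placeholder (here it returns a canned response so the sketch runs end to end); in practice it would call whichever LLM provider is in use, and the field names are invented for illustration.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM client call. In practice this would send
    a chat/completion request to the provider in use; here it returns a canned
    JSON string so the sketch runs end to end."""
    return json.dumps([
        {"customer_id": 2001, "segment": "RETAIL", "monthly_spend": 305.75},
        {"customer_id": 2002, "segment": "SMB", "monthly_spend": 2210.40},
    ])

# Few-shot examples (fabricated values) showing the model the target row shape.
few_shot_rows = [
    {"customer_id": 1001, "segment": "RETAIL", "monthly_spend": 412.50},
    {"customer_id": 1002, "segment": "SMB", "monthly_spend": 1890.00},
]

prompt = (
    "You generate synthetic tabular data. Return only a JSON array of 20 new rows "
    "with the same fields as the examples, using realistic but fictional values.\n"
    "Examples:\n" + json.dumps(few_shot_rows, indent=2)
)

synthetic_rows = json.loads(call_llm(prompt))  # parse the model's JSON output
print(len(synthetic_rows), "rows generated")
```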
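The following is a minimal GAN sketch, assuming PyTorch is available and that the real table has already been reduced to scaled numeric columns; the dimensions, placeholder data, and training settings are illustrative, not a production recipe (purpose-built tabular GANs add handling for categorical columns, mode collapse, and conditional sampling).

```python
import torch
import torch.nn as nn

n_features = 4    # number of numeric columns in the (assumed) scaled real table
latent_dim = 16   # size of the random noise vector fed to the generator

# Generator: maps random noise to a synthetic row.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
# Discriminator: scores how "real" a row looks (1 = real, 0 = synthetic).
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_data = torch.randn(512, n_features)  # stand-in for a scaled real dataset
ones = torch.ones(real_data.size(0), 1)
zeros = torch.zeros(real_data.size(0), 1)

for step in range(1000):
    # Train the discriminator to separate real rows from generated rows.
    fake = generator(torch.randn(real_data.size(0), latent_dim)).detach()
    d_loss = bce(discriminator(real_data), ones) + bce(discriminator(fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator to fool the discriminator.
    fake = generator(torch.randn(real_data.size(0), latent_dim))
    g_loss = bce(discriminator(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, sample new synthetic rows from random noise.
synthetic_rows = generator(torch.randn(100, latent_dim)).detach()
```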
How do we evaluate generated Synthetic data?
Synthetic data can mirror real-world datasets without compromising privacy or security. However, ensuring its quality and integrity requires a meticulous approach to evaluation and measurement.
- Ensuring Comprehensive Coverage: The first step in evaluating synthetic data is to ensure comprehensive coverage. Synthetic data must replicate all relevant business scenarios and edge cases found in the original datasets, which involves understanding the key relationships and dependencies within data. Data scientists adhere to technical rules, data models, data samples, and business requirements to ensure that synthetic data maintains consistency and completeness while avoiding missing or inconsistent values.
- Achieving Statistical Similarity: To confirm similar distributions, summary statistics such as means and standard deviations must be compared between synthetic and real data. Cumulative distribution function (CDF) plots help visualize cumulative probabilities and confirm that the generated synthetic data captures the real distribution. Chi-square tests assess the similarity of categorical data, while histograms and density plots help identify discrepancies. Additionally, correlation metrics help verify that relationships and dependencies are preserved (see the sketch after this list).
- Evaluating Efficiency Metrics: Time efficiencies are measured by comparing the total time to generate synthetic data against existing processes, factoring in data generation time and development hours. Cost efficiencies use baseline cost metrics for benchmarking to evaluate the technical costs of generating synthetic data compared to utilizing existing tools.
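To illustrate the statistical-similarity checks above, the sketch below (assuming pandas, NumPy, and SciPy, with fabricated example columns) compares means and standard deviations, quantifies distributional closeness with a Kolmogorov-Smirnov test as one common complement to CDF plots, runs a chi-square test on a categorical column, and compares correlation matrices.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Fabricated example tables; in practice these would be the real and synthetic datasets.
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "amount": rng.normal(100, 20, 5000),
    "channel": rng.choice(["WEB", "BRANCH", "MOBILE"], 5000),
})
real["tenure_months"] = real["amount"] * 0.3 + rng.normal(0, 5, 5000)
synthetic = pd.DataFrame({
    "amount": rng.normal(102, 21, 5000),
    "channel": rng.choice(["WEB", "BRANCH", "MOBILE"], 5000),
})
synthetic["tenure_months"] = synthetic["amount"] * 0.28 + rng.normal(0, 5, 5000)
numeric_cols = ["amount", "tenure_months"]

# 1) Summary statistics: compare means and standard deviations of numeric columns.
print(real[numeric_cols].mean(), synthetic[numeric_cols].mean())
print(real[numeric_cols].std(), synthetic[numeric_cols].std())

# 2) Distribution similarity: a Kolmogorov-Smirnov test quantifies how close the
#    two empirical CDFs are for a numeric column.
ks = stats.ks_2samp(real["amount"], synthetic["amount"])
print("KS statistic:", ks.statistic, "p-value:", ks.pvalue)

# 3) Categorical similarity: chi-square test on the real-vs-synthetic contingency table.
channel = pd.concat([real["channel"], synthetic["channel"]], ignore_index=True)
source = ["real"] * len(real) + ["synthetic"] * len(synthetic)
chi2, p, dof, _ = stats.chi2_contingency(pd.crosstab(channel, source))
print("Chi-square p-value:", p)

# 4) Relationships: compare correlation matrices of the numeric columns.
corr_gap = (real[numeric_cols].corr() - synthetic[numeric_cols].corr()).abs()
print("Max correlation difference:", corr_gap.to_numpy().max())
```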
These evaluations ensure that synthetic data is robust, reliable, and usable, thus enabling organizations to leverage synthetic data confidently to drive innovation and insights in data science.