Who should use the Generate synthetic data workflow?
Teams or solo builders working on development tasks who want a repeatable process instead of one-off tool experiments.
AI Workflow · Development
Practical execution plan for generating synthetic data, with clear steps, mapped tools, and delivery-focused outcomes.
Deliverable outcome
A fully documented, packaged synthetic dataset delivered to the end user with clear instructions for use.
30-90 minutes
Includes setup plus initial result generation
Free to start
You can swap tools by pricing and policy requirements
Use each step's output as the input for the next step.
Step map
Instead of relying on a single generic AI model, this pipeline connects specialized tools, one per step. First, you use Google AppSheet AI to produce a complete schema document and quality criteria that guide all subsequent generation steps. You then pass that output to YData Fabric to build a statistical profile of the real data that informs the generation model parameters and validation thresholds. Next, NVIDIA NeMo Data Designer handles three consecutive steps: it sets up a configured generation pipeline with all dependencies and parameters set, produces a raw synthetic dataset of the required size and format, and validates that dataset against predefined quality metrics and schema constraints. The validated dataset then goes to Tonic AI to produce a privacy-preserving synthetic dataset that balances utility and compliance requirements. Finally, dbt Cloud (AI-Powered) is used to produce a fully documented, packaged synthetic dataset delivered to the end user with clear instructions for use.
Define data requirements and schema
A complete schema document and quality criteria that guide all subsequent generation steps.
Collect and analyze real data sample (optional)
A statistical profile of the real data that informs the generation model parameters and validation thresholds.
Select and configure generation method
A configured generation pipeline ready to produce synthetic records, with all dependencies and parameters set.
Generate initial synthetic dataset
A raw synthetic dataset of the required size and format, ready for validation and refinement.
Validate and refine synthetic data quality
A validated synthetic dataset that meets predefined quality metrics and schema constraints.
Apply privacy and masking (optional)
A privacy-preserving synthetic dataset that balances utility and compliance requirements.
Package and deliver synthetic data
A fully documented, packaged synthetic dataset delivered to the end user with clear instructions for use.
Start by specifying the domain, data types (tabular, text, image, time-series), and statistical properties (distributions, correlations, missing rates) of the target real-world dataset. Document the schema including field names, data types, value ranges, and any constraints (e.g., foreign keys, uniqueness). This blueprint ensures the synthetic data will be fit for purpose.
Why Google AppSheet AI: Google AppSheet AI directly supports Natural Language to SQL Schema generation, which is the core need for defining data requirements and schema.
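As a rough illustration of the schema blueprint described in this step, the sketch below records field names, types, ranges, and uniqueness in plain Python and checks rows against them. The "customers" fields, the Field class, and the violations helper are hypothetical names chosen for the example; they are not part of Google AppSheet AI or any other tool named here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """One column of the schema: name, type, and optional constraints."""
    name: str
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    unique: bool = False
    nullable: bool = False


# Hypothetical schema for an illustrative "customers" table.
SCHEMA = [
    Field("customer_id", int, min_value=1, unique=True),
    Field("age", int, min_value=18, max_value=99),
    Field("plan", str),
    Field("monthly_spend", float, min_value=0.0, nullable=True),
]


def violations(rows, schema):
    """Return human-readable descriptions of every constraint violation."""
    problems = []
    seen = {f.name: set() for f in schema if f.unique}
    for i, row in enumerate(rows):
        for f in schema:
            value = row.get(f.name)
            if value is None:
                if not f.nullable:
                    problems.append(f"row {i}: {f.name} is missing")
                continue
            if not isinstance(value, f.dtype):
                problems.append(f"row {i}: {f.name} has type {type(value).__name__}")
                continue
            if f.min_value is not None and value < f.min_value:
                problems.append(f"row {i}: {f.name} below {f.min_value}")
            if f.max_value is not None and value > f.max_value:
                problems.append(f"row {i}: {f.name} above {f.max_value}")
            if f.unique:
                if value in seen[f.name]:
                    problems.append(f"row {i}: {f.name} duplicates {value!r}")
                seen[f.name].add(value)
    return problems
```

Keeping the schema as data rather than prose means the same definition can drive both generation and validation in later steps.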
If available, gather a representative sample of real data (anonymized if needed) to extract statistical patterns — distributions, correlations, and anomalies. Perform exploratory data analysis (EDA) using histograms, correlation matrices, and missing value analysis. This step is optional but highly recommended for high-fidelity synthetic data.
Why YData Fabric: YData Fabric provides data profiling, which directly matches the need to analyze a real data sample, in the same way you would with a tool like pandas-profiling (now published as ydata-profiling).
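For a sense of what such a profile contains, here is a minimal sketch using only the Python standard library. It computes a per-column summary (missing rate, mean, standard deviation, range) and a Pearson correlation between two columns. The function names are illustrative assumptions, and a dedicated profiling tool would report far more than this.

```python
import math
import statistics


def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_rate": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)
```

The resulting numbers can serve as targets for the generator and as reference values for the validation step.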
Choose a synthetic data generation approach based on your data type and requirements: rule-based (e.g., Faker for simple fields), statistical models (e.g., Gaussian copula for tabular data), or deep learning (e.g., GANs, VAEs for images/text). Configure parameters such as model architecture, training epochs, and privacy settings (e.g., differential privacy epsilon).
Why NVIDIA NeMo Data Designer: NVIDIA NeMo Data Designer is specifically designed for synthetic data generation, matching the need to configure a generation method.
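The choice described in this step can be written down as a small decision helper. The sketch below is a heuristic under stated assumptions, not the configuration format of NVIDIA NeMo Data Designer: GenerationConfig, choose_method, the method labels, and the epoch count of 100 are all placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerationConfig:
    """Hypothetical settings bundle for one generation run."""
    method: str
    epochs: int = 0                      # only meaningful for trained models
    dp_epsilon: Optional[float] = None   # differential privacy budget, if any


def choose_method(data_type: str, has_real_sample: bool,
                  dp_epsilon: Optional[float] = None) -> GenerationConfig:
    """Map data type and sample availability to a generation approach."""
    if not has_real_sample:
        # Nothing to learn from, so fall back to rule-based generation.
        return GenerationConfig(method="rule_based")
    if data_type == "tabular":
        return GenerationConfig(method="gaussian_copula", dp_epsilon=dp_epsilon)
    if data_type in ("image", "text"):
        # 100 epochs is an arbitrary placeholder; tune it for your model.
        return GenerationConfig(method="deep_generative", epochs=100,
                                dp_epsilon=dp_epsilon)
    raise ValueError(f"no default method mapped for data type {data_type!r}")
```

Recording the decision as a config object makes the run reproducible and gives the packaging step something concrete to put in the metadata.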
Execute the generation process to produce a dataset of the desired size (e.g., 100,000 rows or 5,000 images). For deep learning methods, train the model on the real data sample (if available) then generate. For rule-based methods, run scripts that produce records conforming to the schema. Save the output in a standard format (CSV, Parquet, JSON).
Why NVIDIA NeMo Data Designer: NVIDIA NeMo Data Designer directly performs synthetic data generation, which is the primary task of this step.
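For the rule-based path, generation can be as simple as a seeded script. The following sketch uses only the standard library instead of Faker or a trained model; the field names, the plan list, and the log-normal spend distribution are assumptions made for illustration.

```python
import csv
import io
import random

PLANS = ["free", "basic", "pro"]  # hypothetical category values


def generate_rows(n, seed=0):
    """Produce n synthetic customer records; the same seed gives the same rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "customer_id": i,                       # unique by construction
            "age": rng.randint(18, 99),             # inclusive range
            "plan": rng.choice(PLANS),
            "monthly_spend": round(rng.lognormvariate(3.0, 0.5), 2),
        })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Fixing the seed makes a run repeatable, which helps when a later validation failure requires regenerating with adjusted parameters.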
Compare the synthetic dataset against the real data profile (or schema constraints) using statistical tests (e.g., Kolmogorov-Smirnov for distributions, differences between correlation matrices). Check for missing-value patterns, duplicate rows, and edge case coverage. If quality metrics are not met, adjust generation parameters (e.g., increase training epochs, add constraints) and regenerate.
Why NVIDIA NeMo Data Designer: NVIDIA NeMo Data Designer includes model evaluation, which can be used to validate synthetic data quality.
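To make the comparison concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) and a duplicate-row rate in pure Python. It returns only the statistic, not a p-value, and any pass/fail threshold you apply to it is your own choice. The function names are illustrative.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(real)
    b = sorted(synthetic)
    d = 0.0
    # Both CDFs are step functions, so the gap only changes at sample points.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def duplicate_rate(rows):
    """Fraction of rows that repeat an earlier row exactly."""
    unique = {tuple(sorted(row.items())) for row in rows}
    return 1 - len(unique) / len(rows)
```

A statistic of 0 means the two empirical distributions coincide and 1 means they do not overlap at all, so a per-column ceiling on this value is one simple quality gate.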
If the synthetic data was derived from real data and needs to be shared publicly, apply additional privacy techniques such as differential privacy, k-anonymity, or field-level masking (e.g., hashing IDs, rounding values). This step is optional but critical for compliance with regulations like GDPR or HIPAA.
Why Tonic AI: Tonic AI explicitly offers data masking, which is the core need for applying privacy and masking.
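Field-level masking of the kind mentioned above might look like the following sketch, which replaces IDs with a keyed hash and coarsens numeric values. The field names, the 5-year age bands, and the rounding step are assumptions for illustration. Keyed hashing is pseudonymization rather than anonymization, so it does not by itself provide differential privacy or k-anonymity.

```python
import hashlib
import hmac


def mask_id(value, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (truncated)."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_row(row, secret_key: bytes):
    """Return a masked copy of one record; the input row is left untouched."""
    masked = dict(row)
    masked["customer_id"] = mask_id(row["customer_id"], secret_key)
    masked["age"] = (row["age"] // 5) * 5                     # lower bound of a 5-year band
    masked["monthly_spend"] = round(row["monthly_spend"], -1)  # nearest 10
    return masked
```

Using a secret key keeps the mapping consistent across tables, so joins still work, while making it harder to reverse the hash by enumerating likely IDs.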
Finalize the synthetic dataset by adding metadata (generation date, method, schema version, quality report). Package into a deliverable format (e.g., zip file with CSV and PDF report, or API endpoint). Upload to a shared location (S3 bucket, internal data portal) and notify stakeholders with usage documentation.
Why dbt Cloud (AI-Powered): dbt Cloud (AI-Powered) provides AI-generated documentation, which is a key part of packaging and delivering synthetic data.
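The packaging step can be sketched with the standard library as a zip archive holding the data file, a metadata file, and a short usage note. The file names, metadata keys, and build_package helper are hypothetical; uploading the archive and notifying stakeholders are left out.

```python
import io
import json
import zipfile
from datetime import date


def build_package(csv_text, method, schema_version, quality_report,
                  generated_on=None):
    """Bundle data, metadata, and usage notes into zip bytes held in memory."""
    metadata = {
        "generated_on": (generated_on or date.today()).isoformat(),
        "method": method,
        "schema_version": schema_version,
        "quality_report": quality_report,
    }
    readme = (
        "data.csv      synthetic records, one row per entity\n"
        "metadata.json generation date, method, schema version, quality report\n"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.csv", csv_text)
        zf.writestr("metadata.json", json.dumps(metadata, indent=2))
        zf.writestr("README.txt", readme)
    return buf.getvalue()
```

The returned bytes can be written to disk or passed to whatever upload client your shared location uses.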
§ Before you start
You do not have to use the exact tools listed here. Start with the top pick for each step, then replace tools only if they do not fit your pricing, compliance, or output needs.
Open the mapped task page and compare top options side by side. Prioritize output quality, integration fit, and predictable cost before scaling.