Tags: Synthetic Data, Evaluation, ML, Quality
Synthetic Data Quality: Metrics Beyond “Looks Real”
Synthetic data is only useful when it improves downstream performance. Measure coverage, fidelity, bias, and leakage—then test on real-world holdouts.
Define success by downstream tasks
The best metric is simple: does a model trained with the synthetic data perform better on a validation set of real, representative examples? If not, the synthetic set is noise.
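As a concrete illustration, here is a minimal sketch of that check, assuming a binary classification task on tabular arrays and a scikit-learn model; the arrays X_real, y_real, X_syn, y_syn and the model choice are placeholders for your own setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def downstream_lift(X_real, y_real, X_syn, y_syn, seed=0):
    """Compare a model trained on real data alone vs. real + synthetic,
    scored on the same held-out slice of real data (binary task assumed)."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_real, y_real, test_size=0.3, random_state=seed, stratify=y_real)

    baseline = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    augmented = GradientBoostingClassifier(random_state=seed).fit(
        np.vstack([X_tr, X_syn]), np.concatenate([y_tr, y_syn]))

    base_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])
    aug_auc = roc_auc_score(y_val, augmented.predict_proba(X_val)[:, 1])
    return aug_auc - base_auc  # positive lift means the synthetic data helped
```

The key design choice is that the holdout contains only real data: synthetic rows may enter training, but never the yardstick.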
Track coverage (are rare cases represented?), fidelity (are records plausible?), bias (are subgroups skewed?), and privacy (leakage risk) as separate dimensions, since a single blended score can hide a failure in any one of them.
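To make two of those dimensions concrete, here is one way to proxy them on tabular data, sketched with function names of my own: coverage as the share of real categories a synthetic column reproduces, and fidelity as a classifier two-sample test, where a discriminator AUC near 0.5 means real and synthetic rows are hard to tell apart:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def category_coverage(real_col, syn_col):
    """Fraction of distinct real values that appear at least once in synthetic."""
    real_values = set(real_col)
    return len(real_values & set(syn_col)) / len(real_values)

def fidelity_auc(X_real, X_syn, seed=0):
    """Train a discriminator to separate real from synthetic rows.
    AUC ~0.5 = plausible synthetic data; AUC ~1.0 = easily distinguishable."""
    X = np.vstack([X_real, X_syn])
    y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_syn))])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```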
Avoid leakage
If synthetic data is generated from sensitive sources, you need leakage tests and strict access control.
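One common leakage test, sketched below under the assumption of numeric features, flags synthetic rows that sit suspiciously close to a real record, which can indicate memorization; the percentile threshold is illustrative and should be calibrated against real-to-real nearest-neighbor distances:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def leakage_candidates(X_real, X_syn, quantile=0.01):
    """Return indices of synthetic rows that are closer to some real record
    than the 1st percentile of real-to-real nearest-neighbor distances."""
    nn_real = NearestNeighbors(n_neighbors=2).fit(X_real)
    # Distance from each real row to its nearest *other* real row
    # (index 0 is the row itself at distance 0, so take column 1).
    real_dists = nn_real.kneighbors(X_real)[0][:, 1]
    threshold = np.quantile(real_dists, quantile)

    syn_dists = nn_real.kneighbors(X_syn, n_neighbors=1)[0][:, 0]
    return np.where(syn_dists < threshold)[0]
```

Any flagged rows deserve manual review before the set is released, and a nonempty result is a reason to tighten the generator, not to silently drop the rows.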
A practical approach is to generate from abstracted schemas and distributions, not raw user records.
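As a sketch of that idea, the following fits only per-column summaries (a Gaussian for numeric columns, category frequencies otherwise) and samples fresh rows from those summaries alone; the schema format is hypothetical, and independent marginals deliberately discard cross-column correlations, trading fidelity for privacy:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fit_schema(real_df):
    """Reduce a DataFrame to per-column summaries: mean/std for numeric
    columns, category frequencies for everything else. No raw rows are kept."""
    schema = {}
    for col in real_df.columns:
        if np.issubdtype(real_df[col].dtype, np.number):
            schema[col] = ("numeric", real_df[col].mean(), real_df[col].std())
        else:
            freqs = real_df[col].value_counts(normalize=True)
            schema[col] = ("categorical", list(freqs.index), list(freqs))
    return schema

def sample_from_schema(schema, n):
    """Draw n synthetic rows from the fitted summaries alone."""
    out = {}
    for col, spec in schema.items():
        if spec[0] == "numeric":
            _, mu, sd = spec
            out[col] = rng.normal(mu, sd, size=n)
        else:
            _, values, probs = spec
            out[col] = rng.choice(values, size=n, p=probs)
    return pd.DataFrame(out)
```

Because only aggregates cross the privacy boundary, access control becomes a review of the schema itself rather than of every generated record.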