Synthetic Data as a Privacy Tool: The Unsettled Legal Threshold, the Attack Surface, and How to Test Both
Apr 3, 2026

The governance case for synthetic data rests on a simple premise: real data carries legal risk and artificially generated data does not. Train a model on patient records, produce outputs that share none of them, and the privacy problem dissolves. Organisations have built compliance strategies around precisely this logic, deploying synthetic datasets to share data freely across borders, sidestep consent requirements, and narrow the scope of security obligations. The argument is not without merit. It also, in most implementations, proves too much.
Neither the GDPR nor India's DPDP Act 2023 addresses synthetic data directly. Both frameworks define personal data and attach a set of processing obligations to it; the GDPR also supplies an anonymisation standard, which the DPDP Act does not. Whether any given synthetic dataset falls inside or outside those obligations cannot be answered by pointing to how the data was generated. Identifiability is determined by what a realistic adversary can do, not by what the organisation intended to produce. That gap between intent and identifiability is where most synthetic data compliance frameworks quietly fail.
Three questions structure the analysis that follows. First, whether the training of generative models on real personal data is itself a processing activity requiring a lawful basis at the input stage. Second, whether the output constitutes personal data and under which framework. Third, what testing regime is adequate to assess that output. Most compliance guidance treats the first question as settled and the second as resolved by the choice of generation technique. Both assumptions are wrong, and the third question is rarely examined with the rigour it demands.
The Input Problem: Processing Before the Synthetic Data Exists
The compliance conversation about synthetic data always begins with the output. Before any synthetic record is generated, however, a generative model must be trained on real personal data, and that training constitutes processing within the meaning of Article 4(2) GDPR and Section 2(x) of the DPDP Act, each of which defines processing broadly as any operation performed on personal data (digital personal data, in the DPDP Act's framing). A lawful basis is needed for the training phase regardless of what the output subsequently looks like.
Under the GDPR, the two most plausible bases are Article 6(1)(f) (legitimate interests) and, for research contexts, Article 89(1), which permits processing for statistical or scientific purposes subject to appropriate safeguards such as anonymisation or pseudonymisation. Article 89(1) is not a blanket exemption. Where training data includes health records, biometric profiles, or data revealing sexual orientation (all special category data under Article 9), neither basis operates without difficulty. Purpose limitation under Article 5(1)(b) compounds the problem: data originally collected for one purpose cannot automatically be repurposed for synthetic data generation in AI development simply because the output will not be shared directly.
To illustrate how this works in practice: a private hospital collects patient records for the purpose of providing medical care. It later contracts an AI company to build a predictive discharge model and decides to generate a synthetic training dataset from those records rather than share live patient data. The hospital's original consent covered treatment and related clinical purposes. Under the compatibility test in Article 6(4) GDPR, which reflects the Article 29 Working Party's purpose-limitation guidance, the question is whether synthetic data generation for AI is sufficiently connected to that original purpose. The relevant factors include the link between the two purposes, the context in which the data was originally collected, the nature of the data, the possible consequences of the secondary processing, and whether appropriate safeguards are in place. Health data fails this analysis at the third factor, the nature of the data: Article 9(2) requires an explicit legal ground for secondary processing of special category data, and AI model development does not appear among those grounds. The hospital cannot rely on the original clinical consent. It needs either explicit patient consent for AI training specifically, or a legal basis under Article 9(2)(j) (scientific research) alongside the Article 89(1) safeguards. Obtaining that consent retrospectively from an existing patient population is operationally impractical in most cases. The synthetic data eventually produced may be technically excellent. The legal exposure existed before a single synthetic record was generated.
The DPDP Act is structurally less accommodating still. Under Section 4, processing must be limited to the purpose for which consent was given. The Act contains no equivalent to the GDPR's legitimate interests basis, and no research or statistical purposes carve-out analogous to Article 89. A Data Fiduciary seeking to train a synthetic generator on customer or patient records must either secure fresh, purpose-specific consent, which existing collection frameworks will almost never cover, or fit the activity within one of the narrow legitimate uses enumerated in Section 7. Section 7 does not obviously accommodate commercial AI development. The practical result is that many organisations training generative models on personal data are doing so without a clearly established lawful basis for the training phase itself.
The Output Problem: When Does Synthetic Data Remain Personal Data?
The GDPR's Anonymisation Standard and the WP29 Framework
The GDPR does not regulate anonymous information. Recital 26 takes anonymised data outside the regulation's scope where individuals are rendered non-identifiable, accounting for all means reasonably likely to be used to reverse that process. In principle, synthetic data that genuinely satisfies this standard escapes the regulation entirely: no lawful basis required for sharing, no data subject rights to honour, no breach notification obligations. The difficulty is that the standard is not satisfied by selecting a generation technique. It requires demonstrating that re-identification is not reasonably feasible against a realistic threat model.
The analytical framework for that demonstration is the Article 29 Working Party's Opinion 05/2014 on Anonymisation Techniques. The Opinion identifies three distinct risks that must all be addressed: singling out (isolating an individual's record in the dataset); linkability (connecting records across datasets); and inference (deducing sensitive attributes from statistical correlations). A technique that eliminates singling-out while permitting attribute inference does not satisfy Recital 26. K-anonymity blocks singling-out but fails the inference test. L-diversity and t-closeness improve on this but remain vulnerable to linkability across external datasets.
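A toy sketch (entirely hypothetical data) makes the inference gap concrete: the table below is 3-anonymous on its quasi-identifiers, so no record can be singled out, yet one equivalence class is homogeneous on the sensitive attribute and the diagnosis can be inferred with certainty.

```python
# Toy illustration: a table can satisfy k-anonymity on its quasi-identifiers
# and still leak a sensitive attribute by inference (the homogeneity problem).
from collections import Counter, defaultdict

# Each row: (age_band, postcode_prefix, diagnosis) -- hypothetical values only.
rows = [
    ("30-39", "5600", "diabetes"),
    ("30-39", "5600", "diabetes"),
    ("30-39", "5600", "diabetes"),
    ("40-49", "5601", "asthma"),
    ("40-49", "5601", "hypertension"),
    ("40-49", "5601", "diabetes"),
]

groups = defaultdict(list)
for age_band, postcode, diagnosis in rows:
    groups[(age_band, postcode)].append(diagnosis)

k = min(len(v) for v in groups.values())
print(f"k-anonymity level: k = {k}")  # k = 3: singling-out is blocked

for quasi_id, diagnoses in groups.items():
    counts = Counter(diagnoses)
    top_share = counts.most_common(1)[0][1] / len(diagnoses)
    if top_share == 1.0:
        # Homogeneity: anyone known to match these quasi-identifiers has this
        # diagnosis with certainty, even though no single record was isolated.
        print(f"{quasi_id}: sensitive attribute inferable with certainty")
    else:
        print(f"{quasi_id}: best inference confidence {top_share:.0%}")
```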
Synthetic data handles these risks differently from classical anonymisation techniques, and not straightforwardly better. Because synthetic records are generated rather than directly derived from any single individual, singling-out risk is theoretically lower. Linkability and inference risks are more persistent. A generative model that faithfully captures high-dimensional statistical correlations (which is precisely the goal, because utility depends on it) reproduces the attribute relationships that inference attacks exploit. The WP29 Opinion was published in 2014 and does not address the risk that a model may memorise individual training records in its weights independently of its output distributions. That is the basis for the membership inference attacks examined below under The Attack Surface.
The Breyer Problem: Whose Assessment of Identifiability Counts?
The CJEU's judgment in Case C-582/14, Breyer v Bundesrepublik Deutschland (October 2016) is frequently cited in discussions of IP addresses. Its broader significance concerns the identifiability assessment itself. The Court held that data constitutes personal data where the controller has legal means enabling re-identification even where the identifying information is held by a third party. Personal data is therefore not defined by what the generating organisation can do with it. The relevant question is what a party with access to available auxiliary data could do.
Consider a fintech company operating in India that generates a synthetic transaction dataset from its customer payment records and concludes through internal assessment that the dataset is not personal data: its own data science team, working only with the synthetic output, cannot identify any customer. The synthetic dataset is then shared with an overseas analytics vendor that has existing commercial relationships with three credit bureaus holding name, address, and transaction history data for approximately 60 million Indian consumers. The vendor cross-references the synthetic transaction patterns against that auxiliary data. The synthetic dataset preserves spending frequency, merchant category distributions, and approximate transaction value ranges. For a subset of records reflecting unusual combinations of merchant categories and value bands, the cross-reference narrows the candidate population to fewer than five individuals per record. Under the Breyer framework, the relevant question is not whether the fintech company could have performed this attribution, but whether the vendor with legal access to the credit bureau data could do so by means that are legally and practically available. The answer is yes. The fintech company's self-assessed identifiability conclusion was accurate within its own data environment. It was the wrong environment to assess.
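A minimal sketch of that cross-reference, using entirely hypothetical records and a hypothetical candidate-set threshold of five, shows how an auxiliary dataset turns preserved statistical detail into attribution:

```python
# Hypothetical sketch of the auxiliary-data cross-reference described above.
# A synthetic record becomes effectively identifying when the attributes it
# preserves narrow the candidate population below a small threshold.

synthetic_records = [
    {"merchant_mix": frozenset({"jewellery", "aviation_fuel"}), "value_band": "500k+"},
    {"merchant_mix": frozenset({"grocery", "fuel"}), "value_band": "0-10k"},
]

# Auxiliary data the vendor can lawfully access (illustrative rows only;
# in the scenario above this runs to tens of millions of consumers).
bureau_records = [
    {"name": "Consumer A", "merchant_mix": frozenset({"jewellery", "aviation_fuel"}), "value_band": "500k+"},
    {"name": "Consumer B", "merchant_mix": frozenset({"grocery", "fuel"}), "value_band": "0-10k"},
    {"name": "Consumer C", "merchant_mix": frozenset({"grocery", "fuel"}), "value_band": "0-10k"},
]

RISK_THRESHOLD = 5  # candidate sets smaller than this are treated as identifying

for i, syn in enumerate(synthetic_records):
    candidates = [b["name"] for b in bureau_records
                  if b["merchant_mix"] == syn["merchant_mix"]
                  and b["value_band"] == syn["value_band"]]
    status = "RE-IDENTIFICATION RISK" if 0 < len(candidates) < RISK_THRESHOLD else "diffuse"
    print(f"synthetic record {i}: {len(candidates)} candidate(s) -> {status}")
```

The point of the sketch is that nothing in it requires the fintech company's own data environment; only the vendor's auxiliary access matters.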
The DPDP Act's Structural Gap
The DPDP Act contains no Recital 26 equivalent. The definition of personal data in Section 2(t), 'any data about an individual who is identifiable by or in relation to such data', is broad, and the Act does not carve out anonymised or synthetic data from its scope. Where the GDPR gives controllers a clear target (satisfy Recital 26 and the regulation does not apply), the DPDP Act offers no equivalent destination. Until the Data Protection Board of India issues guidance on anonymisation thresholds, the only defensible position is to treat synthetic data derived from personal data as retaining personal data status, with the full weight of Section 8 obligations remaining in force, including the reasonable security safeguards requirement under Section 8(5); breach of that requirement attracts penalties of up to Rs. 250 crore. No statutory safe harbour currently supports any other position.
An Indian e-commerce platform intending to share a synthetic customer behaviour dataset freely with logistics partners for route optimisation illustrates the gap clearly. Under the GDPR, that argument has a legal destination: demonstrate that the Recital 26 anonymisation standard is satisfied and the data falls outside the regulation. Under the DPDP Act, there is no equivalent destination. The platform cannot point to a statutory provision confirming that the synthetic dataset falls outside the Act. If any logistics partner can re-identify records using auxiliary data the platform did not possess, the synthetic data may retain personal data status, and the sharing may constitute a transfer without adequate safeguards. The Section 8(6) breach notification duty remains in force. Organisations using synthetic data specifically to reduce their DPDP compliance obligations should seek Board guidance before relying on that position commercially.
The Attack Surface: Where Synthetic Data Fails
Membership Inference
A membership inference attack does not target the synthetic dataset itself. It targets the model that produced the dataset. The question it answers is whether a specific person's data was used to train the model, which makes the model weights privacy-relevant in their own right, not just the model's outputs.
This matters because generative models, and GANs in particular, can memorise unusual records from the training data. If a person's data is highly distinctive, the model is more likely to reproduce it. Even where the synthetic output looks safe, the model itself may still leak information. A healthcare company that trains a GAN on patient records and then releases the model, for example, could expose the training population to membership inference even if the synthetic records pass manual review.
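A minimal sketch of one common black-box heuristic, distance to the closest synthetic record, illustrates the idea. It assumes records are numeric, normalised feature vectors and uses simulated data purely for illustration; real attacks (shadow-model or likelihood-based approaches) are more sophisticated.

```python
# Black-box membership inference heuristic against a released synthetic dataset.
# If a target record sits much closer to the synthetic data than typical
# non-member records do, that is evidence it influenced (or was memorised by)
# the generator. All data below is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
synthetic = rng.normal(size=(5000, 8))   # stand-in for released synthetic records
reference = rng.normal(size=(500, 8))    # records known NOT to be in training
# Stand-in for a training record the generator reproduced nearly verbatim:
target = synthetic[17] + rng.normal(scale=0.01, size=8)

def min_distance(record, dataset):
    """Euclidean distance from one record to its nearest neighbour in dataset."""
    return np.linalg.norm(dataset - record, axis=1).min()

ref_distances = np.array([min_distance(r, synthetic) for r in reference])
target_distance = min_distance(target, synthetic)

# Share of non-members at least as close as the target: a very small value
# means the target is anomalously close, consistent with membership.
p_value = (ref_distances <= target_distance).mean()
print(f"target nearest-synthetic distance: {target_distance:.3f}")
print(f"reference median distance:         {np.median(ref_distances):.3f}")
print(f"share of non-members this close:   {p_value:.3f}")
```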
Linkage Attacks
Synthetic data is often designed to preserve the statistical patterns of real data. That is useful for analytics, but it also creates risk. If an attacker has access to outside datasets such as electoral rolls, insurance records, social media data, or commercial data broker files, they may be able to match synthetic records back to real people.
In other words, synthetic data may still be linkable when combined with external information. The attack does not require insider access; it only requires enough auxiliary data.
The Utility-Privacy Trade-Off
Differential privacy offers the strongest formal privacy guarantee currently available for synthetic data generation. It works by injecting calibrated noise into training or generation, with the strength of the guarantee controlled through the epsilon and delta parameters. Lower epsilon generally means stronger privacy, but also lower utility.
That creates a legal and practical tension. A bank may say its synthetic fraud data is realistic enough to preserve rare fraud patterns, while also saying it is anonymised enough to share freely. Both claims cannot always be true at the same time. If the data preserves rare patterns, it may still be re-identifiable. If strong differential privacy is used to reduce that risk, the most useful patterns may be lost.
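A small sketch of that tension uses the Laplace mechanism on a simple count query (sensitivity 1). The counts are hypothetical, and production systems typically apply differential privacy during model training rather than to individual queries, but the way noise scales with epsilon is the same: small epsilon can swamp rare patterns entirely, while large epsilon preserves them at the cost of the guarantee.

```python
# Hypothetical illustration of the epsilon/utility trade-off with the
# Laplace mechanism on count queries of sensitivity 1.
import numpy as np

rng = np.random.default_rng(42)
rare_pattern_count = 12        # true count of records showing a rare fraud pattern
common_pattern_count = 48_000  # true count of records showing a common pattern

for epsilon in (0.1, 1.0, 10.0):
    scale = 1.0 / epsilon  # Laplace scale for a sensitivity-1 count query
    noisy_rare = rare_pattern_count + rng.laplace(scale=scale)
    noisy_common = common_pattern_count + rng.laplace(scale=scale)
    print(f"epsilon={epsilon:>4}: rare count {rare_pattern_count} -> {noisy_rare:8.1f}, "
          f"common count {common_pattern_count} -> {noisy_common:10.1f}")
```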
What a Defensible Testing Regime Requires
Testing should demonstrate whether the synthetic data leaks information about the real individuals behind it. A report describing the generation method and summary statistics is not enough to discharge GDPR or DPDP accountability obligations.
Membership inference: Test whether an attacker can tell if a record was in the training set. If the advantage over random guessing is near zero, leakage is low.
Nearest-neighbour analysis: Check whether synthetic records sit suspiciously close to real ones. Very close matches may indicate memorisation (see the sketch below).
Linkage simulation: Try linking synthetic records with outside datasets an attacker might have. Use an external-adversary perspective.
Differential privacy: If maximum protection is needed, document the epsilon and delta values, the privacy budget, and the utility trade-off.

The WP29 three-risk assessment requires running a structured analysis of singling-out, linkability, and inference risk for each synthetic dataset in line with Opinion 05/2014. Each risk category demands a separate technical analysis calibrated to the specific dataset and the external data environment, not a general assertion that the generation technique is privacy-preserving.
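The nearest-neighbour check referenced above can be approximated with a distance-to-closest-record comparison. The sketch assumes numeric, normalised feature vectors and uses simulated data as a stand-in for the training, holdout, and synthetic sets.

```python
# Distance-to-closest-record (DCR) memorisation check, on simulated data.
# If synthetic records sit systematically closer to the training data than to
# a holdout set drawn from the same population, memorisation rather than
# generalisation is the likelier explanation.
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(size=(2000, 10))      # real records used to fit the generator
holdout = rng.normal(size=(2000, 10))    # real records never seen by the generator
synthetic = rng.normal(size=(2000, 10))  # stand-in for generator output

def dcr(sources, targets):
    """Distance from each source record to its closest record in targets."""
    return np.array([np.linalg.norm(targets - s, axis=1).min() for s in sources])

dcr_train = dcr(synthetic, train)
dcr_holdout = dcr(synthetic, holdout)

# Ratio near 1 suggests the generator treats seen and unseen records alike;
# a ratio well below 1 suggests synthetic records are copying training records.
ratio = np.median(dcr_train) / np.median(dcr_holdout)
print(f"median DCR to training: {np.median(dcr_train):.3f}")
print(f"median DCR to holdout:  {np.median(dcr_holdout):.3f}")
print(f"ratio (memorisation suspected if well below 1): {ratio:.3f}")
```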
DPIA integration is required under Article 35 GDPR where processing is likely to result in a high risk to individuals, and under the DPDP Act for Significant Data Fiduciaries. Training generative models on health, financial, or biometric data will ordinarily meet that threshold. The DPIA must cover the input phase, the model architecture, and the residual risk profile of the output.
GoTrust: Implementation and Compliance
GoTrust's compliance automation addresses synthetic data governance at both the input and output phases, a distinction most compliance platforms do not make.
At the input stage, GoTrust maps generative model training pipelines against the lawful basis requirements under both frameworks. Where a Data Fiduciary is processing personal data to train a synthesiser, the platform identifies whether the processing falls within the applicable consent or legitimate use frameworks, flags purpose-limitation risks where the secondary use was not covered in the original collection notice, and surfaces special category data implications under Article 9 GDPR for sensitive training datasets. Purpose-limitation analysis at this stage is the highest-value governance step for most organisations and the most consistently omitted. At the output stage, GoTrust integrates the WP29 three-risk assessment framework into its DPIA workflow, generating structured documentation of singling-out, linkability, and inference risk for each synthetic dataset against the Recital 26 standard. For clients operating under the DPDP Act, the platform tracks the absence of a statutory anonymisation safe harbour and flags datasets where residual identifiability risk triggers continued obligations under Section 8(5).
For Significant Data Fiduciaries under Section 10 of the DPDP Act, GoTrust automates the annual DPIA obligation and links it to breach notification workflows under Rule 7 of the DPDP Rules, 2025.
GoTrust additionally maintains audit trails for consent records linked to generative model training, cross-border transfer assessments where synthetic data is shared with third-party vendors or cloud providers, and the epsilon and delta parameters used in differentially private synthesisers, creating a documented record that can be produced to the Data Protection Board or a supervisory authority on request rather than reconstructed after an incident.
Conclusion
Synthetic data resolves genuine problems. Used carefully, it narrows breach exposure, reduces reliance on live personal data in development environments, and creates a credible path toward GDPR anonymisation compliance. The legal case for treating it as a general-purpose compliance shortcut is weaker. Two errors underlie most of the weaker positions in this space. The first is treating the generation method as the answer to the identifiability question rather than the starting point for it. The second is conducting the identifiability assessment from the inside, using the generating organisation's own knowledge of available auxiliary data, rather than from the position of a realistic external adversary. Breyer established that the latter is the legally relevant perspective. The WP29's three-risk framework established what that assessment must cover. Neither was designed around synthetic data, and the field has moved since 2014; both nonetheless remain the authoritative basis on which regulators and courts will evaluate compliance claims.
The input-phase obligations are the most consistently overlooked. An organisation that trains a generative model on personal data without a valid lawful basis, without purpose-limited consent, and without a completed DPIA has incurred regulatory exposure before generating a single synthetic record. Whether the output constitutes personal data is a secondary question and a harder one to answer cleanly than is typically acknowledged.
