AI & GDPR: When Does Model Training Cross the Line?
Can AI models be GDPR-compliant if trained on scraped web data? What if they retain personal information, even unintentionally? As AI development accelerates, data protection authorities are issuing guidance that directly impacts how organizations build, deploy, and maintain AI systems.
The EDPB’s Landmark Opinion on AI Models
The European Data Protection Board (EDPB) Opinion 28/2024, adopted on December 17, 2024, brings critical clarity to AI developers and deployers. This guidance specifically addresses data protection aspects in the context of AI models, particularly regarding:
- When AI models can be considered anonymous
- When legitimate interest can serve as a valid legal basis
- How unlawful initial training data affects downstream deployment
Key Insights from the EDPB Opinion
1. Not All AI Models Are Equal Under GDPR
The EDPB distinguishes between two categories of AI models:
- Explicit Data-Providing Models: Models specifically designed to provide personal data about individuals whose data was used for training; these always fall within the GDPR's scope.
- Implicit Data-Embedding Models: These aren’t intentionally designed to produce personal data, but personal data from training may still be embedded in the model, potentially extractable through targeted prompts.
For the second category, a case-by-case assessment is necessary to determine if the model can be considered “anonymous” and thus outside GDPR’s scope.
2. “Anonymous” Has a High Bar for AI Models
For an AI model to be considered truly anonymous, a supervisory authority must be satisfied that, taking into account all means reasonably likely to be used, the likelihood is insignificant that:
- Personal data from the training set can be extracted directly from the model
- Such personal data can be obtained, intentionally or not, through queries to the model
This assessment must consider numerous factors, including:
- Characteristics of the training data, the AI model, and the training procedure
- Costs and time required to obtain such information
- Context in which the AI model is released or processed
Practical implication: Simply claiming your model is “anonymous” is insufficient. Documentation of risk assessment and mitigation measures is crucial.
3. Documentation Is Essential
The EDPB provides a non-exhaustive list of documentation that can help demonstrate compliance:
- Information on Data Protection Impact Assessments (DPIAs)
- Advice or feedback from the Data Protection Officer
- Technical and organizational measures taken during AI model design
- Documentation demonstrating the AI model's theoretical resistance to re-identification techniques (a simple extraction probe is sketched after this list)
- Information provided to deploying controllers and data subjects
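To make the re-identification point concrete, here is a minimal sketch of one way to probe a generative model for verbatim memorization of training records. The function name and the `generate` callable are illustrative assumptions, not an EDPB-prescribed method; real assessments combine several attack types (membership inference, attribute inference, extraction) and document the results.

```python
from typing import Callable

def extraction_probe(
    generate: Callable[[str], str],    # stand-in for any text-generation call
    training_records: list[str],
    prefix_len: int = 50,
    match_len: int = 30,
) -> float:
    """Fraction of probed records whose continuation the model reproduces
    verbatim -- one signal (not proof) of training-data memorization."""
    probed, leaked = 0, 0
    for record in training_records:
        if len(record) < prefix_len + match_len:
            continue  # too short to split into prefix + expected continuation
        prefix = record[:prefix_len]
        expected = record[prefix_len:prefix_len + match_len]
        if expected in generate(prefix):
            leaked += 1
        probed += 1
    return leaked / probed if probed else 0.0
```

A low leak rate on such probes is supporting evidence, not proof, of anonymity; it belongs in the compliance file alongside the DPIA.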
Legitimate Interest Assessment for AI
The EDPB acknowledges that legitimate interest can be an appropriate legal basis for processing personal data for AI development and deployment—but only following a rigorous three-step assessment:
Step 1: Identifying Legitimate Interests
An interest can be considered legitimate if it is:
- Lawful
- Clearly articulated
- Real and present (not speculative)
Step 2: Necessity Test
This involves evaluating:
- Whether the processing effectively pursues the purpose
- Whether the volume of data processed is proportionate
- Whether less intrusive alternatives exist
Step 3: Balancing Test
The controller’s interests must not override the data subject’s fundamental rights and freedoms. Factors to consider include:
- The impact of processing on data subjects (e.g., sensitive information, vulnerable subjects)
- The reasonable expectations of data subjects
- Volume of data processed
Mitigating Measures for Different AI Phases
The EDPB outlines several mitigation measures that can shift the balancing test in favor of legitimate interest:
Development Phase
- Pseudonymization
- Data masking (e.g., replacing real names with fake ones; see the sketch below)
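As a concrete illustration of these two measures, the following sketch pseudonymizes e-mail addresses with a keyed hash before text enters a training corpus. The regex, key handling, and naming are simplifying assumptions; a production pipeline would cover many more identifier types (names, phone numbers, IDs) and keep the key in a secrets manager.

```python
import hashlib
import re

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # placeholder secret
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def pseudonymize(text: str) -> str:
    """Replace e-mail addresses with stable keyed pseudonyms."""
    def replace(match: re.Match) -> str:
        digest = hashlib.blake2b(
            match.group().encode(), key=SECRET_KEY, digest_size=6
        ).hexdigest()
        return f"<user_{digest}>"
    return EMAIL_RE.sub(replace, text)

print(pseudonymize("Contact jane.doe@example.com for details."))
# -> "Contact <user_...> for details."
```

Using a keyed hash rather than random replacement keeps the pseudonym stable across documents while remaining linkable only for whoever holds the key, which is what distinguishes pseudonymization from full anonymization under the GDPR.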
Web Scraping Context
- Not collecting certain data categories
- Excluding publications and websites whose content poses particular risks to individuals (a simple collection-time filter is sketched below)
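A collection-time filter along these lines might look as follows. The blocklisted domains and sensitive-content patterns are placeholders; in practice they would come out of the DPIA for the scraping operation.

```python
import re
from urllib.parse import urlparse

# Placeholder blocklists -- real ones would be derived from a DPIA.
BLOCKED_DOMAINS = {"forum.example.org", "health-community.example.com"}
SENSITIVE_PATTERNS = [
    re.compile(r"\b(diagnos(is|ed)|HIV|religion|trade union)\b", re.I),
]

def should_collect(url: str, text: str) -> bool:
    """Decide at collection time whether a scraped page may enter the corpus."""
    if urlparse(url).hostname in BLOCKED_DOMAINS:
        return False  # excluded source category
    if any(p.search(text) for p in SENSITIVE_PATTERNS):
        return False  # likely special-category content (GDPR Art. 9)
    return True
```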
Deployment Phase
- Technical measures preventing the storage or output of personal data
- Implementing output filters on model responses (see the sketch below)
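As an illustration of the second measure, the sketch below redacts common personal-data patterns from a model's response before it reaches the user. Regular expressions only catch obvious formats; production filters typically layer named-entity recognition and policy checks on top.

```python
import re

# Illustrative patterns for common personal-data formats.
FILTERS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def filter_output(text: str) -> str:
    """Redact recognizable personal-data patterns from a model response."""
    for label, pattern in FILTERS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(filter_output("Reach me at +49 30 1234567 or jane@example.com"))
# -> "Reach me at [PHONE REDACTED] or [EMAIL REDACTED]"
```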
When Training Violation Affects Deployment Legality
A critical consideration: how does an initial GDPR violation during AI model development affect its future use? The EDPB identifies three scenarios:
Scenario 1: Same Controller
If a controller unlawfully processes personal data to develop an AI model and retains that data, the unlawfulness may affect subsequent processing. Each case requires individual assessment.
Scenario 2: Different Controller
If a different controller deploys a model that was developed with unlawfully processed personal data, the lawfulness of its own processing may be affected. That controller should therefore conduct due diligence on the model's development process.
Scenario 3: Subsequent Anonymization
If the model is genuinely anonymized before deployment, the GDPR no longer applies. However, this requires robust verification that no personal data remains accessible.
Practical Recommendations for Organizations
For AI Developers
- Document extensively: Create and maintain records of all measures taken to ensure GDPR compliance throughout the AI development lifecycle.
- Design with anonymization in mind: Implement techniques (e.g., differentially private training) that minimize retention of personal data within models; a minimal sketch follows this list.
- Be transparent: Provide comprehensive information to deployers about how the model was developed and what data it may contain.
- Establish clear roles: Define responsibilities between developers and deployers regarding data protection.
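One widely studied technique for the "design with anonymization in mind" point, offered here as an illustration rather than an EDPB requirement, is differentially private training: each example's gradient is clipped and calibrated noise is added so that no single record can dominate what the model learns. The sketch below shows the core DP-SGD update for a toy logistic regression; hyperparameters are placeholders, and a real system would use an audited library with a privacy accountant rather than hand-rolled code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD update: clip per-example gradients, add Gaussian noise."""
    grads = []
    for xi, yi in zip(X, y):                      # per-example gradients
        pred = 1 / (1 + np.exp(-xi @ w))          # logistic prediction
        g = (pred - yi) * xi
        g *= min(1.0, clip / (np.linalg.norm(g) + 1e-12))  # bound influence
        grads.append(g)
    noise = rng.normal(0, noise_mult * clip, size=w.shape)  # calibrated noise
    return w - lr * (np.sum(grads, axis=0) + noise) / len(X)

# Toy usage on synthetic data.
X = rng.normal(size=(32, 5))
y = rng.integers(0, 2, size=32)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y)
```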
For AI Deployers
- Conduct due diligence: Request information about the development process before entering into contracts.
- Establish contractual safeguards: Ensure agreements address data protection responsibilities.
- Perform a DPIA: Assess the specific risks of deploying the AI system in your context.
- Implement additional safeguards: Consider deployer-specific measures to further protect personal data.
The Interplay Between AI Act and GDPR
The European AI regulatory landscape is becoming increasingly complex, with the AI Act (in force since August 2024, with obligations applying in stages) operating in parallel with the GDPR. Key considerations include:
- The AI Act does not replace or supersede GDPR but works alongside it
- For training high-risk AI systems, the AI Act introduces a specific legal basis for processing sensitive data to detect and correct bias (Article 10(5))
- Both the GDPR and the AI Act may require impact assessments for certain AI applications: a Data Protection Impact Assessment (DPIA) under the GDPR and a Fundamental Rights Impact Assessment (FRIA) under the AI Act
Conclusion
AI developers and deployers face significant challenges in navigating GDPR compliance. The EDPB Opinion provides crucial guidance, but also underscores the need for:
- Robust documentation
- Case-by-case risk assessment
- Clear definition of roles and responsibilities
- Appropriate technical and organizational safeguards
As AI technologies continue to evolve, the legal landscape will undoubtedly follow. Organizations developing or deploying AI systems must stay informed of regulatory developments and proactively implement compliance measures.
Need help ensuring your AI systems comply with GDPR and the AI Act? Contact our privacy experts for customized guidance and practical compliance tools.