AI & GDPR: When Does Model Training Cross the Line?

Can AI models be GDPR-compliant if trained on scraped web data? What if they retain personal information — even unintentionally? As AI development accelerates, data protection authorities are issuing guidance that directly impacts how organizations build, deploy, and maintain AI systems.

The EDPB’s Landmark Opinion on AI Models

The European Data Protection Board (EDPB) Opinion 28/2024, adopted on December 17, 2024, brings critical clarity to AI developers and deployers. This guidance specifically addresses data protection aspects in the context of AI models, particularly regarding:

  • When AI models can be considered anonymous
  • When legitimate interest can serve as a valid legal basis
  • How unlawful initial training data affects downstream deployment

Key Insights from the EDPB Opinion

1. Not All AI Models Are Equal Under GDPR

The EDPB distinguishes between two categories of AI models:

  • Explicit Data-Providing Models: These are specifically designed to provide personal data about individuals whose data was used for training. These always fall under GDPR’s scope.
  • Implicit Data-Embedding Models: These aren’t intentionally designed to produce personal data, but personal data from training may still be embedded in the model, potentially extractable through targeted prompts.

For the second category, a case-by-case assessment is necessary to determine if the model can be considered “anonymous” and thus outside GDPR’s scope.

2. “Anonymous” Has a High Bar for AI Models

For an AI model to be considered truly anonymous, supervisory authorities must have sufficient evidence that, taking into account all means reasonably likely to be used, the likelihood is insignificant that:

  • Personal data from the training set can be extracted from the model, directly or probabilistically
  • Outputs produced when querying the model relate to the individuals whose data was used for training

This assessment must consider numerous factors, including:

  • Characteristics of the training data, the AI model, and the training procedure
  • Costs and time required to obtain such information
  • Context in which the AI model is released or processed

Practical implication: Simply claiming your model is “anonymous” is insufficient. Documentation of risk assessment and mitigation measures is crucial.
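
One way to produce such evidence is to probe the model for verbatim memorization of known training records. The sketch below is a minimal illustration of that idea, assuming a hypothetical generate(prompt) function that wraps your model; a real audit would also need to cover probabilistic extraction and membership-inference attacks.

```python
# Minimal memorization probe (illustrative only).
# Assumes a hypothetical generate(prompt: str) -> str wrapper around the model.

def memorization_rate(samples: list[str], generate, prefix_len: int = 40) -> float:
    """Fraction of training samples whose suffix the model completes verbatim."""
    tested = hits = 0
    for text in samples:
        if len(text) <= prefix_len:
            continue  # too short to split into prefix and suffix
        tested += 1
        prefix, suffix = text[:prefix_len], text[prefix_len:].strip()
        if suffix and suffix in generate(prefix):
            hits += 1
    return hits / tested if tested else 0.0

# Usage, with records from the training set known to contain personal data:
# rate = memorization_rate(["Jane Doe, born 1984, lives at ...", ...], generate)
# A non-negligible rate is evidence against treating the model as anonymous.
```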

3. Documentation Is Essential

The EDPB provides a non-exhaustive list of documentation that can help demonstrate compliance (a minimal record structure along these lines is sketched after the list):

  • Information on Data Protection Impact Assessments (DPIAs)
  • Advice or feedback from the Data Protection Officer
  • Technical and organizational measures taken during AI model design
  • Documentation demonstrating the AI model’s theoretical resistance to re-identification techniques
  • Information provided to deploying controllers and data subjects
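
As a practical aid, this evidence can be kept as a structured, versioned record next to each model release. The sketch below shows one possible shape using only the Python standard library; the field names are ours, not prescribed by the EDPB.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class ModelComplianceRecord:
    """Illustrative compliance dossier for one model version (field names are ours)."""
    model_name: str
    model_version: str
    dpia_reference: str                                               # ID of the relevant DPIA
    dpo_advice: str                                                   # summary of DPO feedback
    design_measures: list[str] = field(default_factory=list)          # TOMs taken during design
    reidentification_tests: list[str] = field(default_factory=list)   # extraction-test reports
    deployer_notices: list[str] = field(default_factory=list)         # info given to deployers/data subjects
    last_reviewed: str = field(default_factory=lambda: date.today().isoformat())

record = ModelComplianceRecord(
    model_name="support-assistant",
    model_version="1.3.0",
    dpia_reference="DPIA-2025-014",
    dpo_advice="Approved, subject to pseudonymization of the training corpus",
)
print(json.dumps(asdict(record), indent=2))
```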

Legitimate Interest Assessment for AI

The EDPB acknowledges that legitimate interest can be an appropriate legal basis for processing personal data for AI development and deployment—but only following a rigorous three-step assessment:

Step 1: Identifying Legitimate Interests

An interest can be considered legitimate if it is:

  • Lawful
  • Clearly articulated
  • Real and present (not speculative)

Step 2: Necessity Test

This involves evaluating:

  • Whether the processing effectively pursues the purpose
  • Whether the volume of data processed is proportionate
  • If there are less intrusive alternatives

Step 3: Balancing Test

The controller’s interests must not override the data subject’s fundamental rights and freedoms. Factors to consider include:

  • The impact of processing on data subjects (e.g., sensitive information, vulnerable subjects)
  • The reasonable expectations of data subjects
  • Volume of data processed

Mitigating Measures for Different AI Phases

The EDPB outlines several mitigation measures that can shift the balancing test in favor of legitimate interest:

Development Phase

  • Pseudonymization
  • Data masking (e.g., replacing real names with fake ones; see the sketch below)
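
A minimal sketch of both techniques, using only the standard library, is shown below: keyed hashing for pseudonymization and a naive pattern-based replacement for masking. Production pipelines would typically rely on an NER-based PII detector and managed key storage instead.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"replace-with-a-separately-managed-secret"  # must be stored apart from the data

def pseudonymize(value: str) -> str:
    """Keyed hash: yields a stable token, linkable to the original only via the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")  # naive two-word name matcher

def mask_names(text: str) -> str:
    """Data masking: replace matched names with a fixed placeholder."""
    return NAME_PATTERN.sub("Alex Example", text)

print(pseudonymize("jane.doe@example.com"))               # stable 16-hex-char token
print(mask_names("Contact Jane Doe about the invoice."))  # "Contact Alex Example about the invoice."
```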

Web Scraping Context

  • Not collecting certain data categories
  • Excluding publications and content that pose risks to individuals (see the filter sketch below)
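
A collection-time filter along these lines might look like the sketch below; the excluded domains and patterns are purely illustrative and would have to come from your own risk assessment.

```python
import re
from urllib.parse import urlparse

# Illustrative exclusion rules; real lists come from your own risk assessment.
EXCLUDED_DOMAINS = {"forum.example-health.org", "members.example-dating.com"}
RISKY_PATTERNS = [
    re.compile(r"\b\d{2}[./-]\d{2}[./-]\d{4}\b"),     # date-of-birth-like strings
    re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),  # IBAN-like strings
]

def keep_for_training(url: str, text: str) -> bool:
    """Drop excluded sources and risky content before they enter the corpus."""
    if urlparse(url).netloc in EXCLUDED_DOMAINS:
        return False
    return not any(p.search(text) for p in RISKY_PATTERNS)

print(keep_for_training("https://news.example.com/a", "Quarterly results are out."))  # True
print(keep_for_training("https://forum.example-health.org/t/1", "..."))               # False
```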

Deployment Phase

  • Preventing storage or generation of personal data
  • Implementing output filters (see the sketch below)
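
A minimal output filter might redact PII-like spans before a response is returned, as sketched below; real deployments typically combine such patterns with NER-based detection and policy checks.

```python
import re

# Illustrative PII patterns; production filters usually add NER-based detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def filter_output(generation: str) -> str:
    """Redact PII-like spans from a model response before returning it."""
    for label, pattern in PII_PATTERNS.items():
        generation = pattern.sub(f"[{label} REDACTED]", generation)
    return generation

print(filter_output("Reach me at jane.doe@example.com or +31 6 1234 5678."))
# -> "Reach me at [EMAIL REDACTED] or [PHONE REDACTED]."
```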

When Training Violation Affects Deployment Legality

A critical consideration: how does an initial GDPR violation during AI model development affect its future use? The EDPB identifies three scenarios:

Scenario 1: Same Controller

If a controller unlawfully processes personal data to develop an AI model and retains that data, the unlawfulness may affect subsequent processing. Each case requires individual assessment.

Scenario 2: Different Controller

If another controller deploys a model developed with unlawfully processed data, the lawfulness of the deployer's own processing may also be affected. The second controller should therefore conduct due diligence on the model's development process as part of its accountability obligations.

Scenario 3: Subsequent Anonymization

If the model is genuinely anonymized before deployment, the GDPR no longer applies. However, this requires robust verification that no personal data remains accessible.

Practical Recommendations for Organizations

For AI Developers

  1. Document extensively: Create and maintain records of all measures taken to ensure GDPR compliance throughout the AI development lifecycle.
  2. Design with anonymization in mind: Implement techniques to minimize retention of personal data within models.
  3. Be transparent: Provide comprehensive information to deployers about how the model was developed and what data it may contain.
  4. Establish clear roles: Define responsibilities between developers and deployers regarding data protection.

For AI Deployers

  1. Conduct due diligence: Request information about the development process before entering into contracts.
  2. Establish contractual safeguards: Ensure agreements address data protection responsibilities.
  3. Perform a DPIA: Assess the specific risks of deploying the AI system in your context.
  4. Implement additional safeguards: Consider deployer-specific measures to further protect personal data.

The Interplay Between AI Act and GDPR

The European AI regulatory landscape is becoming increasingly complex, with the AI Act (in force since August 2024, with obligations phasing in through 2027) operating in parallel with the GDPR. Key considerations include:

  • The AI Act does not replace or supersede GDPR but works alongside it
  • For training high-risk AI systems, the AI Act introduces a specific legal basis for processing sensitive data to detect and correct bias (Article 10(5))
  • Both the GDPR and AI Act may require impact assessments for certain AI applications (a DPIA under the GDPR and a Fundamental Rights Impact Assessment, FRIA, under the AI Act)

Conclusion

AI developers and deployers face significant challenges in navigating GDPR compliance. The EDPB Opinion provides crucial guidance, but also underscores the need for:

  • Robust documentation
  • Case-by-case risk assessment
  • Clear definition of roles and responsibilities
  • Appropriate technical and organizational safeguards

As AI technologies continue to evolve, the legal landscape will undoubtedly follow. Organizations developing or deploying AI systems must stay informed of regulatory developments and proactively implement compliance measures.


Need help ensuring your AI systems comply with GDPR and the AI Act? Contact our privacy experts for customized guidance and practical compliance tools.
