In 2024, many organisations have been eager to look at how they can use the data they hold to debut or build on their artificial intelligence (AI) programme. Many are looking to use that data to train AI models, or fine-tune third-party models. That data will almost certainly include personal data, meaning that the processing must comply with the General Data Protection Regulation (GDPR) in the EU or UK.
X, formerly Twitter, recently looked to train its AI chatbot, Grok, with data from users’ posts. It has now agreed to suspend training following exchanges with Ireland’s Data Protection Commission. This article looks at recent events and points for controllers to consider when using personal data to train AI.
Training Grok
On 7 May 2024, X activated a new setting under ‘Settings and privacy’ on training for Grok, its AI chatbot. The setting was on by default, and users could toggle it off to opt out. Users were presented with a choice over whether to “Allow your posts as well as your interactions, inputs, and results with Grok to be used for training and fine-tuning”.
Details of how to opt out are included on X’s website, which also flags that users can opt out by making their accounts private (i.e. only public posts will be used for training and fine-tuning Grok).
X also flags in its privacy notice that “We may use the information we collect and publicly available information to help train our machine learning or artificial intelligence models for the purposes outlined in this policy.” It appears to rely on legitimate interests (Article 6(1)(f)) to process data for training its models and provides information on its legitimate interests assessment on its website.
The Irish Data Protection Commission (DPC) made an application to the Irish High Court to require X to suspend processing. The case was heard on 8 August. X’s Global Government Affairs department has highlighted that the DPC’s order would have applied to all its AI models, not just Grok.
The DPC announced that X has agreed to suspend its processing of personal data contained in the public posts of X’s EU and EEA users. The agreement appears to have been reached without any order being made by the High Court, and a further hearing is scheduled for 4 September.
Lawyer Marco Scialdone filed an initial complaint with the DPC on behalf of Euroconsumers and Altroconsumo, which was followed by nine complaints by pressure group noyb. noyb’s complaints, on behalf of local data subjects, were submitted to the DPC as well as data protection authorities (DPAs) in Austria, Belgium, France, Greece, Italy, the Netherlands, Spain, and Poland. They alleged that X had breached a range of GDPR provisions (“at least” Articles 5(1) and (2), 6(1), 9(1), 12(1) and (2), 13(1) and (2), 17(1)(c), 18(1)(d), 19, 21(1) and 25), and called for DPAs to take provisional measures to restrict X’s processing and request a binding decision from the European Data Protection Board under Article 66 GDPR.
Potential issues raised by an EU ban
As X highlighted, a ban on using personal data from posts to train its AI models would have wide-reaching consequences for its platform. Grok is only one of its models. X says that the ban could impact “our work to keep the platform safe and possibly the ability to offer X in the EU”; this may refer to applications for content moderation and recommendations, for example.
A wide-reaching ban framed in relation to training AI models relying on legitimate interests could also impact the many other controllers seeking to rely on legitimate interests to train their AI models.
A permanent ban might also affect the availability of generative AI tools that are well equipped to serve the needs of EEA users. In May, Meta faced similar issues, following notifications to users that it would use public content shared by adults on Facebook and Instagram to train Meta AI. Following a request from the DPC, it announced plans to delay this training. Both Meta and X have flagged that, if EU and EEA data is not used to train their models, their models will be ill-equipped to respond to local nuances in interactions with users. Their competitors have already used publicly available data from EU and EEA users.
Points for controllers to consider
noyb’s complaints include a number of views that are not universally held, such as that processing for training generative AI models can only be based on consent. They do, however, set out an extremely comprehensive list of GDPR issues around training AI, which may be informative for controllers. Below, we set out some of the points discussed, with our commentary on constructive points for controllers to consider.
Lawful basis: Legitimate interests is likely to be the only practicable lawful basis to explore in many scenarios where AI models are being trained.
noyb’s complaints suggest that X does not have a legitimate interest in its processing. However, DPAs have suggested that web scraping on the basis of legitimate interests might be possible (which, analogously, might suggest that X’s processing was permissible). Both the French Commission Nationale de l’Informatique et des Libertés (CNIL) and the European Data Protection Board (EDPB) have suggested reliance on legitimate interests might be possible (though the Netherlands Autoriteit Persoonsgegevens takes the view that legitimate interests cannot be relied upon).
Article 6(1)(f) GDPR nevertheless sets out a three-part test. First, the controller’s purpose must be a legitimate interest; second, the processing must be necessary for that purpose; and third, those interests must not be overridden by data subjects’ interests or fundamental rights and freedoms (discussed in Meta Platforms Inc. and others v Bundeskartellamt ECLI:EU:C:2023:537 Case C‑252/21, as well as e.g. the Information Commissioner’s Office’s (ICO’s) guidance). General-purpose AI models may face issues around articulating a specific purpose for the processing, why that purpose is a legitimate interest, and why the processing is necessary for that purpose. The ICO has discussed this in its call for evidence and welcomes views on addressing this.
For controllers seeking to fine-tune or train models that support activities that are already set out in their privacy notice, articulating a specific purpose is likely to be less challenging.
Transparency: controllers should look to be as specific as possible when setting out the purposes for processing to satisfy the lawfulness, fairness, and transparency principle under Article 5(1)(a), as well as the specific transparency requirements under Articles 12 and 13.
Controllers should look to update privacy notices with specific information on why they are processing data to train AI and the lawful basis relied on. Generally, the principle of fairness will require additional and more prominent information to be provided for more novel and unexpected processing.
Data subject rights: Article 13 GDPR requires that data subjects be informed of their rights, including the right to object (which applies where processing is based on legitimate interests). In some cases, to satisfy the fairness principle, it may be appropriate to provide a specific notification to data subjects and give them the opportunity to object before processing is carried out. A controller is not under an absolute obligation to stop processing when a data subject exercises the right to object under Article 21(1). However, it can only continue to process the data where it can demonstrate compelling legitimate grounds which override the interests, rights and freedoms of the data subject, or where the processing is for the establishment, exercise or defence of legal claims.
This may be particularly important for training or fine-tuning large language models (LLMs), which pose technical challenges around rectifying and deleting data once models have been trained. Where personal data is incorporated into LLMs, for example, it cannot currently be truly erased or rectified, though output filters can be applied. The ICO has welcomed views on how requests are being responded to in practice. Meanwhile, the Hamburg DPA has suggested that LLMs do not store personal data, so data subject rights like the right to rectification, access, and erasure only need to be complied with for LLMs’ inputs and outputs. It should be noted that the Hamburg DPA emphasises that its discussion paper is intended to stimulate further debate, and does not suggest that it provides guidance on compliance.
Data minimisation, storage limitation, and privacy by design: When training AI models, any steps that can be taken for data minimisation, such as filtering out personal data, will also be important for satisfying privacy by design obligations. Careful consideration of which data must be used will also be helpful to demonstrate that the data subject’s rights and freedoms do not outweigh the controller’s legitimate interests. The EDPB has also emphasised that safeguards can assist in meeting the balancing test (see paragraph 17).
When embarking on an AI project, it is also worthwhile to review retention policies and ensure that data is being deleted when it is no longer needed. This will be beneficial from an organisational perspective, ensuring that the AI model is trained on current data, as well as from a GDPR compliance perspective.
Special category data: Special category data, such as health data or data about sexual orientation, can only be processed where a condition under Article 9 GDPR is satisfied. This is challenging when web scraping (or using social media posts) to train an LLM, as it may be impossible to distinguish special category data from other data.
The European Data Protection Board examines the “manifestly made public” condition under Article 9(2)(e) and whether it might apply (see paragraph 18), though it stresses that the data subject must have intended, explicitly and by clear affirmative action, for the data to be made public. The CNIL has also referred to the “manifestly made public” condition (though without confirming that it would apply to web scraping). It advocates for the use of filters, though it suggests that incidental collection of special category data, where technical measures are used to avoid its collection, would not necessarily be unlawful, provided the data can be removed later where required.
For AI projects where data sets are selected specifically at the outset, controllers should ensure that special category data is not processed or that an Article 9 condition is satisfied.
Our take
Training AI models, and particularly LLMs, can pose some unique challenges for GDPR compliance. These challenges are particularly acute for those carrying out the initial training of LLMs and other generative AI models. However, fine-tuning an LLM provided by another organisation, or indeed other AI projects, will also pose data protection challenges due to the complexity and novelty of the technologies involved. As emphasised by DPAs, such as the ICO and CNIL, it will be necessary to consider fair processing notices, lawful grounds and how data subject rights will be satisfied, and to carry out a data protection impact assessment (DPIA). The results of such exercises may be somewhat inconclusive until definitive enforcement or clearer guidance emerges.
AI projects should always be carried out within the framework of a robust AI governance programme assessing risks to the organisation, individuals, and society more broadly. The risks to data subjects’ rights and freedoms must be assessed alongside other risks such as discrimination, intellectual property (both in terms of ownership and infringement), and any sector-specific laws such as financial services regulation, as well as assessing whether obligations arise under the EU AI Act.