Do you agree with the analysis presented by the ICO, in particular that legitimate interest can be relied on for the creation of training generative AI data sets?
Short answer
No. We believe that relying on legitimate interest to justify GenAI processing activities is wrong and risks devaluing this lawful basis, making it increasingly less precise and reliable. The very nature of GenAI is that its use cases are vast and unpredictable. GenAI can be deployed across an extremely broad range of applications, and it is simply not possible to assess with accuracy the risks it poses to the rights and freedoms of data subjects.
More detailed explanation and analysis
The ICO’s analysis attempts to fit a square peg into a round hole. The legitimate interest test was not created with something as radical as generative artificial intelligence (“GenAI”) in mind. The ICO’s conclusion is that training GenAI models on data sets created from web-scraped data can be justified as a legitimate interest if developers take their obligations seriously and can evidence them in practice. This basic premise is flawed because of the myriad complex legal obligations in this area, combined with the rapid real-world pace of change in AI technology, governance and law.
A more granular analysis of our objections is as follows:
- Absence of Privacy By Design: It is important that GenAI and other new technologies are developed with privacy built in at the design phase. Privacy concerns cannot be meaningfully addressed at that stage when it is impossible to know what data is and is not being collected, or how that data will ultimately be used by third parties or the public.
- Unreliable Web-Based Foundation Data: Data collected from web-scraping is inherently unreliable. Using web-based data for pre-processing and training models will involve the use of personal data which contains errors or other inaccuracies. It may also involve the unlawful collection of information where copyright (or other proprietary rights) are alleged to have been infringed, or contract terms breached, in addition to data protection law concerns.
- Not In Reasonable Expectation of Data Subjects: We do not believe that web-scraping data to create GenAI training data has been or will ever be in the reasonable expectation of the vast majority of UK data subjects, given the lack of relationship between them and the GenAI developers and system deployers.
- Wider Societal Interests Is Too Vague: We disagree that a developer could rely on wider societal interests to pass limbs 1 and 3 of the legitimate interest test. GenAI, especially mass-market, publicly available large language models, is not comparable to other, less invasive processing that carries a substantial public interest. This is compounded by the real-world problem of being unable to monitor, or take action regarding, third-party use of GenAI.
- Special Category Personal Data Not Considered Sufficiently: We do not believe the ICO has given adequate weight in the consultation to the risks and harms created by the processing of special category personal data in GenAI models. Web-scraping and pre-processing will, by their nature, result in the collection and use of special category personal data, even where technological parameters are set in the scraping regime to try to minimise that outcome. Special category personal data is a deliberately elastic concept; its presence in a data set can depend on issues as vague and imprecise as context, setting and tone. The protections enshrined in law give data subjects an expectation of control over their special category personal data, and of freedom from interference with it, except in specific prescribed scenarios such as those in Article 9 UK GDPR and Schedule 1 DPA.
Proposed solutions
We propose the following solutions for discussion by the ICO. If you think it would be useful, we would be happy to participate in those discussions with the ICO and to elaborate on the foundation of our thinking:
Solution 1: Change the Legitimate Interest Guidance: There are several ways in which the current ICO guidance on legitimate interests restricts the use of GenAI. That guidance, and the guidance on the use of AI, could be revised to help accommodate GenAI within the existing legitimate interest lawful basis. One example would be to change the guidance so that GenAI developers do not need to identify a specific benefit that they can evidence (something which will more often than not be extremely difficult to do to a high standard).
Solution 2: Amend the Legitimate Interest in Law: This approach would keep legitimate interest as the identified basis for GenAI, but would modify how it works, in a manner similar to that set out above but on a statutory footing. The benefit is that it would give greater gravity and formality to the change and would (arguably) fit within HM Government’s stated objective of encouraging AI innovation rather than regulation. There is also a precedent for this in the current draft of the Data Protection & Digital Information No.2 Bill (“DPDI Bill”). The drawback of this solution is that it would require a political process to be completed (more on that below) although, if initiated quickly, such a process could possibly be absorbed into the current DPDI Bill.
Solution 3: Create A New Lawful Basis for GenAI: No existing lawful basis is adequate for GenAI. Legitimate interest is, we agree, the one with the greatest potential to achieve the task, but that does not overcome the conceptual and practical hurdles set out above. A new lawful basis could be tailored to something as revolutionary as GenAI, which is of benefit to society as a whole. There are several benefits to this solution, the main ones being that it would guarantee an outcome in which current GenAI practices are compliant with a (new) law and would avoid confusing the application of legitimate interest. The downsides are also plentiful: this solution would require political will and the passing of new law, which is hard and time-consuming, and it is unlikely to gain momentum in an election year. It is also vulnerable to the theoretical challenge of undue accommodation: if you change the law to make this fit, will you change the law every time a new technology comes along? Finally, the mechanics and drafting of a new law are likely to create something which may appear, overall, very much like the legitimate interest test without limb 3.