How AI companies appear to violate the GDPR

We’re living in the middle of one of the largest tech gold rushes in recent history. OpenAI’s chatbot reached 100 million users in two months. Hoping to keep up, Google introduced its own AI chatbot, and Microsoft added a chatbot to its Bing search engine (with… mixed results).

Much like a real gold rush, the race to perfect AI is built on a haphazard exploitation of resources. Only instead of chopping down forests or damming rivers, these AI companies are exploiting our personal data.

If you’ve ever written a blog post, submitted an online review of a product, or posted on sites like Reddit and Stack Overflow, you likely inadvertently contributed your data to help train an AI model. And if these companies collected the data of anyone living in Europe, they’re likely guilty of violating the GDPR: ChatGPT has already been blocked, albeit temporarily, over privacy concerns.

This shaky start to regulating powerful language models shows that the future of AI policy has not yet been written. Once again, Big Tech companies stand to profit in the billions off of your personal data without your consent. Will regulators and consumers once again go along with it?

Italy’s ChatGPT block is just the beginning

On March 31, 2023, Italy’s data protection agency (DPA), the Garante, issued a stop-processing order against ChatGPT, which led to OpenAI geoblocking potential users with an Italian IP address. Two weeks later, the Garante issued a list of demands that OpenAI would have to meet to resume service within the country.

The list included several privacy protections:

Age-gating ChatGPT so minors cannot access it
Providing a more thorough explanation of what data is processed to train ChatGPT’s algorithms
Enabling people to opt out of such processing

On April 28, 2023, after OpenAI implemented these measures, the Garante lifted its ban. In an expanded help center article, OpenAI claims it’s using legitimate interest (as defined by the GDPR) as the legal basis for collecting and processing data to train its algorithms.

While ChatGPT is no longer banned, the Garante’s order may have just been the first salvo. France, Germany, and Ireland’s DPAs have communicated with the Garante and are contemplating their own investigations. Spain’s DPA has announced its own investigation. And the EU’s European Data Protection Board announced it will launch a ChatGPT task force.

Is it legal to scrape the internet to train AI?

In the previously mentioned help center article, OpenAI clarified that it did use information scraped from the internet to train ChatGPT. The fact that it was initially unclear where this data came from implies that OpenAI collected all this data without the express permission of the data subjects.

The French DPA has issued guidance in the past stating that even if an individual shares their contact information publicly, it still qualifies as personal information and cannot be freely used or processed by a company without the person’s knowledge. Assuming that DPAs are willing to treat other types of personal information like contact information, OpenAI’s web scraping seems to be a clear violation of the GDPR, given that it doesn’t fulfill any of the other requirements of Article 6 of the GDPR.

Since it’s also likely that OpenAI gathered all of these datasets en masse without any explicitly defined use case, it would also appear to be a clear violation of the principle of data minimization as laid out in Article 5(1)(c) of the GDPR.

Given the way AI models are structured, there’s no reliable way to enforce the GDPR’s ‘right to be forgotten’ on data that has been scraped from the web, which would be a clear violation of Article 17 of the GDPR. OpenAI appears to have introduced a mechanism that allows people to prevent ChatGPT from storing their prompts and using them to train the algorithm, but the data these companies scraped to train their AI in the first place will be much harder to disentangle.

Finally, there’s the fact that OpenAI is an American company. Since Schrems II, a court decision that requires cloud providers to verify the data protections of countries before they transfer data there, the EU has (correctly) taken a critical stance on the privacy protections of the US. OpenAI must prove it has implemented adequate safeguards before it can transfer the data of individuals living in Europe to the US without their express permission. Otherwise, it would be in violation of Article 46 of the GDPR.

OpenAI’s privacy policy speaks briefly about data transfers, saying only that it will “use appropriate safeguards for transferring Personal Information outside of the EEA, Switzerland, and the UK”.

This is simply scratching the surface. These are all the likely GDPR violations committed just in the creation and publication of the AI models.

In its help center article, OpenAI claims that since training AI requires massive amounts of data, it had no choice but to scrape the internet. It also says the information was already publicly available and that it had no intention of negatively impacting individuals, and it emphasizes that it doesn’t use individuals’ data to build personal profiles, contact or advertise to people, or sell any products. Unfortunately for OpenAI, none of these points are justifications for data processing under the GDPR.

AI companies’ exposure has increased even more now that third-party companies are applying ChatGPT to various functions, like helping with customer service calls. Unless people’s data is properly anonymized or they expressly consent to speak with an AI chatbot, these third-party companies will also be committing GDPR violations.

It’s also worth pointing out that the GDPR wasn’t written to deal with AI. Even though these appear to be clear GDPR violations, the way AI works somewhat scrambles the distinction between data subjects, data controllers, and data processors. We won’t have clarity on these issues until DPAs and the courts make their rulings.

Google’s unusual privacy policy

Google isn’t new to artificial intelligence, having pioneered “neural networks” with Google Translate and innovations in understanding the intent behind people’s searches. It has even developed its own large language model, LaMDA.

What is new is Google’s privacy policy, which was recently updated to grant the company broad authority to scrape the entire internet.

In a July 2023 update, Google added a small line to its privacy policy under the “Business purposes for which information may be used or disclosed” section: “Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

The “publicly available information” wording mirrors OpenAI’s description of the data it uses to train its AI products. It tells us very little about the precise kinds of data used to train their models. The implication is that all data is fair game.

What is genuinely different about Google’s privacy policy is that it seems to be directed at the global population, not just people who use Google services. Not even OpenAI’s privacy policy includes a clause like this.

It will be difficult for Google to argue that it obtained the consent of EU citizens before processing their data when its only indication that it would do so is contained in a tiny “for example” directed at no one in particular.

Copyright law and companies might come for AI next

ChatGPT and other AI services are facing scrutiny from businesses as well as public regulators. JPMorgan Chase, Amazon, and Samsung have restricted the use of AI tools, while sites like Reddit, Stack Overflow, and Getty Images have demanded compensation from AI companies or sued them. JPMorgan Chase told its employees not to use ChatGPT for fear that sharing sensitive client information with the chatbot could violate financial regulations.

Amazon and Samsung are worried their proprietary data could be used to train ChatGPT. As one of Amazon’s lawyers said in the company Slack, “This is important because your inputs may be used as training data for a further iteration of ChatGPT, and we wouldn’t want its output to include or resemble our confidential information (and I’ve already seen instances where its output closely matches existing material).” Samsung implemented its ban after it discovered its developers had already uploaded sensitive code to ChatGPT.

Getty Images has gone the furthest and, in February 2023, filed a lawsuit in the UK accusing Stability AI, the company behind the AI art tool Stable Diffusion, of violating copyright law. Getty Images claims that Stability AI “unlawfully copied and processed” millions of its stock photo images that are protected by copyright. It doesn’t help that Getty Images watermarks are relatively common in Stable Diffusion images.

Stability AI made the dataset it used to train its algorithm publicly available. This has allowed independent experts to examine the data and conclude that it contains a substantial number of images from Getty. Nor is Stable Diffusion the only AI tool facing accusations of copyright violations or plagiarism.

https://twitter.com/erockappel/status/1652786155665096704

Similarly, Reddit and Stack Overflow have said they’ll begin charging AI companies for access to their APIs. “Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with,” said Reddit’s CEO, Steve Huffman, to The New York Times.

This is precisely why many other AI companies, including OpenAI, have been much more cagey about the data they use — they fear full transparency could lead to even more regulatory and copyright woes.

“So why aren't the big AI companies more transparent about what's in the data that they use to train their models?

One reason, experts say, is because they're afraid they'd get in trouble if people found out.”
Will Oremus (@WillOremus), April 19, 2023

AI companies haven’t earned our trust

While it remains an open question what will happen to ChatGPT, Stable Diffusion, Dall-E, and other AI tools, this has all happened before.

Before OpenAI, there was Clearview AI. This facial surveillance company trained its AI with millions of photos it scraped off of social media without anyone’s consent. It has since fought numerous cease-and-desist orders and continues to operate thanks to the US’s poor legal privacy protections.

Following this model, AI companies have forged ahead, creating a mix of data that is nearly impossible to untangle. AI companies are still following the outdated and dangerous “move fast and break things” approach, but taking it to another level.

The GDPR may not have been written with AI in mind, but it’s still the strongest data protection legislation so far. Fortunately, the EU is now working on a proposal for its Artificial Intelligence Act. If all goes to plan, the final proposal should be available in June 2023, and enforcement of the law could begin as early as late 2024.

AI has the potential to be a truly revolutionary development, one that could drive advancement for centuries. But it must be done correctly. These companies stand to make billions of dollars in revenue, and yet they appear to have violated our privacy and are training their tools using our data without our permission. Recent history shows we must act now if we’re to avoid an even worse version of surveillance capitalism.

Updated on July 13, 2023 to discuss Google’s update to its privacy policy.
