Artists and writers are up in arms about generative artificial intelligence systems, and understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers including OpenAI, Meta and Stability AI now face multiple lawsuits over this practice. Such legal claims are supported by independent analyses; in August, for instance, the Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.
And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.
Where do AI training data come from?
To build large generative AI models, developers turn to the public-facing Internet. But “there’s no one place where you can go download the Internet,” says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington. Instead developers amass their training sets through automated tools that catalog and extract data from the Internet. Web “crawlers” travel from link to link indexing the location of information in a database, while Web “scrapers” download and extract that same information.
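To make the crawler and scraper distinction concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any company's actual tooling: the class name PageParser, the function parse_page, the sample page and the example.com addresses are all hypothetical, and the code reads a hard-coded HTML string rather than fetching anything from the Web. Collecting the links on a page is the crawler's job; pulling out the page's text is the scraper's.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Hypothetical helper: gathers links (crawler role) and text (scraper role)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # where a crawler would go next
        self.text_parts = []  # what a scraper would keep

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> as an absolute URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep any non-empty run of text found between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# A made-up page standing in for a personal blog.
SAMPLE_HTML = """<html><body>
<h1>My blog</h1>
<p>A post about painting.</p>
<a href="/about.html">About</a>
<a href="https://example.org/gallery">Gallery</a>
</body></html>"""


def parse_page(base_url, html):
    """Return (links to follow, extracted text) for one page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


if __name__ == "__main__":
    links, text = parse_page("https://example.com/blog/", SAMPLE_HTML)
    print(links)
    print(text)
```

A real crawler would repeat this step for each link it finds, which is how a single starting page can lead to a very large collection of text.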
A very well-resourced company, such as Google’s owner, Alphabet, which already builds Web crawlers to power its search engine, can opt to employ its own tools for the task, says machine learning researcher Jesse Dodge of the nonprofit Allen Institute for AI. Other companies, however, turn to existing resources such as Common Crawl, which helped feed OpenAI’s GPT-3, or databases such as the Large-Scale Artificial Intelligence Open Network (LAION), which contains links to images and their accompanying captions. Neither Common Crawl nor LAION responded to requests for comment. Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.
Web crawlers and scrapers can easily access data from just about anywhere that’s not behind a login page. Social media profiles set to private aren’t included. But data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might still be vacuumed up, Dodge says. Then, he adds, “there’s the kinds of things that absolutely end up in these Web scrapes”—including blogs, personal webpages and company sites. This includes anything on popular photograph-sharing site Flickr, online marketplaces, voter registration databases, government webpages, Wikipedia, Reddit, research repositories, news outlets and academic institutions. Plus, there are pirated content compilations and Web archives, which often contain data that have since been removed from their original location on the Web. And scraped databases do not go away. “If there was text scraped from a public website in 2018, that’s forever going to be available, whether [the site or post has] been taken down or not,” Dodge notes.
Some data crawlers and scrapers are even able to get past paywalls (including Scientific American’s) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago. “You’d be surprised at how far these crawlers and model trainers are willing to go for more data,” Zhao says. Paywalled news sites were among the top data sources included in Google’s C4 database (used to train Google’s LLM T5 and Meta’s LLaMA), according to a joint analysis by the Washington Post and the Allen Institute.
Web scrapers can also hoover up surprising kinds of personal information of unclear origins. Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of herself was included in the LAION database. Reporting from Ars Technica confirmed the artist’s account and that the same data set contained medical record photographs of thousands of other people as well. It’s impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common. Information not intended for the public Internet ends up there all the time.
In addition to data from these Web scrapes, AI companies might purposefully incorporate other sources—including their own internal data—into their model training. OpenAI fine-tunes its models based on user interactions with its chatbots. Meta has said its latest AI was partially trained on public Facebook and Instagram posts. According to Elon Musk, the social media platform X (formerly known as Twitter) plans to do the same with its own users’ content. Amazon, too, says it will use voice data from customers’ Alexa conversations to train its new LLM.
But beyond these acknowledgements, companies have become increasingly cagey about revealing details on their data sets in recent months. Though Meta offered a general data breakdown in its technical paper on the first version of LLaMA, the release of LLaMA 2 a few months later included far less information. Google, too, didn’t specify the data sources for its recently released PaLM 2 AI model, beyond saying that much more data were used to train PaLM 2 than to train the original version of PaLM. OpenAI wrote that it would not disclose any details on its training data set or method for GPT-4, citing competition as a chief concern.
Why are dodgy training data a problem?
AI models can regurgitate the same material that was used to train them—including sensitive personal data and copyrighted work. Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions. For creative workers, even when AI outputs don’t exactly qualify as plagiarism, Zhao says they can eat into paid opportunities by, for example, aping a specific artist’s unique visual techniques. But without transparency about data sources, it’s difficult to blame such outputs on the AI’s training; after all, it could be coincidentally “hallucinating” the problematic material.
A lack of transparency about training data also raises serious issues related to data bias, says Meredith Broussard, a data journalist who researches artificial intelligence at New York University. “We all know there is wonderful stuff on the Internet, and there is extremely toxic material on the Internet,” she says. Data sets such as Common Crawl, for instance, include white supremacist websites and hate speech. Even less extreme sources of data contain content that promotes stereotypes. Plus, there’s a lot of pornography online. As a result, Broussard points out, AI image generators tend to produce sexualized images of women. “It’s bias in, bias out,” she says.
Bender echoes this concern and points out that the bias goes even deeper—down to who can post content to the Internet in the first place. “That is going to skew wealthy, skew Western, skew towards certain age groups, and so on,” she says. Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds. This means data scraped from the Internet fail to represent the full diversity of the real world. It’s hard to understand the value and appropriate application of a technology so steeped in skewed information, Bender says, especially if companies aren’t forthright about potential sources of bias.
How can you protect your data from AI?
Unfortunately, there are currently very few options for meaningfully keeping data out of the maws of AI models. Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have only been able to test its efficacy with a subset of AI image generators, and its uses are limited. For one thing, it can only protect images that haven’t previously been posted online. Anything else may have already been vacuumed up into Web scrapes and training data sets. As for text, no similar tool exists.
Website owners can insert digital flags telling Web crawlers and scrapers to not collect site data, Zhao says. It’s up to the scraper developer, however, to opt to abide by these notices.
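The article does not name the mechanism, but the most common such flag is a robots.txt file placed at the root of a website. The sketch below assumes that is what is meant. It uses Python's standard urllib.robotparser module to show how a well-behaved crawler would read the file; the crawler name "ExampleAIBot", the function is_allowed and the example.com address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# "ExampleAIBot" is an invented crawler name, not a real product.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit this crawler to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/portfolio/painting.html"
    # The named crawler is asked to stay away from the whole site.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))
    # Any other crawler falls under the wildcard rule and may fetch the page.
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Note that the check happens inside the crawler's own code. Nothing on the website enforces it, which is why compliance is up to the scraper developer.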
In California and a handful of other states, recently passed digital privacy laws give consumers the right to request that companies delete their data. In the European Union, too, people have the right to data deletion. So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can’t be proven—or by ignoring the requests altogether—says Jennifer King, a privacy and data researcher at Stanford University.
Even if companies respect such requests and remove your information from a training set, there’s no clear strategy for getting an AI model to unlearn what it has previously absorbed, Zhao says. To truly pull all the copyrighted or potentially sensitive information out of these AI models, one would have to effectively retrain the AI from scratch, which can cost up to tens of millions of dollars, Dodge says.
Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions—and that means they have no incentive to go back to the drawing board.