Do Machine Learning Models Store Protected Content?


From ChatGPT to Stable Diffusion, Artificial Intelligence (AI) is having a summer the likes of which rival only the AI heydays of the 1970s. This jubilation, however, has not gone unchallenged. From Hollywood to the Louvre, AI seems to have awoken a sleeping giant – a giant keen to protect a world that once seemed exclusively human: creativity.

For those desiring to protect creativity, AI appears to have an Achilles heel: training data. Indeed, all of the best models today necessitate a high-quality, world-encompassing data diet – but what does that mean?

First, high-quality means human-created. Although synthetic (non-human-created) data has made many strides since War Games popularized the idea of a computer playing itself, the computer science literature has shown that model quality degrades over time if humans are taken out of the loop entirely (i.e., model rot or model collapse). In simple terms: human data is the lifeblood of these models.

Second, world-encompassing means world-encompassing. If you put it online, you should assume the model has used it in training: that Myspace post you were hoping only you and Tom remembered (ingested), that picture-encased-memory you gladly forgot about until PimEyes forced you to remember it (ingested), and those late-night Reddit tirades you hoped were just a dream (ingested).

Models like LLaMA, BERT, Stable Diffusion, Claude, and ChatGPT were all trained on massive amounts of human-created data. And what's unique about many, if not most, human-created expressions – especially those fixed in a tangible medium a computer can access and learn from – is that they qualify for copyright protection.

Anderson v. Stability AI; Concord Music Group, Inc. v. Anthropic PBC; Doe v. GitHub, Inc.; Getty Images v. Stability AI; {Tremblay, Silverman, Chabon} v. OpenAI; New York Times v. Microsoft

Fortuitous as it may be, the data these models cannot survive without is the same data most protected by copyright. And this gives rise to the titanic copyright battles we are seeing today.

Of the many questions arising in these lawsuits, one of the most pressing is whether the models themselves store protected content. The answer might seem obvious: how can models – mere collections of numbers (i.e., weights) arranged in an architecture – "store" anything? As Professor Murray states:

Many of the participants in the current debate on visual generative AI systems have latched onto the idea that generative AI systems have been trained on datasets and foundation models that contained actual copyrighted image files, .jpgs, .gifs, .png files and the like, scraped from the internet, that somehow the dataset or foundation model must have made and stored copies of these works, and somehow the generative AI system further selected and copied individual images out of that dataset, and somehow the system copied and incorporated significant copyrightable parts of individual images into the final generated images that are offered to the end-user. This is magical thinking.

Michael D. Murray, 26 SMU Science and Technology Law Review 259, 281 (2023)

And yet, models themselves do seem, in some circumstances, to memorize training data.

The following toy example is from a Gradio Space on HuggingFace that lets users pick a model, see a generated output, and check how similar that output is to any image in the model's training data. MNIST digits were used because they are easy for the machine to parse, easy for humans to compare visually, and easy to classify – which lets the similarity search consider only training images of the same digit (an efficiency gain).
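To make the mechanics concrete, here is a minimal sketch of that same-class similarity search. This is not the Space's actual code; the function names (rmse, closest_training_image) are illustrative, and it assumes MNIST images are 28x28 arrays scaled to [0, 1]:

```python
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root Mean Squared Error between two same-sized images scaled to [0, 1]."""
    return float(np.sqrt(np.mean((a.astype(float) - b.astype(float)) ** 2)))

def closest_training_image(generated: np.ndarray,
                           digit_class: int,
                           train_images: np.ndarray,
                           train_labels: np.ndarray) -> tuple[int, float]:
    """Search only same-class training images; return (index, RMSE) of the closest match."""
    candidates = np.where(train_labels == digit_class)[0]  # the "efficiency gain": same digit only
    scores = np.array([rmse(generated, train_images[i]) for i in candidates])
    best = int(np.argmin(scores))
    return int(candidates[best]), float(scores[best])
```

The lower the returned RMSE, the closer the generated digit is to something the model actually saw during training.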

Let's see how it works!

The following image has a similarity score of .00039. RMSE stands for Root Mean Squared Error, a standard way of assessing how similar two images are. Many other similarity measures exist, but RMSE gives a good sense of whether an image is a near duplicate (we are not hunting for a legal definition of similarity here). As a rule of thumb, an RMSE below .006 puts an image in the nearly-a-copy range, and an RMSE below .0009 enters perfect-copy territory (indistinguishable to the naked eye).
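Read as code, that rule of thumb might look like the following. This is illustrative only – the cutoffs are the article's informal bands, not a legal test:

```python
def duplication_band(rmse_score: float) -> str:
    """Map an RMSE score (images scaled to [0, 1]) to the informal bands above."""
    if rmse_score < 0.0009:
        return "perfect-copy territory (indistinguishable to the naked eye)"
    if rmse_score < 0.006:
        return "nearly a copy"
    return "not an obvious duplicate"

print(duplication_band(0.00039))  # the example above lands in perfect-copy territory
```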

Tags: Artificial Intelligence, Copyright Law, LLM, Machine Learning
