One of the biggest questions right now is whether using copyrighted work to train machine learning models constitutes fair use. I think, by the definition set under common law, it does. But it shouldn’t. Let me explain.
The implication of fair use under the existing model
The concept of fair use assumes time and effort. If I want to make a video explaining AI, I first have to go and learn about AI. Maybe I do so by watching someone else’s video. Eventually, I can make my own. Whilst I may essentially just be rehashing someone else’s video, I am putting in time researching, filming, and editing, as well as potentially adding value with my own unique opinions. All of this constrains how quickly I can rehash other people’s content.
Now, suppose I have a machine that can do all this for me. It’s already harvested every video on the internet and analyzed them. All I need to do is specify the kind of video I want it to make, and within seconds it spits out a final product. The instant someone posts a new video, I can use my machine to produce a similar video in a way that constitutes fair use. Essentially, I can rapidly replicate any original content published online with little to no effort. This is what I’m going to refer to as copyright laundering.
Copyright laundering
Copyright laundering is not a new phenomenon. YouTube went through a huge trend with react videos. People would simply re-upload already-viral videos, add a split screen where they’d recorded their face watching the video, then claim it was fair use under the ‘additional commentary’ clause. Before YouTube eventually cracked down, it was far more profitable to steal other people’s videos via this copyright loophole than to create your own.
Generative AI provides a fast and easy means to launder copyright, since it’s built upon scraping data from the open internet. Anyone can effortlessly produce content that is not subject to the copyright of the sources it was derived from. Now, this might seem like a natural evolution in technology: AI makes creating content faster, so everyone creates content with AI. But there’s a catch.
The need for original content and creator incentive
Generative AIs cannot create original content. They are only able to create derivatives of the works they were trained on. Most of this training data comes from the open internet, where platforms and creators are funded by ad-revenue.
AI search as a revenue-circumventing proxy for content
Let’s assume everyone progresses to using AI. Large Language Models (LLMs) like ChatGPT, Bing, or Bard replace search engines. Every time I need to know something, I just ask the LLM. Every time I need to create something, I just ask the LLM. Even if we assume the LLM cites the original source, which most don’t, there is no reason for me to follow that citation. If the LLM can answer all my questions, and interpret information in any way I desire, I need not visit the source website, watch the source video, or read the source book. So then, where is the incentive for the original creators? The ones who created the training data these LLMs need to exist?
When an LLM visits a website and scrapes its content, it doesn’t view ads, it doesn’t buy products, it doesn’t give recognition, and it doesn’t donate to creators. LLMs are a net negative for creators. They steal content, use up valuable bandwidth, and leave nothing in return.
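To make that concrete, here is a minimal sketch of what content scraping looks like from the website’s side (the URL is a placeholder): one raw HTTP request, no rendering, no ad impressions, no human eyes.

```python
import urllib.request

# Fetch one page the way a scraper does: a single request for raw HTML.
# The JavaScript that would render and count ads is never executed.
url = "https://example.com/"  # placeholder, not a real content site
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

print(f"{len(html)} bytes fetched; zero ads rendered; zero revenue paid")
```

Everything a human visitor does to fund a site, from loading ad scripts to clicking through to products, happens after this step, and a scraper never takes that step.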
Creation without financial incentive
One could argue that in a utopia where everyone’s needs are met, where everyone has food, water, healthcare, and shelter, people would still create. Be it out of interest, curiosity, or a desire for recognition. But this is not the current reality. We live in a world where one must work to survive, and LLMs don’t consider content creation to be work. If it’s not possible to make money creating original content, then fewer and fewer people will be able to afford to do so.
Essentially, we’re building a system of first-mover disadvantage: one where you spend time, money, and skill to create original works, only to lose out on credit and profit as your work is quietly taken and fed into corporate AIs.
A continuing trend
Pre-AI, we were already seeing many news websites move behind paywalls. Bloggers and artists began following suit, moving to subscription-based platforms like Patreon and Medium. The internet was already tipping in favor of small but loyal subscriber bases over far-reaching open platforms funded by ad revenue. The advancement of LLMs is likely to hyper-accelerate this. Paywalling will serve not just as the only remaining means of revenue, but as a gate to keep AIs out.
Since LLMs require access to original works for training, search, and testing, forcing creators to paywall content undermines the models’ very existence. The AI snake is eating its own tail.
Model Collapse
ML models require high-quality training data. The output is only as good as the input, a phenomenon known in engineering as ‘garbage in, garbage out’. Many AIs produce output significantly below the average quality of their input, hereafter referred to as ‘garbage’. When high-quality data moves behind paywalls and AI-generated garbage takes over, the models will undergo regression to the mean. The garbage in produces even more garbage garbage out, with the garbage garbage out feeding back in to create garbage garbage garbage out… You get the idea.
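This feedback loop is easy to demonstrate with a toy simulation. In the sketch below, all parameters are illustrative and a simple Gaussian stands in for the model: each generation trains only on samples produced by the previous generation’s model, so sampling error compounds.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "real" human-made data from the original distribution.
data = rng.normal(loc=0.0, scale=1.0, size=500)

for generation in range(1, 11):
    # Fit a model to the current data; a Gaussian stands in for an LLM.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on the previous model's output,
    # i.e. AI-generated content has replaced the original data.
    data = rng.normal(loc=mu, scale=sigma, size=500)
    print(f"gen {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```

With only a finite sample at every step, each refit loses a little information: the mean drifts, the tails disappear first, and the distribution tends to narrow, which is the same flavor of degradation described in the model collapse literature.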
Regression may have already begun
Research shows that some models have already begun to degrade, though it is unclear if this is due to model collapse or some other phenomenon. Since the output deterioration coincides with the rush to connect LLMs to the internet, I would suspect it’s at least in part due to the introduction of AI generated content into the training data.
A while back, I tried writing an article with ChatGPT for research purposes. It was extremely hard to get something that looked even remotely natural. The LLM would simply take the core point of the article, then rephrase it 15 different ways throughout the text. It was borderline unreadable. Now, knowing what to look for, I’ve begun to notice a lot of very obviously ChatGPT-generated articles online. If it’s already struggling with decent input data, imagine how it’s going to look when it starts learning from these posts.
Not a self-correcting problem
It may be tempting to assume the problem will simply solve itself. The models will collapse under their own weight, then we’ll just go back to the way things were. Unfortunately, I don’t believe this to be the case. Reinforcement Learning from Human Feedback, or RLHF, is a training technique that allows an LLM’s output to be adjusted based on human feedback. Users can, for example, vote on whether an output is correct, or help choose a better alternative.
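As a rough illustration of the kind of signal this produces (the names and update rule here are a toy sketch, not any vendor’s actual implementation), pairwise votes can be turned into scores with a simple Bradley-Terry-style update:

```python
import math

# Toy reward scores for two candidate responses (names are illustrative).
scores = {"response_a": 0.0, "response_b": 0.0}

def record_vote(preferred: str, rejected: str, lr: float = 0.1) -> None:
    """Nudge scores after a user picks one response over another."""
    # Probability the current scores assign to the observed preference.
    p = 1.0 / (1.0 + math.exp(scores[rejected] - scores[preferred]))
    # Shift both scores toward the user's choice.
    scores[preferred] += lr * (1.0 - p)
    scores[rejected] -= lr * (1.0 - p)

# Fifty users consistently prefer response_a over response_b.
for _ in range(50):
    record_vote("response_a", "response_b")

print(scores)  # response_a's score rises; response_b's falls
```

The point is that this signal comes entirely from a steady stream of human raters, which is why user-base size matters so much in what follows.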
RLHF may allow existing models trained on the pre-AI internet to resist model collapse; however, they still require a large user-base to achieve this. In such a case, existing models with large user-bases could coast by on RLHF, smaller models may collapse, and new models would have neither the data to train on (due to paywalls and AI generated content), nor the user-base to vet it. There is a distinct risk that we are moving toward a monopoly where large AI companies serve as the gatekeepers of information.
In conclusion
There are many potential problems posed by generative AI, regardless of how we interpret copyright law. I don’t know that any royalty-based system or redefinition of what constitutes fair use is practical, nor likely to change the direction in which we’re heading.
Although the free and open internet faces an increasing threat, reach and influence will always be desirable, whatever the cost. You and I may not have the incentive or funding to maintain blogs read only by machines, but corporations and governments looking to peddle disinformation have plenty of both.
More regulation and better antitrust laws could help, but resisting change altogether has never worked out well. Ultimately, as many have suggested, this is just another reason why we need better social safety nets. Not only must we prepare for monopolies cornering the market, but also for them cornering access to information itself.