Apr 3, 2026

The Legal Cases Shaping How AI Labs Should Think About Data Licensing

Sona Poghosyan

Four copyright lawsuits filed in the past two years have put the biggest names in AI on the wrong end of federal complaints. OpenAI, Anthropic, Midjourney, Perplexity. Each case comes at the same dispute from a different angle: when an AI company uses someone else's work to build a product, what does it owe the person who made it?


The answers will determine what it costs to build AI, what data you can legally train on, and whether the licensing deals you sign today hold up in court tomorrow. If you're building an AI system and deciding what to train it on, this is the landscape you're in.

What Data Licensing Actually Means

When someone creates something — a novel, a news article, a dictionary entry, a film character — they own the right to decide how it gets copied and used. If a company wanted to use that work commercially, they'd traditionally pay for a license. A streaming platform licenses movies. A music app licenses songs.


When you build a large language model or image generator, the process happens in stages. During pretraining, the model learns general patterns from massive datasets by analyzing text or images and learning statistical relationships between words, concepts, and visual elements.


The model builds mathematical representations called parameters that encode patterns from the training data, adjusting these values millions of times as it learns to predict what comes next based on patterns across the entire dataset. Fine-tuning then follows, where you take that pretrained model and train it further on more specialized data to improve performance on specific tasks.
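To make "learning statistical relationships" concrete, here is a toy sketch in pure Python. It is not how a real language model works (real models adjust millions of continuous parameters via gradient descent), but a bigram count model captures the same core idea: the "parameters" are statistics about which word tends to follow which, learned from the training text, and prediction means picking the most likely next word. All names and the tiny corpus below are illustrative.

```python
from collections import defaultdict

def train_bigram_model(corpus):
    """Learn P(next_word | word) from raw text. A toy stand-in for the
    statistical relationships a pretrained model encodes in its parameters."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Normalize the counts into probabilities, the model's "parameters".
    model = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        model[prev] = {w: c / total for w, c in nexts.items()}
    return model

def predict_next(model, word):
    """Predict the most likely next word, i.e. next-token prediction."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

corpus = [
    "the model learns patterns",
    "the model predicts words",
    "the model learns statistics",
]
model = train_bigram_model(corpus)
print(predict_next(model, "model"))  # prints "learns"
```

The licensing question maps directly onto this sketch: the statistics in `model` are derived entirely from `corpus`, so whoever controls the corpus has a claim on what went into the parameters, even though no sentence is stored verbatim.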


Data licensing matters because companies can build these models using data they own, data in the public domain, or data licensed from creators and rights holders. When you license data, you're paying for the right to use that content during these training stages, and the terms of that license determine what you can and cannot do with the model afterward.


One concept worth understanding before the cases below: retrieval-augmented generation, or RAG. Many AI products use RAG to stay accurate. Instead of relying on training data alone, the model pulls live web content to shape each answer. The legal problem is that RAG means copying and serving someone else's content every time a user asks something. That's not learning from a source. It's reprinting it on demand, and as the Perplexity case shows, courts are treating it very differently from training.
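A minimal sketch of the RAG pattern makes the legal distinction visible. The retrieval scoring below (simple word overlap) and the documents are illustrative stand-ins, not any real system's implementation; what matters is the second step, where the retrieved text is copied verbatim into the model's context.

```python
def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query. A toy stand-in
    for the retrieval step in a RAG pipeline."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Copy the retrieved text into the model's context verbatim.
    This copying step is what distinguishes RAG from training."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "plagiarize means to steal and pass off the ideas or words of another as one's own",
    "encyclopedias compile articles on many subjects",
]
prompt = build_prompt("define plagiarize", docs)
```

Note that `build_prompt` reproduces the source text in full on every query. Training happens once and leaves only statistics behind; RAG copies the original each time a user asks, which is why courts are treating the two so differently.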

The New York Times v. OpenAI and Microsoft (Ongoing)

In December 2023, the Times sued OpenAI and Microsoft. The claim: ChatGPT and Bing Chat had been trained on Times articles without permission, and the models could reproduce that content in ways that undercut the Times' own subscribers. Why pay for a subscription if a chatbot gives you the article for free?


The complaint included concrete examples of ChatGPT spitting out paywalled articles almost word for word. It also documented cases where the models hallucinated content and falsely attributed it to the Times — a separate harm, since readers might believe the Times wrote something it never did.


Copyright law has a concept called fair use, which can excuse copying if certain conditions are met. One of those conditions is whether the copying hurts the market for the original work. The Times came prepared on that point: they had examples of verbatim output, a subscription model that the chatbot was clearly threatening, and a paper trail of licensing deals they'd struck with other tech companies. 


The part of this case that matters most for AI companies is about what the model can be made to produce. Most disputes focus on whether ingesting data during training is infringement. The Times raised a harder question: even if training is fine, what about a model that can be prompted to reproduce the original article on demand? Courts are likely to treat that as a different kind of harm entirely.

Disney and Universal v. Midjourney

In June 2025, Disney and Universal filed a 110-page complaint against Midjourney, the AI image generator. The complaint ran dozens of side-by-side comparisons: Midjourney outputs next to the copyrighted characters they closely resembled. Elsa from Frozen. Characters from Star Wars. Shrek.


Midjourney says its tool creates new images. Disney says it copies them. Type "Princess Elsa singing in front of an ice castle, Frozen animated movie" and you get back something indistinguishable from the character Disney owns. The plaintiffs describe that as reproduction offered as a feature. Midjourney calls it creative generation. That's what the case will decide.


The complaint also cited a 2022 interview with Midjourney's CEO David Holz, in which he said the company collected all the data it could find and never sought permission from rights holders. He floated the idea of an opt-out system but said they hadn't built one. Courts can treat statements like that as evidence that the infringement was deliberate rather than accidental. 

Encyclopedia Britannica and Merriam-Webster v. Perplexity AI

Filed in September 2025, this case is about a different kind of product. Perplexity is an answer engine. Instead of returning a list of links, it reads the web and gives you a summary. Britannica and Merriam-Webster say that summary is their content, repackaged, with no reason left for anyone to visit the original.


The complaint breaks the alleged infringement into three stages. Perplexity's crawler scrapes the plaintiffs' websites. Its RAG system then copies full articles to generate answers. Those answers can reproduce the source material nearly verbatim. The examples are hard to argue with: asked to define "plagiarize," Perplexity returned the exact Merriam-Webster definition, word for word.


The complaint also raises a trademark claim. Trademark law protects a brand's reputation and what it signals to the public. When Perplexity tells users the answer comes from Britannica, those users expect Britannica's accuracy and authority. When the model hallucinates details or silently drops paragraphs, it damages what the Britannica name stands for, without Britannica's consent.


The U.S. Copyright Office said in a recent report that RAG is less likely to qualify as fair use than training because it's not transformative. By the time this suit was filed, Perplexity had already licensed content from Time, Fortune, and Wiley. The complaint turned that into an admission: if you licensed from some publishers, you knew there was a market for this.

Bartz v. Anthropic

A group of authors sued Anthropic over the books used to train its Claude models. Nobody disputed that Anthropic had copied the books. The question was whether that copying was protected by fair use.


The picture was complicated. Anthropic had downloaded millions of books from pirated sources, including Books3, LibGen, and the Pirate Library Mirror. It had also purchased and physically scanned millions of print books it legally owned. The authors argued both were infringing.


Judge William Alsup, ruling in June 2025, split on those two groups. For the legitimately purchased books, he found fair use. His reasoning was that training a language model on a book is not the same as reproducing it. The goal is to build a system that generates new writing, not to copy the originals. 


For the pirated books, he went the other way. Internal communications cited in the complaint showed that Anthropic turned to pirated libraries to avoid the time and cost of proper licensing. The company downloaded over seven million books it never paid for, kept them long after training was done, and gave hundreds of engineers access to the archive. Alsup found no version of fair use that could cover that.


After the ruling, Anthropic agreed to a $1.5 billion settlement with the author class, which received preliminary court approval.

What These Cases Are Actually Teaching You

The companies that come through this period in good shape are the ones making deliberate, documented choices about what they train on and what their models can produce. Not the ones hoping fair use covers everything.


The Bartz ruling, alongside a similar ruling in a case against Meta, suggests that training on lawfully acquired data is defensible as fair use — at least when the model's outputs don't reproduce the source material.


Pirated data is a different situation entirely. No court has excused it. Bartz said directly that a transformative downstream use doesn't retroactively justify stealing the inputs. If your training set includes Books3, LibGen, or similar sources, fair use won't protect you.


Instagram
Twitter
Facebook
Linkedin

© 2026 WIRESTOCK INC. ALL RIGHTS RESERVED