Storytellers are Developers Pt. 2: Beyond the Scrape – The Future of AI Training Data

Explore the growing challenges facing AI companies as they struggle to find enough quality data to train their models. From copyright battles to questioning whether bigger datasets equal better performance, this episode reveals why the future of AI might depend on solving the training data crisis.


Transcript

AI Training Data Outgrows the Internet

This show examines how Large Language Models (LLMs) are trained and what that means for storytellers and AI technology companies. What is LLM training? What are the emerging challenges around the data used to train AI? What should storytellers do to protect their work and potentially profit from licensing content for LLM training? Let’s start with what large language models are. Think of an LLM as the engine of a car: it powers a generative AI service like ChatGPT or Midjourney. An LLM’s job is to predict the next likely “token”, which can be a letter, a word, a pixel, or any other piece of digital content. The predictions are based on the model’s understanding of trillions of tokens in its training corpus and their statistical relationships.
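
To make “predict the next likely token” concrete, here is a minimal sketch of the idea as a toy bigram model in Python. The corpus, counts, and function are purely illustrative; a real LLM learns far richer statistical relationships with a neural network, but the core move of picking the most likely next token from observed frequencies is the same.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the trillions of tokens a real LLM trains on.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each other token (a bigram model).
transitions = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current][nxt] += 1

def predict_next(token: str) -> str:
    """Return the statistically most likely token to follow `token`."""
    followers = transitions[token]
    return followers.most_common(1)[0][0] if followers else "<unknown>"

print(predict_next("the"))  # -> "cat" ("cat" follows "the" twice; "mat" and "fish" once each)
```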

Typically, LLM training starts with a large corpus of data to help a model grasp the basics of a target language. Many of these general-purpose datasets are open source, like Common Crawl, which holds petabytes of web pages, metadata, and text extracts. Common Crawl adds roughly 3-5 billion new pages each month. Another open-source dataset is The Pile, an 800GB+ training corpus comprising over 20 smaller datasets, including Wikipedia, GitHub, PubMed, and many other open data pools.
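
As a sketch of how a model builder might sample one of these web-scale corpora, the Hugging Face datasets library can stream a Common Crawl-derived dataset such as C4 without downloading it in full. The dataset name and record fields below reflect the public C4 release and are illustrative choices, not a prescription:

```python
# Sketch: streaming a Common Crawl-derived corpus instead of downloading
# petabytes up front. Assumes the Hugging Face `datasets` library and the
# public "allenai/c4" dataset; both are illustrative choices.
from datasets import load_dataset

# streaming=True yields records lazily over the network.
web_corpus = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(web_corpus):
    print(record["url"])        # source page, useful for filtering by domain
    print(record["text"][:80])  # extracted page text fed into pre-training
    if i == 2:                  # peek at just a few documents
        break
```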

The next stage of LLM training involves something called Instruction Tuning. Whereas general training sets like Common Crawl and The Pile familiarize a model with a target language, Instruction Tuning shows the LLM the desired output for different prompts.

Instruction Tuning is why Midjourney outputs an image based on a prompt while RunwayML outputs a video. Instruction Tuning is how you shape a pre-trained model for practical use.
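
To see what “showing the LLM the desired output” looks like in practice, here is a minimal sketch of a single instruction-tuning record in the widely used instruction/input/output style. The fields and text are illustrative, not drawn from any specific vendor’s dataset:

```python
import json

# One record from a hypothetical instruction-tuning set. Many thousands of
# pairs like this teach a pre-trained model what output each kind of prompt
# should produce.
record = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": "Large language models are trained on trillions of tokens...",
    "output": "LLMs learn to predict tokens from very large training corpora.",
}

# During fine-tuning, instruction + input form the model's prompt, and
# the output is the target the model is trained to reproduce.
print(json.dumps(record, indent=2))
```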

Depending on the use case for an LLM, further training is conducted with Preference Datasets (how do users want output presented?), Evaluation Datasets (how are you testing the accuracy of a model’s output?), along with Natural Language Processing (NLP) techniques to keep improving capabilities in the target language.
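
Preference Datasets are easiest to grasp with an example: each record pairs one prompt with a response users preferred and one they rejected. A minimal sketch follows, with field names that follow common RLHF/DPO conventions but are assumptions, not any particular vendor’s schema:

```python
# Sketch of a preference record as used in RLHF- or DPO-style tuning.
# Field names ("prompt", "chosen", "rejected") follow common convention
# and are assumptions, not a specific vendor's schema.
preference_record = {
    "prompt": "Explain instruction tuning to a screenwriter.",
    "chosen": (
        "Instruction tuning is like giving an actor stage directions: "
        "the model learns what kind of performance each prompt calls for."
    ),
    "rejected": (
        "Instruction tuning minimizes cross-entropy loss on supervised "
        "prompt-response pairs."  # accurate, but not how users want it presented
    ),
}

# A reward model (or a direct preference objective) is trained so that
# `chosen` scores higher than `rejected`, nudging future outputs toward
# the presentation users actually prefer.
```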

The upshot is that LLMs need data and lots of it to improve their performance and accuracy. However, the bigger and broader these LLMs become, the more acute the challenge of finding enough data and enough of the right data to train them.

Emerging Challenges with LLM Training Data

Recent LLMs like Meta’s Llama 3 used 15 trillion tokens for training, while GPT-4’s core training set reportedly used ~13 trillion tokens. These general-purpose LLMs (also known as foundation models) have scraped almost all of the articles, blog posts, comments, and other digital content on the Internet as input. In the vast majority of cases, the content providers received no compensation.

AI technology companies claim that foundation models trained on copyrighted work don’t infringe on creators’ rights because the models don’t copy and redistribute the content. Instead, the models learn relationships among the data elements that make up the content.

So when someone enters a text prompt to create something or answer a question, the model builders argue the resulting output is a unique expression. It’s an interpretation of the Fair Use Doctrine of copyright law that is being challenged in court by The New York Times and other publications, which are suing OpenAI on the grounds that its foundation models return output nearly identical to their copyrighted content. Simultaneously, other publishers like Condé Nast and Germany’s Axel Springer license their content specifically for LLM training. Copyright licensing and lawsuits happening in near equal measure demonstrate the current flux over how LLMs use previously published work and how that work should be treated from an intellectual-property point of view.

What’s more, copyright is just one of multiple challenges facing AI model builders now that LLMs have left research labs and gone mainstream. There are growing questions about the efficacy of training on ever-larger datasets as the means of improving LLM performance.

Influential AI researchers like Yann LeCun and Gary Marcus have publicly challenged the idea that giant datasets are the key to better LLMs.

“There will never be enough data; there will always be outliers. This is why driverless cars are still just demos, and why LLMs will never be reliable.”

— Gary Marcus

Additionally, the general quality of LLM output is being questioned. Google suffered embarrassing headlines when its generative AI search results suggested people use glue to keep pizza toppings from sliding off the dough. Granted, that’s a silly example, but it’s also an incredibly sticky one in people’s minds regarding AI trustworthiness. And it highlights a problem inherent in training LLMs on the entirety of the Internet without prioritizing reputable sources over untrustworthy ones.

LLMs are pattern recognizers and manipulators par excellence. Toss the full text of a book into an LLM with orders to extract the five main themes, and you’ll get a pretty good output. Ask the LLM to remix the content of the book in the style of another genre (e.g., sci-fi re-rolled as opera) and you’re likely to get something workable.
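
As a sketch of the theme-extraction task just described, here is how one might send a book to a hosted LLM using the OpenAI Python client; the model name is a placeholder and book_text stands in for the full manuscript:

```python
# Sketch: asking a hosted LLM to extract a book's main themes.
# Assumes the OpenAI Python client; the model name is a placeholder
# and `book_text` stands in for the full manuscript.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
book_text = "..."   # full text of the book goes here

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any capable chat model
    messages=[
        {"role": "system", "content": "You are a careful literary analyst."},
        {"role": "user", "content": f"Extract the five main themes from this book:\n\n{book_text}"},
    ],
)
print(response.choices[0].message.content)
```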

But ask the LLM to truly re-imagine a narrative, that is, to offer a unique “What If?” scenario, and the limits of purely statistical reasoning start to emerge.

Creating “What If?” scenarios is a lot harder than remixing a corpus of training data. Moreover, just adding more data doesn’t get you closer to better scenarios. Intelligence is not the same as imagination.

GPT-5 will be a major test of whether the statistical approach to LLM intelligence continues to deliver performance improvements or whether the curve is flattening. There are persuasive arguments on both sides. What’s clear going forward is that the current methods for training AI models, born in research labs, must evolve to match different times and higher stakes.

Scraping the Internet (legal and otherwise) will take us only so far while opening up new challenges (IP, bias, authenticity, quality, malware, etc.). These are significant challenges facing LLM builders.

But they’re also challenges for content providers, who must better understand how their media improves LLM performance. Only then will content creators be in a better position to negotiate terms for their intellectual property.

Part 3 of Storytellers are Developers will analyze New Audiences. Many of the most important audience members aren’t even human.