For years, we were told it couldn’t be done. That training large AI models without scraping copyrighted content was a fantasy. Now, a scrappy group of researchers has quietly done exactly that — and it works.
Their new dataset, built entirely from public domain and openly licensed material, may signal a turning point in how ethical AI gets built.
Inside the Breakthrough: What the Team Built
A cross-institutional team of researchers from EleutherAI, Hugging Face, the University of Toronto, and others has created the Common Pile v0.1 — an 8-terabyte dataset composed exclusively of legally sound, ethically sourced text.
This includes books whose copyrights have expired, open-source code, public government transcripts, Creative Commons–licensed YouTube captions, and educational resources meant to be shared. Every source was manually vetted for licensing clarity.
The effort began quietly in late 2024 and took just a few months to complete, despite having no backing from major tech companies. The researchers cleaned, curated, and formatted the data themselves — a painstaking process that most AI giants have long avoided.
No forum scraping. No social media data grabs. No grey-area news content.
To test the viability of the dataset, the team trained two models — Comma v0.1-1T and Comma v0.1-2T — each with 7 billion parameters, the same as Meta’s original LLaMA-7B.
The models were trained on one and two trillion tokens respectively — roughly the equivalent of millions of books. The result? Comma v0.1-1T matched or exceeded the performance of similarly sized models on standard benchmarks, including programming tasks.
No, these aren’t GPT-4 killers. But that’s not the point. The real message is this: high-performance models can be trained on ethical, open data alone.
The Bigger Picture: Why This Could Shift AI’s Foundation
This research knocks a hole in one of the most persistent defenses used by AI giants — that using copyrighted content is unavoidable.
Just months ago, OpenAI told UK lawmakers it would be “impossible” to train modern models without copyrighted materials. Now, a few dozen researchers have done it as a side project.
It’s a bit like building a working electric car in your garage, after every major automaker insisted it couldn’t be done without fossil fuels.
Of course, the Common Pile is smaller than the datasets powering GPT-4 or Gemini, which are trained on tens of trillions of tokens. But the Comma models show that quality, legality, and transparency don’t have to be sacrificed for performance — especially in the early stages.
As the team itself points out, expanding the dataset to include more fiction, informal language, and underrepresented dialects will be key. That’s where public institutions, nonprofits, and global contributors could help scale this into something that genuinely rivals Big Tech.
Expert Insight
“Because copyright today covers virtually every sort of human expression… it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI told UK Parliament.
This research directly contests that assertion — not in theory, but with working models.
GazeOn’s Take: Where It Could Go From Here
This could mark a turning point in the ethics-versus-scale debate in AI. While Big Tech leans into secrecy and data hoarding, projects like Common Pile show another path: one built on transparency, legality, and community effort.
Don’t expect Google or OpenAI to switch tracks overnight. But with more support, open datasets like this could grow into legitimate competitors — and keep the pressure on.
Question for Readers
Could future foundation models be built entirely on ethical, open data? Or will scale always win? Join the debate in the comments.
Source: GitHub
About the Author:
Eli Grid is a technology journalist covering the intersection of artificial intelligence, policy, and innovation. With a background in computational linguistics and over a decade of experience reporting on AI research and global tech strategy, Eli is known for his investigative features and clear, data-informed analysis. His reporting bridges the gap between technical breakthroughs and their real-world implications, bringing readers timely, insightful stories from the front lines of the AI revolution. Eli's work has been featured in leading tech outlets and cited by academic and policy institutions worldwide.