For years, we were told it couldn’t be done. That training large AI models without scraping copyrighted content was a fantasy. Now, a scrappy group of researchers has quietly done exactly that — and it works.
Their new dataset, built entirely from public domain and openly licensed material, may signal a turning point in how ethical AI gets built.
Inside the Breakthrough: What the Team Built
A cross-institutional team of researchers from EleutherAI, Hugging Face, the University of Toronto, and others has created the Common Pile v0.1 — an 8-terabyte dataset composed exclusively of legally sound, ethically sourced text.
This includes books whose copyrights have expired, open-source code, public government transcripts, Creative Commons–licensed YouTube captions, and educational resources meant to be shared. Every source was manually vetted for licensing clarity.
The effort began quietly in late 2024 and took just a few months to complete, despite having no backing from major tech companies. The researchers cleaned, curated, and formatted the data themselves — a painstaking process that most AI giants have long avoided.
No forum scraping. No social media data grabs. No grey-area news content.
To test the viability of the dataset, the team trained two models — Comma v0.1-1T and Comma v0.1-2T — each with 7 billion parameters, the same as Meta’s original LLaMA-7B.
The models were trained on one and two trillion tokens respectively — roughly the equivalent of millions of books. The result? Comma v0.1-1T matched or exceeded the performance of similarly sized models on standard benchmarks, including programming tasks.
No, these aren’t GPT-4 killers. But that’s not the point. The real message is this: high-performance models can be trained on ethical, open data alone.
The Bigger Picture: Why This Could Shift AI’s Foundation
This research knocks a hole in one of the most persistent defenses used by AI giants — that using copyrighted content is unavoidable.
Just months ago, OpenAI told UK lawmakers it would be “impossible” to train modern models without copyrighted materials. Now, a few dozen researchers have done it as a side project.
It’s a bit like building a working electric car in your garage, after every major automaker insisted it couldn’t be done without fossil fuels.
Of course, the Common Pile is smaller than the datasets powering GPT-4 or Gemini, which are trained on tens of trillions of tokens. But the Comma models show that quality, legality, and transparency don’t have to be sacrificed for performance — especially in the early stages.
As the team itself points out, expanding the dataset to include more fiction, informal language, and underrepresented dialects will be key. That’s where public institutions, nonprofits, and global contributors could help scale this into something that genuinely rivals Big Tech.
Expert Insight
“Because copyright today covers virtually every sort of human expression… it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI told UK Parliament.
This research directly contests that assertion — not in theory, but with working models.
GazeOn’s Take: Where It Could Go From Here
This could mark a turning point in the ethics-versus-scale debate in AI. While Big Tech leans into secrecy and data hoarding, projects like Common Pile show another path: one built on transparency, legality, and community effort.
Don’t expect Google or OpenAI to switch tracks overnight. But with more support, open datasets like this could grow into legitimate competitors — and keep the pressure on.
Question for Readers
Could future foundation models be built entirely on ethical, open data? Or will scale always win? Join the debate in the comments.
Source: GitHub
About the Author:
Eli Grid is a technology journalist covering the intersection of artificial intelligence, policy, and innovation. With a background in computational linguistics and over a decade of experience reporting on AI research and global tech strategy, Eli is known for his investigative features and clear, data-informed analysis. His reporting bridges the gap between technical breakthroughs and their real-world implications, bringing readers timely, insightful stories from the front lines of the AI revolution. Eli's work has been featured in leading tech outlets and cited by academic and policy institutions worldwide.