AI Breakthroughs

Researchers Build Powerful AI Model Using Only Open Data

Image credits: Alina Grubnyak.

For years, we were told it couldn’t be done: that training large AI models without scraping copyrighted content was a fantasy. Now, a scrappy group of researchers has quietly done exactly that — and it works.
Their new dataset, built entirely from public domain and openly licensed material, may signal a turning point in how ethical AI gets built.

Inside the Breakthrough: What the Team Built

A cross-institutional team of researchers from EleutherAI, Hugging Face, the University of Toronto, and others has created the Common Pile v0.1 — an 8-terabyte dataset composed exclusively of legally sound, ethically sourced text.
This includes expired-copyright books, open-source code, public government transcripts, Creative Commons YouTube captions, and educational resources meant to be shared. Every source was manually vetted for licensing clarity.
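The core of that vetting process can be pictured as a strict license allowlist: anything without a clearly open license is dropped. Here is a minimal illustrative sketch of the idea — the field names and the specific license list are assumptions for illustration, not the project's actual schema or pipeline:

```python
# Illustrative only: keep documents whose license is on an explicit
# allowlist of open licenses; drop anything ambiguous or unlabeled.
OPEN_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0",
    "MIT", "Apache-2.0", "Public-Domain",
}

def filter_open(docs):
    """Return only docs whose 'license' field is clearly open."""
    return [d for d in docs if d.get("license") in OPEN_LICENSES]

docs = [
    {"id": "gov-transcript-17", "license": "Public-Domain"},
    {"id": "scraped-forum-post", "license": None},
    {"id": "oss-readme", "license": "MIT"},
]

kept = filter_open(docs)
print([d["id"] for d in kept])  # unlabeled sources are excluded
```

The point of the allowlist design is that ambiguity counts as rejection — a document with a missing or unrecognized license is excluded, which mirrors the "no grey-area content" stance described above.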

The effort began quietly in late 2024 and took just a few months to complete, despite having no backing from major tech companies. The researchers cleaned, curated, and formatted the data themselves — a painstaking process that most AI giants have long avoided.
No forum scraping. No social media data grabs. No grey-area news content.

To test the viability of the dataset, the team trained two models — Comma v0.1-1T and Comma v0.1-2T — each with 7 billion parameters, the same as Meta’s original LLaMA-7B.
The models were trained on 1 trillion and 2 trillion tokens respectively, roughly the equivalent of millions of books. The result? Comma v0.1-1T matched or exceeded the performance of similarly sized models on standard benchmarks, including programming tasks.
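A quick sanity check on that scale, assuming an average book runs on the order of 100,000 tokens (roughly 75,000 words — an assumption, not a figure from the researchers):

```python
# Back-of-envelope: how many books' worth of text is 1-2 trillion tokens?
# Assumption: an average book is ~100,000 tokens.
TOKENS_PER_BOOK = 100_000

for total_tokens in (1_000_000_000_000, 2_000_000_000_000):
    books = total_tokens // TOKENS_PER_BOOK
    print(f"{total_tokens:,} tokens ≈ {books:,} books")
```

Under that assumption, the two training runs correspond to roughly 10 and 20 million books of text.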

No, these aren’t GPT-4 killers. But that’s not the point. The real message is this: high-performance models can be trained on ethical, open data alone.

The Bigger Picture: Why This Could Shift AI’s Foundation

This research knocks a hole in one of the most persistent defenses used by AI giants — that using copyrighted content is unavoidable.
Just months ago, OpenAI told UK lawmakers it would be “impossible” to train modern models without copyrighted materials. Now, a few dozen researchers have done it as a side project.

It’s a bit like building a working electric car in your garage, after every major automaker insisted it couldn’t be done without fossil fuels.

Of course, the Common Pile is smaller than the datasets powering GPT-4 or Gemini, which are trained on tens of trillions of tokens. But the Comma models show that quality, legality, and transparency don’t have to be sacrificed for performance — especially in the early stages.

As the team itself points out, expanding the dataset to include more fiction, informal language, and underrepresented dialects will be key. That’s where public institutions, nonprofits, and global contributors could help scale this into something that genuinely rivals Big Tech.

Expert Insight

“Because copyright today covers virtually every sort of human expression… it would be impossible to train today’s leading AI models without using copyrighted materials,” OpenAI told UK Parliament.
This research directly contests that assertion — not in theory, but with working models.

GazeOn’s Take: Where It Could Go From Here

This could mark a turning point in the ethics-versus-scale debate in AI. While Big Tech leans into secrecy and data hoarding, projects like Common Pile show another path: one built on transparency, legality, and community effort.
Don’t expect Google or OpenAI to switch tracks overnight. But with more support, open datasets like this could grow into legitimate competitors — and keep the pressure on.

Question for Readers

Could future foundation models be built entirely on ethical, open data? Or will scale always win? Join the debate in the comments.

Source: GitHub

About Author:

Eli Grid is a technology journalist covering the intersection of artificial intelligence, policy, and innovation. With a background in computational linguistics and over a decade of experience reporting on AI research and global tech strategy, Eli is known for his investigative features and clear, data-informed analysis. His reporting bridges the gap between technical breakthroughs and their real-world implications, bringing readers timely, insightful stories from the front lines of the AI revolution. Eli’s work has been featured in leading tech outlets and cited by academic and policy institutions worldwide.
