    Breaking AI News
    Technology & Innovation

    EleutherAI releases massive AI training dataset of licensed and open domain text

    By Art Ryan, June 7, 2025

    EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

    The dataset, called the Common Pile v0.1, took around two years to complete and was built in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. At 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

    AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

    EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

    “[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

    The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts. It draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

    EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which have 7 billion parameters and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

    Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
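
    To make the idea of a parameter count concrete, here is a minimal Python sketch. It is an illustration only: the toy network, the function names, and the assumption of 2 bytes per parameter (16-bit storage) are ours, not details of Comma v0.1 or anything EleutherAI has published.

```python
# Illustrative only: a toy fully connected network, not Comma's architecture.

def dense_layer_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


def count_params(layer_sizes: list[int]) -> int:
    """Total parameters for a stack of dense layers with the given widths."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def storage_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate storage in gigabytes (10**9 bytes).

    The default of 2 bytes per parameter is an assumption (16-bit values).
    """
    return n_params * bytes_per_param / 1e9


# A toy network with layer widths 4 -> 8 -> 2:
# (4*8 + 8) + (8*2 + 2) = 58 parameters.
print(count_params([4, 8, 2]))

# Rough storage for a 7-billion-parameter model under the 2-byte assumption.
print(storage_gb(7_000_000_000))
```

    Under that assumption, a model with 7 billion parameters needs roughly 14 GB just to store its weights, which gives a sense of the scale behind the figure quoted above.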

    “In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

    The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

    EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.

    Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the release of the datasets and models, but that their development involved many partners, including the University of Toronto, which helped lead the research.

    Source: https://techcrunch.com/
