, , , , , , ,

AI2 drops biggest open dataset yet for training language models

Language models like GPT-4 and Claude are powerful and useful, but the data on which they are trained is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a new, huge text dataset that’s free to use and open to inspection.

Dolma, as the dataset is called, is intended to be the basis for the research group’s planned open language model, or OLMo (Dolma is short for “Data to feed OLMo’s Appetite). As the model is intended to be free to use and modify by the AI research community, so too (argue AI2 researchers) should be the dataset they use to create it.

This is the first “data artifact” AI2 is making available pertaining to OLMo, and in a blog post, the organization’s Luca Soldaini explains the choice of sources and rationale behind various processes the team used to render it palatable for AI consumption. (“A more comprehensive paper is in the works,” they note at the outset.)

Although companies like OpenAI and Meta publish some of the vital statistics of the datasets they use to build their language models, a lot of that information is treated as proprietary. Apart from the known consequence of discouraging scrutiny and improvement at large, there is speculation that perhaps this closed approach is due to the data not being ethically or legally obtained: for instance, that pirated copies of many authors’ books are ingested.

You can see in this chart created by AI2 that the largest and most recent models only provide some of the information that a researcher would likely want to know about a given dataset. What information was removed, and why? What was considered high versus low quality text? Were personal details appropriately excised?

Chart showing different datasets’ openness or lack thereof.

Of course it is these companies’ prerogative, in the context of a fiercely competitive AI landscape, to guard the secrets of their models’ training processes. But for researchers outside the companies, it makes those datasets and models more opaque and difficult to study or replicate.

AI2’s Dolma is intended to be the opposite of these, with all its sources and processes — say, how and why it was trimmed to original English language texts —  publicly documented.

It’s not the first to try the open dataset thing, but it is the largest by far (3 billion tokens, an AI-native measure of content volume) and, they claim, the most straightforward in terms of use and permissions. It uses the “ImpACT license for medium-risk artifacts,” which you can see the details about here. But essentially it requires prospective users of Dolma to:

  • Provide contact information and intended use cases
  • Disclose any Dolma-derivative creations
  • Distribute those derivatives under the same license
  • Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation

For those who worry that despite AI2’s best efforts, some personal data of theirs may have made it into the database, there’s a removal request form available here. It’s for specific cases, not just a general “don’t use me” thing.

If that all sounds good to you, access to Dolma is available via Hugging Face.

https://techcrunch.com/2023/08/18/ai2-drops-biggest-open-dataset-yet-for-training-language-models/


January 2025
M T W T F S S
 12345
6789101112
13141516171819
20212223242526
2728293031  

About Us

Welcome to encircle News! We are a cutting-edge technology news company that is dedicated to bringing you the latest and greatest in everything tech. From automobiles to drones, software to hardware, we’ve got you covered.

At encircle News, we believe that technology is more than just a tool, it’s a way of life. And we’re here to help you stay on top of all the latest trends and developments in this ever-evolving field. We know that technology is constantly changing, and that can be overwhelming, but we’re here to make it easy for you to keep up.

We’re a team of tech enthusiasts who are passionate about everything tech and love to share our knowledge with others. We believe that technology should be accessible to everyone, and we’re here to make sure it is. Our mission is to provide you with fun, engaging, and informative content that helps you to understand and embrace the latest technologies.

From the newest cars on the road to the latest drones taking to the skies, we’ve got you covered. We also dive deep into the world of software and hardware, bringing you the latest updates on everything from operating systems to processors.

So whether you’re a tech enthusiast, a business professional, or just someone who wants to stay up-to-date on the latest advancements in technology, encircle News is the place for you. Join us on this exciting journey and be a part of shaping the future.

Podcasts

TWiT 1015: Smarter Than a House Cat – TikTok, Trumpcoin, Samsung Unpacked 2025 This Week in Tech (Audio)

Supreme Court Upholds Law That Threatens US TikTok Ban Trumpcoin Texas Sues Allstate Over Its Collection of Driver Data Skyrocketing car-insurance premiums are pushing inflation higher Behind the Curtain — Coming soon: Ph.D.-level super-agents 4 surprise products we could see at Samsung Unpacked 2025 Apple suspends error-strewn AI generated news alerts US Finalizes Rule Banning Smart Cars With Russian, Chinese Tech Natrium 'advanced nuclear' power plant wins Wyoming permit – WyoFile Cash App parent fined $175 million for 'woefully incomplete' response to fraud FDA Proposes Significant Step Toward Reducing Nicotine to Minimally or Nonaddictive Level in Cigarettes and Certain Other Combusted Tobacco Products Host: Leo Laporte Guests: Jason Hiner, Paris Martineau, and Molly White Download or subscribe to This Week in Tech at https://twit.tv/shows/this-week-in-tech Get episodes ad-free with Club TWiT at https://twit.tv/clubtwit Sponsors: joindeleteme.com/twit promo code TWIT ziprecruiter.com/twit NetSuite.com/TWIT canary.tools/twit – use code: TWIT shopify.com/twit
  1. TWiT 1015: Smarter Than a House Cat – TikTok, Trumpcoin, Samsung Unpacked 2025
  2. TWiT 1014: Just Say It's Capitalism – CES 2025, Meta News, Newag DRM
  3. TWiT 1013: Calamari in Crisis – Touching the Sun, Fake Spotify Artists, Banished Words
  4. TWiT 1012: Our Best Of 2024 – The Best Moments From TWiT's 2024
  5. TWiT 1011: The Year in Review – A Look at the Top Stories of 2024