
AI2 drops biggest open dataset yet for training language models

Language models like GPT-4 and Claude are powerful and useful, but the data on which they are trained is a closely guarded secret. The Allen Institute for AI (AI2) aims to reverse this trend with a new, huge text dataset that’s free to use and open to inspection.

Dolma, as the dataset is called, is intended to be the basis for the research group’s planned open language model, or OLMo (Dolma is short for “Data to feed OLMo’s Appetite”). As the model is intended to be free to use and modify by the AI research community, so too (argue AI2 researchers) should be the dataset they use to create it.

This is the first “data artifact” AI2 is making available pertaining to OLMo, and in a blog post, the organization’s Luca Soldaini explains the choice of sources and rationale behind various processes the team used to render it palatable for AI consumption. (“A more comprehensive paper is in the works,” they note at the outset.)

Although companies like OpenAI and Meta publish some of the vital statistics of the datasets they use to build their language models, a lot of that information is treated as proprietary. Apart from the known consequence of discouraging scrutiny and improvement at large, there is speculation that this closed approach stems from the data not being ethically or legally obtained: for instance, that pirated copies of many authors’ books were ingested.

You can see in this chart created by AI2 that the largest and most recent models only provide some of the information that a researcher would likely want to know about a given dataset. What information was removed, and why? What was considered high versus low quality text? Were personal details appropriately excised?

Chart showing different datasets’ openness or lack thereof.

Of course it is these companies’ prerogative, in the context of a fiercely competitive AI landscape, to guard the secrets of their models’ training processes. But for researchers outside the companies, it makes those datasets and models more opaque and difficult to study or replicate.

AI2’s Dolma is intended to be the opposite of these, with all its sources and processes (say, how and why it was trimmed to original English language texts) publicly documented.

It’s not the first to try the open dataset thing, but it is the largest by far (3 trillion tokens, an AI-native measure of content volume) and, they claim, the most straightforward in terms of use and permissions. It uses the “ImpACT license for medium-risk artifacts,” whose full terms are published by AI2. But essentially it requires prospective users of Dolma to:

  • Provide contact information and intended use cases
  • Disclose any Dolma-derivative creations
  • Distribute those derivatives under the same license
  • Agree not to apply Dolma to various prohibited areas, such as surveillance or disinformation

For those who worry that despite AI2’s best efforts, some personal data of theirs may have made it into the database, AI2 provides a removal request form. It’s for specific cases, not just a general “don’t use me” thing.

If that all sounds good to you, access to Dolma is available via Hugging Face.

https://techcrunch.com/2023/08/18/ai2-drops-biggest-open-dataset-yet-for-training-language-models/

