Harvard and Google to release 1 million public-domain books as AI training dataset

Published on: December 12, 2024

AI training data has a big price tag, one best suited to deep-pocketed tech firms. This is why Harvard University plans to release a dataset of roughly 1 million public-domain books spanning genres, languages, and authors including Dickens, Dante, and Shakespeare, all of which are no longer copyright-protected due to their age.

The new dataset isn’t available yet, and it’s not clear when or how it will be released. However, it contains books derived from Google’s longstanding book-scanning project, Google Books, and thus Google will be involved in releasing “this treasure trove far and wide.”

Harvard first teased the Institutional Data Initiative (IDI) back in March, outlining its plans to create a “trusted conduit for legal data for AI.” However, little had been heard from it until its formal launch today, which came with confirmation that the IDI has financial backing from Microsoft and OpenAI.

The IDI’s executive director, Greg Leppert, says the dataset is designed to “level the playing field” by opening up such a huge dataset to anyone, from research labs to AI startups, who wants to train large language models (LLMs).
