Harvard Releases Free AI Dataset with One Million Public-Domain Books

Harvard’s free AI dataset of nearly one million public-domain books aims to level the playing field in AI development. Credit: Wikimedia Commons / John Phelan CC BY 3.0

Harvard University has taken a significant step in artificial intelligence research by sharing a free AI dataset containing nearly one million public-domain books. This initiative, led by the university’s Institutional Data Initiative (IDI), aims to give smaller AI developers access to high-quality training data typically reserved for large tech companies.

The dataset includes books scanned through the Google Books project that are no longer under copyright protection, making them freely available to developers and researchers. Microsoft and OpenAI funded the initiative to help make AI development more accessible and fair.

A dataset five times the size of Books3

Harvard’s new dataset is about five times larger than the widely known Books3 dataset, previously used to train AI models like Meta’s Llama. The collection spans multiple genres, periods, and languages.

It includes classic works by authors such as Shakespeare, Charles Dickens, and Dante, alongside lesser-known materials like Czech mathematics textbooks and Welsh dictionaries.

Greg Leppert, executive director of the Institutional Data Initiative, described the project’s goal as an effort to “level the playing field” by offering high-quality data to smaller AI developers. “It’s gone through rigorous review,” he added.

Supporting fair AI development

Microsoft’s vice president for intellectual property, Burton Davis, explained why the company supports the initiative, emphasizing the importance of “pools of accessible data” for AI startups. He clarified, however, that Microsoft doesn’t plan to replace all its AI training data with free public-domain alternatives. “We use publicly available data for the purposes of training our models,” Davis said.

OpenAI expressed similar support. Tom Rubin, the company’s head of intellectual property and content, said OpenAI was “delighted” to back the project.

The dataset’s potential impact is significant. Smaller AI companies and individual researchers often lack the resources to compile large, high-quality datasets. By offering this collection, Harvard and its partners aim to give these smaller players access to resources typically reserved for major tech firms.

Legal challenges and ethical solutions

The project arrives at a time when the use of copyrighted data in AI training is under legal scrutiny. Several lawsuits are challenging AI companies over their data-collection methods, and the outcomes could reshape how AI models are developed.

Ed Newton-Rex, a former Stability AI executive now leading a nonprofit focused on ethical AI, believes public-domain datasets like Harvard’s provide an alternative to scraping copyrighted data. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” he said.

However, Newton-Rex warned that these datasets must replace copyrighted material, not just supplement it. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he explained.

Expanding public access

The IDI is also working with the Boston Public Library to scan millions of public-domain newspaper articles, and it is open to partnerships with other organizations to expand its offerings further.

The exact method for distributing the book collection is still being finalized. Harvard has approached Google for assistance in making the dataset publicly available. Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.

Growing public-domain efforts

Harvard’s dataset will join other public-domain resources supporting AI development. Earlier this year, the French AI startup Pleias introduced Common Corpus, a dataset containing 3 to 4 million public-domain books and periodicals.

Project coordinator Pierre-Carl Langlais said the French Ministry of Culture backs the dataset. It was downloaded over 60,000 times in just one month on the open-source platform Hugging Face.
