Harvard University has taken a significant step in artificial intelligence research by sharing a free AI dataset containing nearly one million public-domain books. This initiative, led by the university’s Institutional Data Initiative (IDI), aims to give smaller AI developers access to high-quality training data typically reserved for large tech companies.
The dataset includes books scanned through the Google Books project that are no longer under copyright protection, making them freely available to developers and researchers. Microsoft and OpenAI funded the initiative to help make AI development more accessible and fair.
A dataset five times the size of Books3
Harvard’s new dataset is about five times larger than the widely known Books3 dataset, previously used to train AI models like Meta’s Llama. The collection spans multiple genres, periods, and languages.
It includes classic works by authors such as Shakespeare, Charles Dickens, and Dante, alongside lesser-known materials like Czech mathematics textbooks and Welsh dictionaries.
Greg Leppert, executive director of the Institutional Data Initiative, described the project’s goal as an effort to “level the playing field” by offering high-quality data to smaller AI developers. “It’s gone through rigorous review,” he added.
Supporting fair AI development
Microsoft’s vice president for intellectual property, Burton Davis, explained why the company supports the initiative, emphasizing the importance of “pools of accessible data” for AI startups. He clarified, however, that Microsoft doesn’t plan to replace all its AI training data with free public-domain alternatives. “We use publicly available data for the purposes of training our models,” Davis said.
OpenAI expressed similar support. Tom Rubin, OpenAI’s head of intellectual property and content, said the company was “delighted” to back the project.
Smaller AI companies and individual researchers often lack the resources to compile large, high-quality datasets. By offering this collection, Harvard and its partners aim to give them training data on a scale usually available only to major tech firms.
Legal challenges and ethical solutions
The project arrives at a time when the use of copyrighted data in AI training is under legal scrutiny. Several lawsuits are challenging AI companies over their data-collection methods, and the outcomes could reshape how AI models are developed.
Ed Newton-Rex, a former Stability AI executive now leading a nonprofit focused on ethical AI, believes public-domain datasets like Harvard’s provide an alternative to scraping copyrighted data. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” he said.
However, Newton-Rex warned that these datasets must replace copyrighted material, not just supplement it. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he explained.
Expanding public access
The IDI also works with the Boston Public Library to scan millions of public-domain newspaper articles. The initiative is open to forming additional partnerships with other organizations to expand its offerings further.
The exact method for distributing the book collection is still being finalized. Harvard has approached Google for assistance in making the dataset publicly available. Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.
Growing public-domain efforts
Harvard’s dataset will join other public-domain resources supporting AI development. Earlier this year, the French AI startup Pleias introduced Common Corpus, a dataset containing 3 to 4 million public-domain books and periodicals.
Project coordinator Pierre-Carl Langlais said the French Ministry of Culture backs the dataset. It was downloaded over 60,000 times in just one month on the open-source platform Hugging Face.