Friday, January 24, 2025

Harvard Releases Free AI Dataset with One Million Public-Domain Books

Harvard’s free AI dataset of nearly one million public-domain books aims to level the playing field in AI development.
Harvard's free AI dataset is set to offer nearly one million public-domain books. Credit: Wikimedia Commons / John Phelan CC BY 3.0

Harvard University has taken a significant step in artificial intelligence research by sharing a free AI dataset containing nearly one million public-domain books. This initiative, led by the university’s Institutional Data Initiative (IDI), aims to give smaller AI developers access to high-quality training data typically reserved for large tech companies.

The dataset includes books scanned through the Google Books project that are no longer under copyright protection, making them freely available to developers and researchers. Microsoft and OpenAI funded the initiative to help make AI development more accessible and fair.

A dataset five times the size of Books3

Harvard’s new dataset is about five times larger than the widely known Books3 dataset, previously used to train AI models like Meta’s Llama. The collection spans multiple genres, periods, and languages.

It includes classic works by authors such as Shakespeare, Charles Dickens, and Dante, alongside lesser-known materials like Czech mathematics textbooks and Welsh dictionaries.

Greg Leppert, executive director of the Institutional Data Initiative, described the project’s goal as an effort to “level the playing field” by offering high-quality data to smaller AI developers. “It’s gone through rigorous review,” he added.

Supporting fair AI development

Microsoft’s vice president for intellectual property, Burton Davis, explained why the company supports the initiative, emphasizing the importance of “pools of accessible data” for AI startups. He clarified, however, that Microsoft doesn’t plan to replace all its AI training data with free public-domain alternatives. “We use publicly available data for the purposes of training our models,” Davis said.

OpenAI expressed similar support. Tom Rubin, the company’s head of intellectual property and content, said OpenAI was “delighted” to back the project.

The dataset’s potential impact is significant. Smaller AI companies and individual researchers often lack the resources to compile large, high-quality datasets. By offering this collection, Harvard and its partners aim to give these smaller players access to resources typically reserved for major tech firms.

Legal challenges and ethical solutions

The project arrives at a time when the use of copyrighted data in AI training is under legal scrutiny. Several lawsuits are challenging AI companies over their data-collection methods, and the outcomes could reshape how AI models are developed.

Ed Newton-Rex, a former Stability AI executive now leading a nonprofit focused on ethical AI, believes public-domain datasets like Harvard’s provide an alternative to scraping copyrighted data. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” he said.

However, Newton-Rex warned that these datasets must replace copyrighted material, not just supplement it. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he explained.

Expanding public access

The IDI also works with the Boston Public Library to scan millions of public-domain newspaper articles. The initiative is open to forming additional partnerships with other organizations to expand its offerings further.

The exact method for distributing the book collection is still being finalized. Harvard has approached Google for assistance in making the dataset publicly available. Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.

Growing public-domain efforts

Harvard’s dataset will join other public-domain resources supporting AI development. Earlier this year, the French AI startup Pleias introduced Common Corpus, a dataset containing 3 to 4 million public-domain books and periodicals.

Project coordinator Pierre-Carl Langlais said the French Ministry of Culture backs the dataset. It was downloaded over 60,000 times in just one month on the open-source platform Hugging Face.
