Friday, December 13, 2024

Harvard Releases Free AI Dataset with One Million Public-Domain Books

Harvard’s free AI dataset of nearly one million public-domain books aims to level the playing field in AI development.
Harvard's free AI dataset will offer nearly one million public-domain books. Credit: Wikimedia Commons / John Phelan CC BY 3.0

Harvard University has taken a significant step in artificial intelligence research by sharing a free AI dataset containing nearly one million public-domain books. This initiative, led by the university’s Institutional Data Initiative (IDI), aims to give smaller AI developers access to high-quality training data typically reserved for large tech companies.

The dataset includes books scanned through the Google Books project that are no longer under copyright protection, making them freely available to developers and researchers. Microsoft and OpenAI funded the initiative to help make AI development more accessible and fair.

A dataset five times the size of Books3

Harvard’s new dataset is about five times larger than the widely known Books3 dataset, previously used to train AI models like Meta’s Llama. The collection spans multiple genres, periods, and languages.

It includes classic works by authors such as Shakespeare, Charles Dickens, and Dante, alongside lesser-known materials like Czech mathematics textbooks and Welsh dictionaries.

Greg Leppert, executive director of the Institutional Data Initiative, described the project’s goal as an effort to “level the playing field” by offering high-quality data to smaller AI developers. “It’s gone through rigorous review,” he added.

Supporting fair AI development

Microsoft’s vice president for intellectual property, Burton Davis, explained why the company supports the initiative, emphasizing the importance of “pools of accessible data” for AI startups. He clarified, however, that Microsoft doesn’t plan to replace all its AI training data with free public-domain alternatives. “We use publicly available data for the purposes of training our models,” Davis said.

OpenAI expressed similar support. Tom Rubin, OpenAI’s head of intellectual property and content, said the company was “delighted” to back the project.

The dataset’s potential impact is significant. Smaller AI companies and individual researchers often lack the resources to compile large, high-quality datasets. By offering this collection, Harvard and its partners aim to give these smaller players access to resources typically reserved for major tech firms.

Legal challenges and ethical solutions

The project comes at a time when the use of copyrighted data in AI training is under legal scrutiny. Several lawsuits are challenging AI companies over their data-collection methods, and the outcomes could reshape how AI models are developed.

Ed Newton-Rex, a former Stability AI executive now leading a nonprofit focused on ethical AI, believes public-domain datasets like Harvard’s provide an alternative to scraping copyrighted data. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” he said.

However, Newton-Rex warned that these datasets must replace copyrighted material, not just supplement it. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he explained.

Expanding public access

The IDI also works with the Boston Public Library to scan millions of public-domain newspaper articles. The initiative is open to forming additional partnerships with other organizations to expand its offerings further.

The exact method for distributing the book collection is still being finalized. Harvard has approached Google for assistance in making the dataset publicly available. Kent Walker, Google’s president of global affairs, said the company was “proud to support” the project.

Growing public-domain efforts

Harvard’s dataset will join other public-domain resources supporting AI development. Earlier this year, the French AI startup Pleias introduced Common Corpus, a dataset containing 3 to 4 million public-domain books and periodicals.

Project coordinator Pierre-Carl Langlais said the French Ministry of Culture backs the dataset. It was downloaded more than 60,000 times in a single month on the open-source platform Hugging Face.
