AI chatbots need more books to learn from, so more libraries are opening their stacks

13-06-2025

Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organised by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitising rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitised would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitisation is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitised as part of this project will be made public.'
Harvard's collection was already digitised starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally resolved in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to provide guidance on mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'

Upcoming ROG Xbox Ally Handheld Could Have Reported US Prices

Hans India

4 minutes ago

Microsoft has confirmed that both ROG Xbox Ally handhelds will launch in October, but the company has remained tight-lipped about pricing and pre-orders.

Microsoft Confirms October 16th Release Date for ROG Xbox Ally

Official pricing and pre-order information has still not been announced. Microsoft told IGN in an interview yesterday that we will be hearing more 'in a few weeks' due to the 'macro-economic climate.'

On the plus side, the pricing appears to have leaked already. Trusted leaker billbil-kun has published list prices for the ROG Xbox Ally range in advance of today's official reveal. The ROG Xbox Ally will reportedly retail for $549.99 USD, while the ROG Xbox Ally X is listed at $899.99 USD. Needless to say, these are two very different Xbox handhelds. The lower-tier ROG Xbox Ally is intended to run games at 720p with lower graphical settings, while the ROG Xbox Ally X is intended to target 1080p resolutions. Pre-orders are expected to open later today, with the official release date confirmed as October 16th.

Features and Specs

Microsoft officially unveiled the two ROG handhelds at its Xbox Games Showcase 2025 event. Both offer a Windows-based Xbox experience on a handheld PC with a dedicated Xbox button. At the time, the company only shared a 'Holiday 2025' release window and did not mention pricing. There are also reports that Microsoft is cancelling an in-house Xbox handheld in favour of an 'Xbox on Windows' approach that ports the Xbox experience to third-party portable gaming PCs like Asus' ROG Xbox Ally series.

We know from the June showcase that the higher-end ROG Xbox Ally X includes an AMD Ryzen Z2 Extreme chip, 24GB LPDDR5X‑8000 RAM, 1TB storage, a 7-inch 1080p/120Hz 500‑nit OLED display, dual USB‑C (USB4 + USB 3.2), and an 80Wh battery. The cheaper ROG Xbox Ally includes a Ryzen Z2A, 16GB LPDDR5X‑6400 RAM, 512GB storage, the same display, dual USB 3.2 Type‑C, and a 60Wh battery.

Until Microsoft or Asus officially confirms them, treat these prices as unverified leaks and take them with a grain of salt.

iPhone 17 Pro Max: Launch date, availability and pre-order details tipped online

Hindustan Times

15 minutes ago

Apple is preparing for its next major smartphone launch, and attention is already turning toward the iPhone 17 Pro Max. The device will lead the iPhone 17 series, which is expected to include the iPhone 17, iPhone 17 Air, iPhone 17 Pro, and the top-tier iPhone 17 Pro Max. The Cupertino-based tech giant has followed a steady schedule for its flagship releases over the past decade, usually hosting launch events in early September. Based on this pattern, industry experts expect Apple to unveil the iPhone 17 Pro Max during the first week of September 2025. Reports also suggest that the announcement may fall in the week beginning September 8. Apple generally prefers to hold events on Tuesdays, though it has occasionally shifted to Mondays to avoid clashes with other schedules.

iPhone 17 Pro Max: Pre-Orders and Release Timeline (Expected)

If Apple maintains its usual strategy, pre-orders for the iPhone 17 Pro Max will likely open the Friday after the launch event. That would place the pre-order date around September 12, 2025. Shipments and in-store availability usually begin a week later, which could make September 19, 2025, the official release date. This approach continues Apple's tradition of keeping the window between announcement and delivery short, allowing early buyers to secure their devices quickly.

What the New Flagship May Offer

Although Apple has not confirmed specifications or other key details, early reports suggest that the iPhone 17 Pro Max could see improvements in several areas. The iPhone 17 Pro Max is expected to feature a more powerful processor, upgraded camera systems, and enhancements in display quality. Battery efficiency is also likely to improve, continuing Apple's focus on performance and usability. As the Pro Max remains the largest and most advanced option in the lineup, it is expected to carry the most premium features.

Alongside hardware upgrades, the upcoming device will launch with the latest version of iOS, which may introduce new software capabilities and system improvements. Apple's yearly flagship launches generally balance design continuity with performance enhancements, and the iPhone 17 Pro Max is expected to follow the same formula.

Karnataka to consider extending Yellow Line to Attibele, says DK Shivakumar

Times of India

21 minutes ago

Deputy Chief Minister DK Shivakumar said the state government has tasked Hyderabad-based RV Associates with preparing a Detailed Project Report (DPR) on extending the Namma Metro Yellow Line beyond its current Bommasandra terminus toward Attibele. The deputy CM, who also holds the Bengaluru Development portfolio, told the ongoing state assembly session that many MLAs and residents have been making persistent appeals to his office to extend the line.

'We have to take the metro another 11-12 km to link Jigani. It is a potential place. We will take suitable measures,' Shivakumar said. He made the statement in the Assembly during Question Hour while replying to Congress member B Shivanna, who demanded that the recently inaugurated Yellow Line be extended to Attibele. Shivakumar highlighted the planned 90,000-seat Karnataka Housing Board stadium near Attibele as a major reason to improve connectivity to the region.

Attibele is located about 32 km from Bengaluru on NH-7, on Karnataka's border with Tamil Nadu. It lies near Jigani, which hosts a cluster of industrial and warehousing facilities. Bühler (India) Pvt. Ltd., headquartered in the area, manufactures machinery and plants for diverse industrial applications, while Schneider Electric President Systems Ltd operates multiple units in the local KIADB industrial estate. The region has also emerged as a logistics hub, with brands such as Nike and Flipkart maintaining large warehouses there.

The Yellow Line, inaugurated by Prime Minister Narendra Modi on August 10, 2025, spans 19.15 km with 16 elevated stations and cost between ₹7,160 crore and ₹7,610 crore to build. According to reports, it carried over 83,000 passengers on its first day alone.