AI chatbots need more books to learn from. These libraries are opening their stacks
CAMBRIDGE, Mass. (AP) — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom, but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th-century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.