
AI chatbots need more books to learn from; These libraries are opening their stacks
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century - and in 254 languages - are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
"It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright," said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold "significant amounts of interesting cultural, historical and language data" that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by "unrestricted gifts" from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
"We're trying to move some of the power from this current AI moment back to these institutions," said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. "Librarians have always been the stewards of data and the stewards of information."
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s - a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
"A lot of the data that's been used in AI training has not come from original sources," said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes "all the way back to the physical copy that was scanned by the institutions that actually collected those items," he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens - units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from "shadow libraries" of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
"OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning," Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
"We've been very clear that, 'Hey, we're a public library,'" Chapel said. "Our collections are held for public use, and anything we digitized as part of this project will be made public."
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be "immensely critical" for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
"At a university, you have a lot of pedagogy around what it means to reason," Leppert said. "You have a lot of scientific information about how to run processes and how to run analyses."
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
"When you're dealing with such a large data set, there are some tricky issues around harmful content and language," said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to "help them make their own informed decisions and use AI responsibly."