AI chatbots need more books - these libraries are opening their stacks

The Star, 13-06-2025
Everything ever said on the Internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century - and in 254 languages - are part of a Harvard University collection being released to AI researchers on Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
"It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright," said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold "significant amounts of interesting cultural, historical and language data" that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by "unrestricted gifts" from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
"We're trying to move some of the power from this current AI moment back to these institutions," said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab.
"Librarians have always been the stewards of data and the stewards of information."
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper.
One of the earlier works is from the 1400s - a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
"A lot of the data that's been used in AI training has not come from original sources," said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes "all the way back to the physical copy that was scanned by the institutions that actually collected those items," he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books.
They just needed lots of what computer scientists call tokens - units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from "shadow libraries" of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated US$50mil (RM211mil) this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitising rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the US, the library made clear that any information it digitised would be for everyone, said Jessica Chapel, its chief of digital and online services.
"OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning," Chapel said.
Digitisation is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
"We've been very clear that, 'Hey, we're a public library,'" Chapel said. "Our collections are held for public use, and anything we digitised as part of this project will be made public."
Harvard's collection was already digitised starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the US Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the US typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared on Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be "immensely critical" for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
"At a university, you have a lot of pedagogy around what it means to reason," Leppert said. "You have a lot of scientific information about how to run processes and how to run analyses."
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
"When you're dealing with such a large data set, there are some tricky issues around harmful content and language," said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to provide guidance about mitigating the risks of using the data, to "help them make their own informed decisions and use AI responsibly." - AP