AI chatbots need more books to learn from

12-06-2025

CAMBRIDGE, Mass — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a small fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated US$50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to provide guidance about mitigating the risks of using the data, to help users 'make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.
Matt O'Brien, The Associated Press


