AI chatbots need more books to learn from. These libraries are opening their stacks


CAMBRIDGE, Mass. (AP) — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a drop in the bucket compared with what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.


