AI chatbots need more books to learn from. These libraries are opening their stacks

Independent, a day ago

Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a drop compared with what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, "Hey, we're a public library,"' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.


