
AI chatbots need more books to learn from. These libraries are opening their stacks
CAMBRIDGE, Mass. (AP) — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
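To put those figures in proportion, here is a rough back-of-the-envelope calculation using only the numbers quoted in this article. It is a sketch, not an official statistic: the token count is an estimate, the Meta and page counts are stated as lower bounds, and the variable names are illustrative.

```python
# Figures as quoted in the article. All are estimates or lower bounds,
# so the results below are approximations only.
HARVARD_TOKENS = 242_000_000_000      # "an estimated 242 billion tokens"
META_TOKENS = 30_000_000_000_000      # "more than 30 trillion tokens"
HARVARD_PAGES = 394_000_000           # "more than 394 million scanned pages"

# Share of Meta's reported training volume that the Harvard set would cover.
# Because Meta's figure is a lower bound, the true share is at most this.
share = HARVARD_TOKENS / META_TOKENS

# Average tokens per scanned page. Because the page count is a lower bound,
# the true average is at most this.
tokens_per_page = HARVARD_TOKENS / HARVARD_PAGES

print(f"Harvard collection vs. Meta's training data: {share:.2%}")
print(f"Average tokens per scanned page: {tokens_per_page:.0f}")
```

By this arithmetic the Harvard collection amounts to less than 1% of the volume Meta reported, or roughly 600 tokens per scanned page.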
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, 'Hey, we're a public library,'' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.
Related Articles


Winnipeg Free Press
Trump clears path for Nippon Steel investment in US Steel, so long as it fits the government's terms
WASHINGTON (AP) — President Donald Trump on Friday signed an executive order paving the way for a Nippon Steel investment in U.S. Steel, so long as the Japanese company complies with a 'national security agreement' submitted by the federal government. Trump's order didn't detail the terms of the national security agreement. But U.S. Steel and Nippon Steel said in a joint statement that the agreement stipulates that approximately $11 billion in new investments will be made by 2028 and includes giving the U.S. government a 'golden share' — essentially veto power to ensure the country's national security interests are protected. 'We thank President Trump and his Administration for their bold leadership and strong support for our historic partnership,' the two companies said. 'This partnership will bring a massive investment that will support our communities and families for generations to come. We look forward to putting our commitments into action to make American steelmaking and manufacturing great again.' The companies have completed a U.S. Department of Justice review and received all necessary regulatory approvals, the statement said. 'The partnership is expected to be finalized promptly,' the statement said. The companies offered few details on how the golden share would work and what investments would be made. Trump said Thursday that he would as president have 'total control' of what U.S. Steel did as part of the investment. Trump said then that the deal would preserve '51% ownership by Americans.' The Japan-based steelmaker had been offering nearly $15 billion to purchase the Pittsburgh-based U.S. Steel in a merger that had been delayed on national security concerns starting during Joe Biden's presidency. Trump opposed the purchase while campaigning for the White House, yet he expressed optimism in working out an arrangement once in office. 
'We have a golden share, which I control,' said Trump, although it was unclear what he meant by suggesting that the federal government would determine what U.S. Steel does as a company. Trump added that he was 'a little concerned' about what presidents other than him would do with their golden share, 'but that gives you total control.' Still, Nippon Steel has never said it was backing off its bid to buy and control U.S. Steel as a wholly owned subsidiary. The proposed merger had been under review by the Committee on Foreign Investment in the United States, or CFIUS, during the Trump and Biden administrations. The order signed Friday by Trump said the CFIUS review provided 'credible evidence' that Nippon Steel 'might take action that threatens to impair the national security of the United States,' but such risks might be 'adequately mitigated' by approving the proposed national security agreement. The order doesn't detail the perceived national security risk and only provides a timeline for the national security agreement. The White House declined to provide details on the terms of the agreement. The order said the draft agreement was submitted to U.S. Steel and Nippon Steel on Friday. The two companies must successfully execute the agreement as decided by the Treasury Department and other federal agencies that are part of CFIUS by the closing date of the transaction. Trump reserves the authority to issue further actions regarding the investment as part of the order he signed on Friday. ___ Associated Press writer Marc Levy in Harrisburg, Pa., contributed to this report.


Toronto Star
Canadian and U.S. stocks down after Israeli attacks on Iran, price of oil jumps
TORONTO - Canada's main stock index closed down along with U.S. markets Friday as investors turned cautious following Israeli attacks on Iranian nuclear and military targets. The attacks, which prompted Iran to fire missiles at Israel in retaliation, raised fears the conflict could escalate further and led to a spike in the price of oil. 'It's clearly a risk-off situation, and a spot where people that maybe want to take a little bit of risk off the table have the opportunity to do so,' said Dustin Reid, chief fixed income strategist at Mackenzie Investments. The price of oil, already rising this week, spiked over fears of supply and trade disruptions, with the August crude oil contract up US$4.65 at US$71.29 per barrel. Higher oil prices helped soften the effects of the pullback on the S&P/TSX composite index, which closed down 111.40 points at 26,504.35 but was less affected than U.S. markets, noted Reid. 'You see materials and energy, subcomponents here within the TSX doing a little bit better, and keeping the index probably, you know, outperforming versus others.' The TSX energy index was up 2.8 per cent and gold stocks moved higher as the metal also rose, helping offset losses in most other sectors including financials, telecoms and technology. In New York, the Dow Jones industrial average was down 769.83 points, or 1.8 per cent, at 42,197.79. The S&P 500 index was down 68.29 points at 5,976.97, while the Nasdaq composite was down 255.66 points at 19,406.83. A big concern for markets is that higher oil prices will put pressure on inflation, and in turn affect interest rate decisions, said Reid.
'It's not particularly constructive for the idea that central banks can cut rates any time soon.' The higher prices could also dampen consumer spending, while the wider situation also creates higher degrees of uncertainty, he said. 'It's probably not great for global sentiment, consumer sentiment,' said Reid. 'So I am a little bit concerned here that the gains that have been had over the last handful of weeks, could be somewhat at risk.' The Canadian dollar rose, trading for 73.54 cents US compared with 73.46 cents US on Thursday, thanks in part to higher oil prices, but it didn't move as much as it might have because investors fled to the U.S. dollar for safety, said Reid. 'The Canadian dollar is surprisingly flat, kind of net net today, against the U.S. dollar anyway,' he said. 'We are seeing a decent bid for the U.S. dollar on safe haven, which has not been the case particularly since early April.' The Canadian dollar wasn't helped by manufacturing sales data out Friday that showed a fall of 2.8 per cent in April, the largest monthly drop since October 2023, as the tariff dispute with the United States hit the industry. 'The organic Canadian economy is slowing, and will continue to slow, and you can see it across different spots of the economy, manufacturing clearly being one,' said Reid. The July natural gas contract was up nine cents US at US$3.58 per mmBTU. The August gold contract was up US$50.40 at US$3,452.80 an ounce and the July copper contract was down three cents US at US$4.81 a pound. This report by The Canadian Press was first published June 13, 2025. Companies in this story: (TSX:GSPTSE, TSX:CADUSD)


Vancouver Sun
Delivery services under legal scrutiny for alleged 'drip pricing'
The practice known as 'drip pricing' is front and centre again in an action by the federal Competition Bureau against DoorDash and in a proposed class-action lawsuit brought by a Toronto law firm against Uber Eats. Drip pricing generally involves enticing customers by advertising low prices, but charging extra mandatory fees, usually when they are checking out. It continues to come under fire because 'disclosure around pricing and fees in various consumer transactions is, at times, less than thorough and transparent,' says Mike Robb, partner with London, Ontario-based law firm Siskinds. The Competition Bureau says when 'the represented price is inaccurate, it makes it more difficult for consumers to comparison shop and result(s) in unfair outcomes for honest competitors.' Canada's competition watchdog is hauling DoorDash Inc. and its Canadian subsidiary before the Competition Tribunal, accusing them of portraying the online cost of delivery as lower than the price consumers ultimately pay. The Competition Bureau says it investigated and is alleging DoorDash customers paid more, due to mandatory fees, added during checkout. The extra fees, the bureau says, include charges such as extra amounts for delivering items a further distance and for placing smaller orders. The bureau alleges the discretionary charges were sometimes framed as taxes. The bureau alleges DoorDash may have used drip pricing for close to a decade to make nearly $1 billion from mandatory fees, according to the Canadian Press.
The bureau is asking the Competition Tribunal to order the company to stop the practice, cease portraying fees as taxes, pay a penalty and issue restitution to affected consumers. However, DoorDash is pushing back. 'This application is a misguided and excessive attempt to target one of Canada's leading local commerce platforms,' DoorDash spokesperson Trent Hodson told CP. 'It unfairly singles out DoorDash, and we intend to vigorously defend ourselves against these claims.' Still, the bureau is standing its ground. 'Our litigation against DoorDash is another example of our efforts to ensure consumers are not misled and can trust the prices they see online. We urge all businesses to review their pricing practices and make sure they comply with the law,' said Matthew Boswell, commissioner of competition, in a press release. The Competition Bureau has been more aggressive of late in battling drip pricing. Last fall, the bureau won a deceptive marketing case against Cineplex Inc., noted Robb. It had been adding a mandatory $1.50 online booking fee. The company was ordered to pay a financial penalty of almost $39 million. Last summer, says Robb, the bureau reached an agreement with SiriusXM Canada. In that case, the company was ordered to pay a $3.3 million penalty over adding a fee on subscription plans that increased the monthly cost. Meanwhile, legal action against drip pricing is not exclusive to public regulators. Law firms that navigate class actions are getting in on the act too. Toronto firm Koskie Minsky filed a statement of claim against Uber Eats with the Ontario Superior Court of Justice last month. It alleges Uber Eats has been hiding an additional service fee within its overall delivery costs.
The proposed class action alleges that Uber misrepresented the true cost of delivery by not disclosing the service fee until the final stage of the transaction, 'often obscured under a 'Taxes & Other Fees' line item, a practice known as drip pricing,' says the law firm on its website. The action has been brought on behalf of Canadian residents who on or after May 16, 2023, placed a delivery order using Uber Eats and paid a service fee. Further, the lawsuit alleges Uber One members, who are supposed to enjoy benefits such as no delivery fees on eligible orders, have been paying the service fee. It's 'really a delivery fee as it only applies to delivery orders' and it 'constitutes a breach of contract and negates the advertised benefit of the subscription.' Robb says 'the existence of parallel proceedings in these cases is not necessarily surprising or unusual.' He explains that the Competition Bureau has a statutory mandate to protect Canadian consumers and businesses from allegedly unfair business practices. In its case against DoorDash, it is asking the Competition Tribunal to provide restitution to consumers, though that's somewhat unusual, he says. 'It may or may not be equipped to negotiate and deliver remedies to consumers.' However, he points out that class actions always focus on recovery for consumers, 'even when the amounts are individually minimal. It is common in our cases that when they resolve, an administration mechanism is established to facilitate an accessible distribution of modest amount to individual consumers.' A recent example would be a payout website established for the bread-fixing class-action settlement.