AI chatbots need more books to learn from

CTV News, a day ago

CAMBRIDGE, Mass — Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s — a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but it's still just a drop of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
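To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this story. The variable and function names are illustrative, and the comparison is loose: the Meta figure is a lower bound ("more than 30 trillion") and covers images and video as well as text.

```python
# Figures quoted in the article; everything derived from them is approximate.
HARVARD_TOKENS = 242e9   # estimated tokens in Harvard's collection
HARVARD_PAGES = 394e6    # "more than 394 million" scanned pages (lower bound)
META_TOKENS = 30e12      # "more than 30 trillion" tokens (lower bound)


def share_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Harvard's collection as a share of Meta's reported training data.
share = share_of(HARVARD_TOKENS, META_TOKENS)

# Average number of tokens per scanned page, using the quoted page count.
tokens_per_page = HARVARD_TOKENS / HARVARD_PAGES

print(f"Harvard collection vs. Meta training data: about {share:.1f}%")
print(f"Average tokens per scanned page: about {tokens_per_page:.0f}")
```

Because both the page count and the Meta token count are reported as minimums, the true share and the true per-page average would be somewhat lower than what this prints.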
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated US$50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, 'Hey, we're a public library,'' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'
————
The Associated Press and OpenAI have a licensing and technology agreement that allows OpenAI access to part of AP's text archives.
Matt O'Brien, The Associated Press


Related Articles

Stock Market News for Jun 13, 2025

Globe and Mail

30 minutes ago

U.S. stock markets closed higher on Thursday as market participants weighed the outcome of the U.S.-China trade talks. Softer-than-expected key inflation data and weak labor market data bolstered investor sentiment. All three major stock indexes ended in positive territory.

How Did The Benchmarks Perform?

The Dow Jones Industrial Average (DJI) rose 0.2% or 101.85 points to close at 42,967.62. Notably, 19 components of the 30-stock index ended in positive territory and 11 finished in the negative zone. The tech-heavy Nasdaq Composite finished at 19,662.48, advancing 0.2% due to the strong performance of technology bigwigs. AI-based semiconductor giants like NVIDIA Corp. (NVDA) and Broadcom Inc. (AVGO) rose 1.5% and 1.3%, respectively. Both stocks currently carry a Zacks Rank #3 (Hold).
The S&P 500 gained 0.3% to finish at 6,045.26. Wall Street's most observed benchmark is currently less than 2% away from its all-time high. Seven out of 11 broad sectors of the broad-market index ended in positive territory, while three ended in the negative zone and one remained unchanged. The Utilities Select Sector SPDR (XLU) and the Technology Select Sector SPDR (XLK) rose 1.2% and 0.9%, respectively. On the other hand, the Communication Services Select Sector SPDR (XLE) fell 0.8%.
The fear-gauge CBOE Volatility Index (VIX) was up 4.4% to 18.02. A total of 23.5 billion shares were traded on Thursday, higher than the last 20-session average of 18 billion. The S&P 500 recorded 12 new 52-week highs and 3 new 52-week lows. The Nasdaq registered 54 new 52-week highs and 63 new 52-week lows.

Investors Weigh U.S.-China Trade Talk

The United States and China reached an agreement on trade and tariffs in London. U.S. Commerce Secretary Howard Lutnick said, 'We have reached a framework to implement the Geneva consensus and the call between the two presidents.' This was echoed by Li Chenggang, China's international trade representative and a vice minister at China's Commerce Ministry. President Donald Trump said that the deal with China is 'done, subject to final approval with President Xi and me.'
The U.S. was seeking confirmation that China would restore critical mineral (rare earth) exports. Beijing protested against the U.S. Commerce Department's warnings to U.S. chipset manufacturers against using Chinese semiconductors. On May 12, the United States and China agreed to a 90-day pause in tariff implementation.

Economic Data

The Department of Labor reported that initial claims remained flat at 248,000 for the week ended Jun 7, higher than the consensus estimate of 246,000. The previous week's data was revised marginally upward by 1,000 from the 247,000 reported earlier. Continuing claims (those who have already received government aid, reported a week behind) increased 54,000 to 1.956 million. This is the highest level for insured unemployment since Nov 13, 2021. The previous week's data was revised downward by 2,000 to 1.904 million.
The Department of Labor reported that the producer price index (PPI) increased 0.1% in May, less than the consensus estimate of 0.2%. The metric for April was revised upward to a decline of 0.2% from a drop of 0.5% reported earlier. Year over year, PPI increased 2.6% in May. Core PPI (excluding volatile food and energy items) increased 0.2% in May, less than the consensus estimate of 0.3%. The metric for April was revised downward to 0.3% from 0.4% reported earlier. Year over year, core PPI increased 2.7% in May.

U.S. equity fund outflows ease on cooling inflation pressure, trade deal optimism

CTV News

41 minutes ago

Photo: Trader Ryan Falvey works on the floor of the New York Stock Exchange, Monday, June 9, 2025. (AP Photo/Richard Drew)
U.S. equity funds witnessed the smallest weekly net disposal in four weeks in the week through June 11, as a smaller-than-expected rise in consumer prices in May and a U.S. trade deal with China eased investor worries. According to LSEG Lipper data, investors liquidated just $212 million worth of U.S. equity funds during the week, the smallest weekly net outflow since approximately $13.65 billion worth of net purchases a month ago. U.S. sectoral funds, however, still witnessed net inflows worth a sharp $1.53 billion, the biggest weekly amount in four weeks. The communication services, financial and industrial sectors led the gains, with $529 million, $399 million and $388 million in net inflows, respectively. The equity large-cap, mid-cap and small-cap fund segments, meanwhile, faced a net $2.65 billion, $1.35 billion and $100 million worth of sales, respectively. Investors added money to U.S. bond funds for an eighth consecutive week, with $4.08 billion worth of weekly net purchases. They racked up U.S. short-to-intermediate investment-grade funds, short-to-intermediate government & treasury funds, and municipal debt funds worth a notable $2.37 billion, $1.02 billion and $523 million, respectively. At the same time, money market funds had a net $15.18 billion worth of weekly outflows, partly reversing a significant $66.24 billion weekly inflow in the previous week.

Fed expected to keep interest rates steady as tariff risks outweigh inflation data

Globe and Mail

41 minutes ago

The Federal Reserve is widely expected to hold interest rates steady next week, with investors focused on new central bank projections that will show how much weight policy makers are putting on recent soft data and how much risk they attach to unresolved trade and budget issues. The release of a series of inflation readings has eased concern that the tariffs imposed by U.S. President Donald Trump would translate quickly into higher prices, while the latest monthly employment report showed slowing job growth - a combination that, all things equal, would put the Fed closer to resuming its rate cuts. Trump has demanded the U.S. central bank lower its benchmark overnight interest rate immediately by a full percentage point, a dramatic step that would amount to an all-in bet by the Fed that inflation will fall to its 2-per-cent target and stay there regardless of what the administration does and even with dramatically looser financial conditions.
Yet the President's push to rewrite the rules of global trade remains a work in progress. Since the Fed's last policy meeting in May, the administration delayed until next month a threatened round of global tariffs that central bank officials worry could lead to both higher inflation and slower growth if implemented; trade tensions between the U.S. and China have eased but not been resolved; and the terms of a massive budget and tax bill under consideration in Congress are far from settled. When Fed officials issued their last set of quarterly projections in March, anticipating two quarter-percentage-point rate cuts this year, Fed Chair Jerome Powell noted the role that inertia can play in moments when the outlook is so unclear that 'you just say 'maybe I'll stay where I am,'' a sentiment that may last as long as the tariff debate remains unresolved.
'Recent Fed commentary has reinforced a wait-and-see approach, with officials signaling little urgency to adjust policy amid increased uncertainty around the economic outlook,' Gregory Daco, chief economist at EY-Parthenon, wrote in the run-up to the Fed's June 17-18 meeting. Daco said he anticipates the median rate projection among the Fed's 19 policy makers to still show two rate cuts in 2025, with an overall tone of 'cautious patience' and 'little in the way of forward guidance' given the uncertainty weighing on households and businesses.
That view aligns roughly with what investors in contracts tied to the Fed's policy rate currently expect, though pricing shifted towards a possible third rate cut this week after data showed consumer and producer prices both increased less than expected in May. While year-over-year inflation measured by the Fed's preferred Personal Consumption Expenditures Price Index is around half a percentage point above the central bank's target, recent data show it running close to 2 per cent for the past three months once the more volatile food and energy components are excluded. The unemployment rate, meanwhile, has remained at 4.2 per cent for the past three months. The Fed's policy rate was set in the current 4.25 per cent to 4.5 per cent range in December when the U.S. central bank cut it by a quarter of a percentage point in what officials at the time expected would be a steady series of reductions in borrowing costs spurred by slowing inflation. The trade policy Trump pursued after he returned to office on Jan. 20, however, raised the risk of higher inflation and slower growth, an outcome that would put the Fed in the uncomfortable position of having to choose whether to focus on keeping inflation at its 2-per-cent target or supporting the economy and sustaining low unemployment.
The risk of that worst-of-both-worlds outcome has eased since the early spring, when Trump's 'Liberation Day' slate of global tariffs caused a market backlash and led to widespread forecasts of a U.S. recession before the president backed down. In its most recent analysis, Goldman Sachs analysts lowered the odds of a recession to around 30% and said they now see a bit less inflation and slightly higher growth this year. Yet that analysis did not prompt a shift in the investment bank's Fed rate outlook, which currently expects higher inflation numbers over the summer to sideline the central bank until December. The Fed itself may see its median rate projection fall to a single quarter-percentage-point cut this year if only due to the passage of time, noted Tim Duy, chief U.S. economist at SGH Macro Advisors. With three fewer months in the year to make changes in policy and so many major issues outstanding, 'if the Fed retained two cuts ... it would have more confidence in those two cuts than in March,' Duy wrote. 'But ... participants have less confidence in rate cuts since 'Liberation Day,' and that should be reflected' in the new projections. It would only take two officials to change their outlooks for the Fed's projected rate reductions to shift more toward next year. There's another scenario, one in which the weak pass-through from tariffs to inflation is due to weakening demand as consumers pay more for imported goods by cutting back on services, a dynamic that may already be developing. The retail sales report for May, which is due to be released next week ahead of the Fed meeting, may provide insight into that issue. But Citi economists say they think weakening demand will keep inflation down, lead to rising unemployment, and prompt the central bank to cut rates faster than expected, beginning in September and continuing at each meeting from there into 2026. 
'Tariffs may eventually boost some goods prices, but the broad-based slowing in core services inflation will make this a one-time price increase,' the Citi analysts wrote. 'Markets have yet to internalize that softer demand will lead to cooler inflation but also to rising unemployment ... The path to Fed rate cuts is becoming increasingly clear.'
