
AI chatbots need more books - these libraries are opening their stacks
Everything ever said on the Internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century - and in 254 languages - are part of a Harvard University collection being released to AI researchers on Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
"It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright," said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold "significant amounts of interesting cultural, historical and language data" that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by "unrestricted gifts" from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
"We're trying to move some of the power from this current AI moment back to these institutions," said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab.
"Librarians have always been the stewards of data and the stewards of information."
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper.
One of the earliest works is from the 1400s - a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
"A lot of the data that's been used in AI training has not come from original sources," said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes "all the way back to the physical copy that was scanned by the institutions that actually collected those items," he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books.
They just needed lots of what computer scientists call tokens - units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a fraction of what's being fed into the most advanced AI systems. Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
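To put those figures in perspective, a common back-of-the-envelope heuristic (an assumption here, not something stated in the article - real tokenizers vary by model and language) treats roughly four characters of English text as one token:

```python
# Rough token-count estimate using the common ~4-characters-per-token
# heuristic for English text. Actual tokenizers split words differently,
# so this is only a ballpark figure.
def estimate_tokens(text: str) -> int:
    """Estimate how many tokens a piece of text would yield."""
    return max(1, len(text) // 4)

sample = "Librarians have always been the stewards of data."
print(estimate_tokens(sample))  # roughly a dozen tokens for this sentence
```

By that heuristic, Harvard's 242 billion tokens correspond to roughly a trillion characters of text - and Meta's 30-trillion-token training run is still more than a hundred times larger.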
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from "shadow libraries" of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated US$50mil (RM211mil) this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitising rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the US, the library made clear that any information it digitised would be for everyone, said Jessica Chapel, its chief of digital and online services.
"OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning," Chapel said.
Digitisation is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th century by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
"We've been very clear that, 'Hey, we're a public library,'" Chapel said. "Our collections are held for public use, and anything we digitised as part of this project will be made public."
Harvard's collection was already digitised starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works. It was finally settled in 2016 when the US Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the US typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared on Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be "immensely critical" for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
"At a university, you have a lot of pedagogy around what it means to reason," Leppert said. "You have a lot of scientific information about how to run processes and how to run analyses."
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
"When you're dealing with such a large data set, there are some tricky issues around harmful content and language," said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab. She said the initiative is trying to provide guidance on mitigating the risks of using the data, to "help them make their own informed decisions and use AI responsibly." - AP