
Beyond the internet: AI learning from 15th-century texts
Everything ever said on the internet was just the start of teaching artificial intelligence about humanity. Tech companies are now tapping into an older repository of knowledge: the library stacks.
Nearly one million books published as early as the 15th century — and in 254 languages — are part of a Harvard University collection being released to AI researchers Thursday. Also coming soon are troves of old newspapers and government documents held by Boston's public library.
Cracking open the vaults to centuries-old tomes could be a data bonanza for tech companies battling lawsuits from living novelists, visual artists and others whose creative works have been scooped up without their consent to train AI chatbots.
'It is a prudent decision to start with public domain data because that's less controversial right now than content that's still under copyright,' said Burton Davis, a deputy general counsel at Microsoft.
Davis said libraries also hold 'significant amounts of interesting cultural, historical and language data' that's missing from the past few decades of online commentary that AI chatbots have mostly learned from.
Supported by 'unrestricted gifts' from Microsoft and ChatGPT maker OpenAI, the Harvard-based Institutional Data Initiative is working with libraries around the world on how to make their historic collections AI-ready in a way that also benefits libraries and the communities they serve.
'We're trying to move some of the power from this current AI moment back to these institutions,' said Aristana Scourtas, who manages research at Harvard Law School's Library Innovation Lab. 'Librarians have always been the stewards of data and the stewards of information.'
Harvard's newly released dataset, Institutional Books 1.0, contains more than 394 million scanned pages of paper. One of the earlier works is from the 1400s: a Korean painter's handwritten thoughts about cultivating flowers and trees. The largest concentration of works is from the 19th century, on subjects such as literature, philosophy, law and agriculture, all of it meticulously preserved and organized by generations of librarians.
It promises to be a boon for AI developers trying to improve the accuracy and reliability of their systems.
'A lot of the data that's been used in AI training has not come from original sources,' said the data initiative's executive director, Greg Leppert, who is also chief technologist at Harvard's Berkman Klein Center for Internet & Society. This book collection goes 'all the way back to the physical copy that was scanned by the institutions that actually collected those items,' he said.
Before ChatGPT sparked a commercial AI frenzy, most AI researchers didn't think much about the provenance of the passages of text they pulled from Wikipedia, from social media forums like Reddit and sometimes from deep repositories of pirated books. They just needed lots of what computer scientists call tokens — units of data, each of which can represent a piece of a word.
Harvard's new AI training collection has an estimated 242 billion tokens, an amount that's hard for humans to fathom but still just a small fraction of what's being fed into the most advanced AI systems.
Facebook parent company Meta, for instance, has said the latest version of its AI large language model was trained on more than 30 trillion tokens pulled from text, images and videos.
Meta is also battling a lawsuit from comedian Sarah Silverman and other published authors who accuse the company of stealing their books from 'shadow libraries' of pirated works.
Now, with some reservations, the real libraries are standing up.
OpenAI, which is also fighting a string of copyright lawsuits, donated $50 million this year to a group of research institutions including Oxford University's 400-year-old Bodleian Library, which is digitizing rare texts and using AI to help transcribe them.
When the company first reached out to the Boston Public Library, one of the biggest in the U.S., the library made clear that any information it digitized would be for everyone, said Jessica Chapel, its chief of digital and online services.
'OpenAI had this interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So this is kind of just a case that things are aligning,' Chapel said.
Digitization is expensive. It's been painstaking work, for instance, for Boston's library to scan and curate dozens of New England's French-language newspapers that were widely read in the late 19th and early 20th centuries by Canadian immigrant communities from Quebec. Now that such text is of use as training data, it helps bankroll projects that librarians want to do anyway.
'We've been very clear that, 'Hey, we're a public library,'' Chapel said. 'Our collections are held for public use, and anything we digitized as part of this project will be made public.'
Harvard's collection was already digitized starting in 2006 for another tech giant, Google, in its controversial project to create a searchable online library of more than 20 million books.
Google spent years beating back legal challenges from authors to its online book library, which included many newer and copyrighted works.
The dispute was finally settled in 2016 when the U.S. Supreme Court let stand lower court rulings that rejected copyright infringement claims.
Now, for the first time, Google has worked with Harvard to retrieve public domain volumes from Google Books and clear the way for their release to AI developers. Copyright protections in the U.S. typically last for 95 years, and longer for sound recordings.
How useful all of this will be for the next generation of AI tools remains to be seen as the data gets shared Thursday on the Hugging Face platform, which hosts datasets and open-source AI models that anyone can download.
The book collection is more linguistically diverse than typical AI data sources. Fewer than half the volumes are in English, though European languages still dominate, particularly German, French, Italian, Spanish and Latin.
A book collection steeped in 19th century thought could also be 'immensely critical' for the tech industry's efforts to build AI agents that can plan and reason as well as humans, Leppert said.
'At a university, you have a lot of pedagogy around what it means to reason,' Leppert said. 'You have a lot of scientific information about how to run processes and how to run analyses.'
At the same time, there's also plenty of outdated data, from debunked scientific and medical theories to racist narratives.
'When you're dealing with such a large data set, there are some tricky issues around harmful content and language,' said Kristi Mukk, a coordinator at Harvard's Library Innovation Lab who said the initiative is trying to provide guidance about mitigating the risks of using the data, to 'help them make their own informed decisions and use AI responsibly.'