It turns out you can train AI models without copyrighted material

05-06-2025

AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material.
The paper (via The Washington Post) was a collaboration among 14 institutions. The authors represent universities like MIT, Carnegie Mellon and the University of Toronto. Nonprofits like Vector Institute and the Allen Institute for AI also contributed.
The group built an 8 TB ethically sourced dataset. Among the data was a set of 130,000 books in the Library of Congress. They then trained a seven-billion-parameter large language model (LLM) on that data. The result? It performed about as well as Meta's similarly sized Llama 2-7B from 2023. The team didn't publish benchmarks comparing its results to today's top models.
Performance comparable to a two-year-old model wasn't the only downside. The process of putting it all together was also a grind. Much of the data couldn't be read by machines, so humans had to sift through it. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people," co-author Stella Biderman told WaPo. "And that's just really hard." Figuring out the legal details also made the process hard. The team had to determine which license applied to each website they scanned.
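To give a rough sense of what that per-source license check involves, here is a minimal Python sketch. It is not the researchers' pipeline: the `Document` class, the `OPEN_LICENSES` allowlist and the `split_by_license` helper are all hypothetical names invented for illustration, and the licenses the team actually accepted may differ. The idea is simply that anything without a recognized open license gets set aside for a person to check rather than being used automatically.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical allowlist for illustration only; the paper's real criteria may differ.
OPEN_LICENSES = {"public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "mit", "apache-2.0"}


@dataclass
class Document:
    url: str
    text: str
    license: Optional[str]  # None means no license could be determined


def normalize(license_tag: Optional[str]) -> Optional[str]:
    """Lowercase and hyphenate a license tag so spelling variants compare equal."""
    if license_tag is None:
        return None
    return license_tag.strip().lower().replace(" ", "-")


def split_by_license(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Return (kept, needs_review): openly licensed documents and everything else."""
    kept, needs_review = [], []
    for doc in docs:
        if normalize(doc.license) in OPEN_LICENSES:
            kept.append(doc)
        else:
            # Unknown or restrictive license: leave it for a human to check.
            needs_review.append(doc)
    return kept, needs_review
```

Even in this toy form, the second list is where the manual work Biderman describes would pile up: every document with a missing or ambiguous license needs a person to look at it.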
So, what do you do with a less powerful LLM that's much harder to train? If nothing else, it can serve as a counterpoint.
In 2024, OpenAI told a British parliamentary committee that such a model essentially couldn't exist. The company claimed it would be "impossible to train today's leading AI models without using copyrighted materials." Last year, an Anthropic expert witness added, "LLMs would likely not exist if AI firms were required to license the works in their training datasets."
Of course, this study won't change the trajectory of AI companies. After all, more work to create less powerful tools doesn't jibe with their interests. But at least it punctures one of the industry's common arguments. Don't be surprised if you hear about this study again in legal cases and regulation arguments.

Divers reveal images of 321-year-old shipwreck in remarkable condition off coast

Fox News

27 minutes ago

A preserved shipwreck from 1703 was recently documented in stunning detail for the first time, with experts cautioning that it may not stay this way for very long. Researchers dove off the coast of Kent, England, to view the wreck of "The Northumberland" last summer. The Stuart-era ship was built in 1679 and sank during the Great Storm of 1703. The site was designated a Protected Wreck Site in 1981, but it hasn't been seen so clearly until now. Officials revealed the results of the dive on July 31, sharing pictures of a shipwreck covered by marine sediment. The sands have aided the survival of the wreck, which is roughly 50 to 65 feet underwater. The dive was conducted by Historic England, British coastal contractor MSDS Marine and Dan Pascoe, the licensee of the wreck. Historic England told Fox News Digital that divers found an extensive hull structure, exposed deck planks and the wooden frame of the ship, which is "much more than previously thought." Among the finds were multiple wooden chests, some still containing musketballs, as well as one sealed chest with unknown contents. Researchers also came across seven iron cannons, along with copper cauldrons and rope. Experts cited shifting sands as the main threat to the wreck. Hefin Meara, a maritime archaeologist at Historic England, told Fox News Digital the sand on the coast of England is "highly dynamic." Only the most robust materials, such as anchors and iron cannons, tend to survive. "The Goodwin Sands provide an excellent environment for the preservation of organic material, such as ship timbers, rope and other objects," he said. "Once the sand cover migrates away from the wreck site, biological and physical processes can cause the wreck to deteriorate very quickly," he also said.
He noted that archaeologists will continue focusing on surveying the site instead of removing the artifacts, which could jeopardize the integrity of the site. Pascoe noted that "The Northumberland" "has the potential to be one of the best-preserved wooden warships in the U.K." MSDS's Alison James emphasized the wealth of information that the wreck could provide about the Stuart era. "'The Northumberland' has so much potential to tell us more about the English Navy and ships of the period," she said. Many historic shipwrecks have been found and documented across the United Kingdom in recent years. In Feb. 2024, a teenager found an American Revolution warship on a Scottish beach. More recently, a former military pilot identified a 19th-century shipwreck in the English Channel.

What Elite Tech Students Are Learning from Poetry

Atlantic

2 hours ago

One of the highlights of my first three years as a literature professor at MIT—and indeed, of my 15-year career as an educator—has been the recent discovery that some of my students, past and present, formed an arts collective: The People's Poetry. It began, I was told, with the first class I taught at the institute. Several students in that course, ' Reading Poetry: Social Poetics,' created their own group chat, and eventually started meeting outside of class to write together. Every time I taught a new course, their membership grew. These engineers and scientists in training, hailing from across the world, were gathering to compose and critique poems outside the classroom. Many of these young people were, in other classes, studying or even actively developing forms of technology that raise a range of questions about the purpose and power of human expression: why humans write or draw; what ethics govern our inspiration and training; how the creative act brings us together and alters our thinking. In the midst of a technological revolution, while taking on a notoriously difficult courseload, why have they chosen to devote their time to the ancient art of making poems? These kinds of questions are not unprecedented at the institute. In the early 1960s, the reading series Poetry From M.I.T. explored the relationship between a strong technical education and the pursuit of the good and the beautiful. In service of this larger inquiry, the series organizers invited renowned writers such as Robert Penn Warren, Denise Levertov, and Richard Wilbur to campus to share their work. These events were broadcast on WGBH, a Boston public-radio station, and featured timely insights on where the practice of poetry and the future of technology might intersect. My students bring some version of this exploration into the classroom with striking consistency—most vividly in their observations of how it feels to use poetry to work through our obsessions, our dreams, in times like these. 
And at a place like this, no less: an elite research university where they spend most of their time working on projects that feel orthogonal to that sort of labor. The poet W. S. Merwin once said that you know you are writing a poem when a 'sequence of words starts giving off what you might describe as a kind of electric charge.' I've been thinking about how to place the sort of liveness Merwin describes—the sense of your body as a living circuit that the poem moves through—in a world filling up with noise, marred by misdirection and distraction. When, how, and why do we make room for the miraculous? From moment to moment. In any way we can. Because it is part of the practice of being human. A poem is not merely a record of human activity; it is intended to preserve the complexity, richness, and granular details of our inner lives. Poems provide an occasion for us to talk with one another, creating a shared monument we can carry into the future, establishing a rolling record of our heroes, our planet, our kin. This art form keeps what we love from disappearing. In Odes, the Roman poet Horace writes: 'Many heroes lived before Agamemnon / but they are all unweepable, overwhelmed / by the long night of oblivion / because they lacked a sacred bard.' He is referring to Homer's epic The Iliad, a poem that survived by being passed down through live performance long before it was committed to paper. The preservation of the poem's history, in this case, was a communal affair: from bard to bard, and audience to audience, across time and space. In a moment marked by widespread institutional investment in the promise of artificial intelligence, we should be asking more about not only what AI can and cannot do but what drives the desire for its proliferation: what hope, what sense of longing, boredom, or emptiness. A large language model is a prediction machine. Crucially, it does not think or dream. 
It establishes the likeliest sequence of words based on its training data and relays it back to you. A well-crafted poem performs a nearly opposite function. It is made from original, dynamic language choices, and it lives and dies on its ability to surprise. It is a means of preserving the particular. And yet I'm led to wonder whether the hunger for connection, understanding, and astonishment that seems to characterize much of the public interest in AI derives from the same needs that poetry fulfills. The AI market thrives in part as a result of our desire for optimization, efficiency. Brevity is among poetry's greatest advantages; a poem can be written in minutes at the bus stop, during a break at work, or in those first quiet moments after dawn. Any occasion can offer inspiration: Gwendolyn Brooks composed the classic eight-line poem 'We Real Cool' after seeing a group of pool players one afternoon in Chicago; Percy Shelley wrote 'Ozymandias' during a competitive exchange of sonnets with a friend. At a book-launch party years ago in downtown Manhattan, I saw Sunni Patterson write a poem on the spot, minutes before going onstage, that incorporated lines from other performers who had recited their work throughout the night. In performance, this was both a mesmerizing display of processing speed and a form of loving citation. The sheer velocity of this kind of language bears a trace of the supernatural. The words can appear to arrive from elsewhere, produced by an elevated consciousness outside our own. A mode of technology that conceals a lack of vetting, understanding, or humanity might bear a resemblance to such a consciousness in moments, but the source of its speed is not—as with Brooks, Shelley, or Patterson—a life spent working toward competence. It does not emerge from centuries of inherited language, or a bond forged for the first time in a room full of strangers and friends. 
It's worth asking where the warmth of poetry, its connective power across millennia, meets the advances and demands of our technological age. From the invention of writing to the advent of the typewriter to the rise of the personal computer as collaborator, authors have attempted to address this quandary. Twentieth-century poets across a wide aesthetic range— Robert Frost, Sun Ra, X. J. Kennedy, Nikki Giovanni —asked us to consider 'our place among the infinities,' as Frost once put it, the link between our timeless yearning for the stars and the scientific leaps that brought them closer to us. More contemporary writers, including Lillian-Yvonne Bertram and Keith S. Wilson (both of whom are also programmers), have designed works that combine the human voice and the music of machines. Once you know where to look, the overlap is astonishing. One of poetry's greatest gifts is patience—not only with the difficulties of language but with ourselves as its vessels or makers, working to bring a new vision into the world. I see this dynamic firsthand in the form of an assignment I have been offering students for almost a decade now: the end-of-semester adaptation. Therein, I ask them to take a text we've studied over the course of the term and transform it using the tools of a different genre. Essays, short stories, and poems metamorphose into works of choreography, short films, and, on several occasions now, projects that pull from both the digital world and the living environment. Matt, for instance, adapted a Lorraine Hansberry play, What Use Are Flowers?, into a device of his own invention called Melia, which uses a field microphone, an old physics-lab computer, and a neural-net algorithm to meld the human voice with the sounds of the natural world. To truly experience Melia, you have to go outside. You have to find a place by a river, or a grove where the cardinals are talking, or a spot where the breeze is blowing through a tupelo tree, and begin to sing. 
Suddenly, the voice you have always known is expanded, made new. Yasmeen, in another project, transformed Nikki Giovanni's 'Winter Poem' into a series of digital collages in which people have become flowers while remaining in familiar settings and dress (imagine a bouquet of hydrangea dressed in overalls, standing in front of a farmhouse, or a rush of rhododendron in a blue suit walking down a crowded street, and you might be close). Elizabeth took a third approach. Inspired by a class session on art-making, AI, and human imagination, she proposed a community program: Songbirds. Since her freshman year, Elizabeth has been visiting a local hospice—playing piano for elders, going on walks with them, and learning about their lives. With Songbirds, she wanted to add another element to the visits: the collaborative adaptation of memories into works of art. For this work, she initially thought of employing various AI tools as a primary means of approach. But she eventually decided to also call upon a range of older, more familiar technologies: her trusted piano, notebooks for poems, production software to engineer instrumentals. For anyone at the hospice who might be losing pieces of the past—the story of the moment he met his first great love; the last time he saw his mother alive; the day his daughter was born, said her first word, or first ran across the living room into his arms—a memory could now be preserved, with a bit of assistance, in the form of a song or poem. In work like this, musicians, writers, and engineers all share space. They collaborate in service of human life and the preservation of all we adore. They remind us that poetry has always been a technology of memory and human connection: a way to remind ourselves of who and what we are to one another. Which is something infinitely more than we can say with words, although we must try—and in that striving, be made more lovely, and alive.

Moving Past AI: Building Augmented Intelligence

Forbes

2 hours ago

James DiNardo is CEO of Like many I have spoken with, my team and I have been thinking deeply about artificial intelligence (AI). It's the story shaping our time, reshaping how businesses run, exciting investors and sparking a worldwide rush for tech leadership. Companies like OpenAI, Google, Meta and xAI are in the thick of it, racing to produce the smartest large language model (LLM) while facing strong rivals from Asia. Markets are buzzing: chip makers, robotics, self-driving cars and related fields are booming. What's amazing is the rise of tools claiming intelligence that match or beat human expertise in specific areas. Take xAI's newest release, Grok 4, which boasts knowledge beyond a PhD in fields from genetics to law, politics to chemistry. Essentially everything. Even exceptional polymathic humans are likely to master only a few areas in a lifetime, limited by time and focus. These language models quickly pull together deep insights across genres. That's revolutionary and some tout that artificial general intelligence (AGI) has already arrived and is in the stages of refinement. AI has become a buzzword in marketing. Some organizations choose descriptors like 'powered by' while using the tactics above. Though it's true that these tools can improve and enhance, there are many levels to implementation. Organizations that choose to only use the technology minimally in order to market with it can undermine confidence in the power of these systems. Not to mention, there are those who believe that the impact on the way we work or live will be so drastic in a few decades we will scarcely remember the way things used to be. We stand on the precipice of dynamic change the likes of which none of us have ever seen. Both positive and negative outcomes are possible. How leaders use AI systems today could shape the future. Some leaders use AI like an old-school search: quick questions for fast answers. A fact check, email rewrite, price comparison or data cleanup. 
Helpful, but shallow. They're not yet partnering with these tools on a deeper level for real thinking, decisions and future planning. Unfortunately, for most business leaders, we are faced with a dilemma. Learn to adapt or choose to ignore these developments. Perhaps at our own peril. Here's a better way: move past AI. Augmented intelligence means making AI understand, not just reply. It involves creating a "context engine." This is a custom base of your data, history, strategies and unique perspective. The process is time-consuming on the front end, as time must be invested to teach the LLM about background, goals, processes, competitors and inside knowledge. However, the more time invested, the richer the reward. By taking the time to educate the LLM, we can shift its role to act as a trusted advisor who knows your business, speaks your language and shares your values. Over time, my team and I have begun to test our LLM with context questions like, "What do we know so far?" or "How does this fit our plan?" The answers feel custom-made, full of relevant depth. We've continued to add layers by asking our LLMs to consult like a group of experts—Simon Sinek on leadership, Naval Ravikant on choices or Brené Brown on emotions. LLMs can pull known perspectives from these thought leaders to provide rich, blended views for strategy, branding or growth. The uses seem endless. I have even shared my personal goals, my values and what is important to me. Now AI takes these into consideration when I weigh choices that have both business and personal implications. One recent response I received was, 'Sounds fantastic. Consider this strategy will add additional workload to you and the team and could impact time with family you have shared as important. Suggest one approach could be to delegate or hire for this project.' These models often agree with your ideas, which isn't always a good thing. They don't care about your success like you do. 
One best practice is to ask detailed questions, demand data with sources, then check them yourself or have them checked. We find answers are often correct but not always perfect, and they can echo your biases. We have also found, one or two times, answers that were completely rogue. For this reason, we prefer to ask well-crafted, complex queries in a search for data-based views. Then we use the results to make decisions ourselves or with a team. Trust your instincts. In the end, while everyone's chasing AI, the edge goes to those crafting contextual, strategic augmented systems. In a fast-moving economy, the line between using AI and truly working with it could decide who leads and who follows.
