
Latest news with #Llama2-7B

It turns out you can train AI models without copyrighted material

Engadget · 2 days ago

AI companies claim their tools couldn't exist without training on copyrighted material. It turns out, they could — it's just really hard. To prove it, AI researchers trained a new model that's less powerful but much more ethical. That's because the LLM's dataset uses only public domain and openly licensed material.

The paper (via The Washington Post) was a collaboration between 14 different institutions. The authors represent universities like MIT, Carnegie Mellon and the University of Toronto. Nonprofits like the Vector Institute and the Allen Institute for AI also contributed.

The group built an 8 TB ethically sourced dataset. Among the data was a set of 130,000 books in the Library of Congress. After inputting the material, they trained a seven-billion-parameter large language model (LLM) on that data. The result? It performed about as well as Meta's similarly sized Llama 2-7B from 2023. The team didn't publish benchmarks comparing its results to today's top models.

Performance comparable to a two-year-old model wasn't the only downside. The process of putting it all together was also a grind. Much of the data couldn't be read by machines, so humans had to sift through it. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people," co-author Stella Biderman told WaPo. "And that's just really hard." Figuring out the legal details also made the process hard. The team had to determine which license applied to each website they scanned.

So, what do you do with a less powerful LLM that's much harder to train? If nothing else, it can serve as a counterpoint. In 2024, OpenAI told a British parliamentary committee that such a model essentially couldn't exist. The company claimed it would be "impossible to train today's leading AI models without using copyrighted materials." Last year, an Anthropic expert witness added, "LLMs would likely not exist if AI firms were required to license the works in their training datasets."

Of course, this study won't change the trajectory of AI companies. After all, more work to create less powerful tools doesn't jibe with their interests. But at least it punctures one of the industry's common arguments. Don't be surprised if you hear about this study again in legal cases and regulation arguments.
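That license-determination step is easy to state but hard to automate, which is what made the project such a grind. Here is a minimal Python sketch of the idea; the document records, the allowlist of licenses and the field names are hypothetical illustrations, not the project's actual tooling:

```python
# Hypothetical sketch of the kind of license screening the researchers
# describe: keep a document only if it carries an explicit, open license.
# The license list is illustrative, not the project's actual criteria.

OPEN_LICENSES = {
    "CC0-1.0",        # public domain dedication
    "CC-BY-4.0",      # attribution only
    "CC-BY-SA-4.0",   # attribution + share-alike
    "MIT",
    "Public-Domain",  # e.g., expired copyright
}

def is_trainable(doc: dict) -> bool:
    """Return True only when a document has explicit, open license metadata.

    Documents with missing or ambiguous license information are rejected
    and routed to human review rather than assumed usable.
    """
    license_tag = doc.get("license")
    if license_tag is None:
        return False  # unknown provenance: needs a human annotator
    return license_tag in OPEN_LICENSES

# Hypothetical corpus records, for illustration only.
corpus = [
    {"id": "loc-book-00017", "license": "Public-Domain", "text": "..."},
    {"id": "blog-post-4821", "license": None, "text": "..."},  # ambiguous
    {"id": "wiki-article-9", "license": "CC-BY-SA-4.0", "text": "..."},
]

usable = [doc for doc in corpus if is_trainable(doc)]
flagged_for_review = [doc for doc in corpus if not is_trainable(doc)]
print(f"{len(usable)} usable, {len(flagged_for_review)} need manual review")
```

In practice the hard part is the `None` branch: much of the scraped material carries no machine-readable license at all, which is why, as Biderman says, everything ultimately had to be annotated and checked by people.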

AI firms say they can't respect copyright. These researchers tried.

Washington Post · 2 days ago

Happy Thursday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to:

As the policy debate over AI and fair use heats up, a new paper suggests there's a more transparent — if time-consuming — alternative to slurping up web content without permission.

Top artificial intelligence companies argue that it's impossible to build today's powerful large language models — the GPT in ChatGPT — unless they can freely scrape copyrighted materials from the internet to train their AI systems. But few AI developers have tried the more ethical route — until now.

A group of more than two dozen AI researchers has found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset's quality by using it to train a 7-billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023.

A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate. The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools.

That could have implications for the policy debate swirling around AI and copyright. The paper itself does not take a position on whether scraping text to train AI is fair use. That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K. On Wednesday, Reddit said it was suing Anthropic, alleging that it accessed data from the social media discussion board without a licensing agreement, according to The Wall Street Journal. The same day, the U.K.'s House of Commons offered concessions on a controversial bill that would allow AI companies to train on copyrighted material. These moves follow President Donald Trump's firing last month of the head of the U.S. Copyright Office, Shira Perlmutter. Her ouster brought more attention to the office's recent report on AI, which cast doubt on fair use applying to copyrighted works in generative AI.

AI companies and their investors, meanwhile, have long argued that a better way is not feasible. In April 2023, Sy Damle, a lawyer representing the venture capital firm Andreessen Horowitz, told the U.S. Copyright Office: 'The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.' Later that year, in comments to the U.K. government, OpenAI said, '[I]t would be impossible to train today's leading AI models without using copyrighted materials.' And in January 2024, Anthropic's expert witness in a copyright trial asserted that 'the hypothetical competitive market for licenses covering data to train cutting-edge LLMs would be impracticable,' court documents show.

While AI policy papers often discuss the need for more open data, and experts argue about whether large language models should be trained on licensed data from publishers, there has been little effort to put theory into action, the paper's co-author Aviya Skowron, head of policy at the nonprofit research institute EleutherAI, told The Post.
'I would also like those people to get curious about what this task actually entails,' Skowron said.

As it turns out, the task involves a lot of humans. That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website — a daunting prospect when the industry is rife with improperly licensed data.

'This isn't a thing where you can just scale up the resources that you have available,' like access to more computer chips and a fancy web scraper, said Stella Biderman, EleutherAI's executive director. 'We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard.'

Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English-language books in the Library of Congress, which is nearly double the size of Project Gutenberg, the popular books dataset. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning.

EleutherAI pioneered an analogous open-source effort in 2020, creating an often-cited dataset called the Pile. A site that hosted the dataset had to take it down in 2023 after a Digital Millennium Copyright Act request from the Danish anti-piracy group Rights Alliance, which targeted the fact that the Pile contained Books3, a dataset of books that Meta is being sued over.

The new dataset is called Common Pile v0.1, and the model is called Comma v0.1 — a deliberate reference to the group's belief that they will be able to find more openly licensed or public domain text that can then be used to train bigger models. Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models.

The group of authors represented 14 different institutions, including MIT, CMU and the University of Toronto, as well as nonprofits such as the Vector Institute and the Allen Institute for Artificial Intelligence.

Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind back to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on. 'Even partial transparency has a huge amount of social value and a moderate amount of scientific value,' she said.
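For readers who want to poke at a corpus this size, streaming is the practical route. A minimal sketch using the Hugging Face `datasets` library follows; the repository ID and the 'text' field name are assumptions for illustration, not identifiers confirmed by the article:

```python
# Minimal sketch: streaming an openly licensed corpus with the Hugging Face
# `datasets` library. The repository ID below is a HYPOTHETICAL placeholder;
# check the Common Pile v0.1 release for the actual identifier.
from datasets import load_dataset

REPO_ID = "common-pile/common-pile-v0.1"  # assumed ID, for illustration only

# Stream instead of downloading: an 8 TB corpus won't fit on most disks.
dataset = load_dataset(REPO_ID, split="train", streaming=True)

# Peek at the first few records. A "text" field is conventional for LLM
# pretraining corpora, but the actual schema may differ.
for i, example in enumerate(dataset):
    print(example.get("text", "")[:200])
    if i == 2:
        break
```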
  • Musk rails against Trump tax bill, calling it a 'disgusting abomination' (Jacob Bogage and Theodoric Meyer)
  • Federal judge blocks Florida from enforcing social media ban for kids while lawsuit continues (Associated Press)
  • Apple and Alibaba's AI rollout in China delayed by Trump trade war (Financial Times)
  • Trump renegotiating Biden-era Chips Act grants, Lutnick says (Reuters)
  • US removes 'safety' from AI Safety Institute (The Verge)
  • 5 AI bots took our tough reading test. One was smartest — and it wasn't ChatGPT (Geoffrey A. Fowler)
  • You are hardwired to blindly trust AI. Here's how to fight it. (Shira Ovide)
  • Reddit sues Anthropic, alleges unauthorized use of site's data (Wall Street Journal)
  • Amazon to invest $10 billion in North Carolina to expand cloud, AI infrastructure (Reuters)
  • Germans are buying more electric cars, but not Teslas (New York Times)
  • Google warns hackers stealing Salesforce data from companies (Bloomberg)
  • Chinese hacked US Telecom a year before known wireless breaches (Bloomberg)
  • ChatGPT can now read your Google Drive and Dropbox (The Verge)
  • Google DeepMind's CEO thinks AI will make humans less selfish (Wired)
  • The creatives and academics rejecting AI — at work and at home (The Guardian)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!
