Latest news with Will Oremus


Washington Post · Business · 4 days ago
AI firms say they can't respect copyright. These researchers tried.
Happy Thursday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to:

AI firms say they can't respect copyright. These researchers tried.

As the policy debate over AI and fair use heats up, a new paper suggests there's a more transparent — if time-consuming — alternative to slurping up web content without permission.

Top artificial intelligence companies argue that it's impossible to build today's powerful large language models — the GPT in ChatGPT — unless they can freely scrape copyrighted materials from the internet to train their AI systems. But few AI developers have tried the more ethical route — until now.

A group of more than two dozen AI researchers found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset's quality by using it to train a 7-billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023.

A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate.

The group built an AI model that is significantly smaller than the latest models behind OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools. That could have implications for the policy debate swirling around AI and copyright.

The paper itself does not take a position on whether scraping text to train AI is fair use. That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K.

On Wednesday, Reddit said it was suing Anthropic, alleging that it accessed data from the social media discussion board without a licensing agreement, according to The Wall Street Journal. The same day, the U.K.'s House of Commons offered concessions on a controversial bill that would allow AI companies to train on copyrighted material.

These moves follow President Donald Trump's firing last month of the head of the U.S. Copyright Office, Shira Perlmutter. Her ouster brought more attention to the office's recent report on AI, which cast doubt on fair use applying to copyrighted works in generative AI.

AI companies and their investors, meanwhile, have long argued that a better way is not feasible. In April 2023, Sy Damle, a lawyer representing the venture capital firm Andreessen Horowitz, told the U.S. Copyright Office: 'The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.' Later that year, in comments to the U.K. government, OpenAI said, '[I]t would be impossible to train today's leading AI models without using copyrighted materials.' And in January 2024, Anthropic's expert witness in a copyright trial asserted that 'the hypothetical competitive market for licenses covering data to train cutting-edge LLMs would be impracticable,' court documents show.

While AI policy papers often discuss the need for more open data, and experts argue about whether large language models should be trained on licensed data from publishers, there has been little effort to put theory into action, the paper's co-author, Aviya Skowron, head of policy at the nonprofit research institute EleutherAI, told The Post.
'I would also like those people to get curious about what this task actually entails,' Skowron said.

As it turns out, the task involves a lot of humans. That's because of the technical challenge of working with data that isn't formatted in a machine-readable way, as well as the legal challenge of figuring out which license applies to which website — a daunting prospect when the industry is rife with improperly licensed data.

'This isn't a thing where you can just scale up the resources that you have available,' like access to more computer chips and a fancy web scraper, said Stella Biderman, EleutherAI's executive director. 'We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard.'

Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English-language books in the Library of Congress, nearly double the size of Project Gutenberg, the popular books dataset. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning.

EleutherAI pioneered an analogous open-source effort in 2020, creating an often-cited dataset called the Pile. A site that hosted the dataset had to take it down in 2023 after a Digital Millennium Copyright Act request from the Danish anti-piracy group Rights Alliance, which targeted the fact that the Pile contained Books3, a dataset of books that Meta is being sued over.

The new dataset is called Common Pile v0.1, and the model is called Comma v0.1 — a deliberate reference to the group's belief that it will be able to find more openly licensed or public-domain text that can be used to train bigger models. Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models.

The group of authors represented 14 different institutions, including MIT, CMU and the University of Toronto, as well as nonprofits such as the Vector Institute and the Allen Institute for Artificial Intelligence.

Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on. 'Even partial transparency has a huge amount of social value and a moderate amount of scientific value,' she said.

- Musk rails against Trump tax bill, calling it a 'disgusting abomination' (Jacob Bogage and Theodoric Meyer)
- Federal judge blocks Florida from enforcing social media ban for kids while lawsuit continues (Associated Press)
- Apple and Alibaba's AI rollout in China delayed by Trump trade war (Financial Times)
- Trump renegotiating Biden-era Chips Act grants, Lutnick says (Reuters)
- US removes 'safety' from AI Safety Institute (The Verge)
- 5 AI bots took our tough reading test. One was smartest — and it wasn't ChatGPT (Geoffrey A. Fowler)
- You are hardwired to blindly trust AI. Here's how to fight it. (Shira Ovide)
- Reddit sues Anthropic, alleges unauthorized use of site's data (Wall Street Journal)
- Amazon to invest $10 billion in North Carolina to expand cloud, AI infrastructure (Reuters)
- Germans are buying more electric cars, but not Teslas (New York Times)
- Google warns hackers stealing Salesforce data from companies (Bloomberg)
- Chinese hacked US Telecom a year before known wireless breaches (Bloomberg)
- ChatGPT can now read your Google Drive and Dropbox (The Verge)
- Google DeepMind's CEO thinks AI will make humans less selfish (Wired)
- The creatives and academics rejecting AI — at work and at home (The Guardian)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!


Washington Post · Business · May 1, 2025
Google's Pichai says Justice Dept. proposals would cripple its business
Happy Thursday! Julian Mark here, joining Will Oremus on today's Tech Brief after attending the Google trial at D.C.'s federal courthouse yesterday. Send news tips to:

Below: The GOP walks back its bid to strip the FTC of antitrust authority.

But first: Google's Pichai says DOJ's breakup plan would derail its business.


Washington Post · Politics · April 29, 2025
Researchers call for new way of thinking about content moderation
Happy Tuesday! I'm Jeremy Merrill, stepping in for my colleague Will Oremus on today's Tech Brief. Send news tips to:

Below: Congress passes a revenge porn law, but some advocates are left frustrated.

But first: Researchers call for new way of thinking about content moderation

Facebook's loosening of its content moderation standards early this year got lots of attention and criticism. But a new study suggests that it might matter less what is taken down than when. The research finds that Facebook posts removed for violating standards or other reasons have already been seen by at least three-quarters of the people who would be predicted to ever see them.

'Content takedowns on Facebook just don't matter all that much, because of how long they take to happen,' said Laura Edelson, an assistant professor of computer science at Northeastern University and the lead author of the paper in the Journal of Online Trust and Safety.

Social media platforms generally measure how many bad posts they have taken down as an indication of their efforts to suppress harmful or illegal material. The researchers advocate a new metric: How many people were prevented from seeing a bad post because Facebook took it down?

To measure the 'prevented dissemination' from Facebook's content moderation, the researchers collected more than 1.7 million posts from U.S. news-focused Facebook pages and identified the approximately 13,000 of them that had been taken down. 'Removed content we saw was mostly garden-variety spam — ads for financial scams, [multilevel marketing] schemes, that kind of thing,' Edelson said.

The predominance of spam and scam content puts the heated public dispute over content moderation and online censorship in context. Civil rights groups have criticized Meta, Facebook's owner, for not taking down enough posts. The Anti-Defamation League last year found that only a tiny fraction of antisemitic posts were removed when reported from a regular Facebook user account. Leanna Garfield of the LGBTQ+ advocacy organization GLAAD said: 'Meta takes anywhere from several days to weeks to sometimes even months' to review posts with 'anti-LGBTQ slurs and posts promoting violence.'

Before this year's changes, free-expression advocates, in contrast, criticized the company for taking down too much content. 'Arbitrary and unfair enforcement practices … reduce users' confidence both in platforms and in the state of free expression online,' wrote the Foundation for Individual Rights and Expression. Meta CEO Mark Zuckerberg agreed, saying that while there is 'a lot of legitimately bad stuff out there,' the platform's previous policy caused 'too many mistakes and too much censorship.'

The new research is a reminder that platforms inadvertently host lots of posts that everyone agrees are bad. The company has said that content that violates its rules is only a tiny fraction of all content on the platform, and that it blocks a lot of bad content before it's successfully posted. It also says it puts substantial effort into stopping fraud and scams.

Edelson and her colleagues at the research group Cybersecurity for Democracy focused not on whether a removal was justified, but on when posts were removed. Her group identified all posts from about 10,000 U.S. news pages in July 2023 and collected data on the number of times the posts were liked, commented on or otherwise engaged with, every six hours for two days. Popular posts got half of their engagement in less than eight hours, they found.
But the typical bad post was only removed after more than 20 hours. After creating a machine learning model that predicts how many engagements a post would likely have after two days, the researchers concluded that Facebook's removals prevented at most only 24 percent of the removed posts' predicted engagement — probably much less. (The paper uses engagement as a proxy for views, because Meta doesn't publish view counts for Facebook posts. Facebook also does not disclose why a post was taken down, so it's possible that some of the removed posts were deleted by the poster.)

Facebook has a hard job finding bad posts amid the sea of acceptable ones, and it's all the harder to do so quickly. Edelson, who has collaborated with The Washington Post on a project analyzing TikTok users' feeds, suggests a tighter focus would help: Facebook could better predict which posts its algorithms will show to lots of people and have moderators 'prioritize posts with high predicted future views.' That may not be a silver bullet, but it could help Facebook reduce the amount of bad content it shows to users.

Congress passes a revenge porn law, but some advocates are left frustrated

The internet is about to get a new federal online safety law, as your usual host Will Oremus reported Monday. But not all online safety advocates are overjoyed.

The House of Representatives voted 409-2 on Monday evening to pass the Take It Down Act, which President Donald Trump has already indicated he plans to sign. The act criminalizes the publication of nonconsensual intimate imagery, or NCII, including revenge porn and AI deepfake nudes, and requires online platforms to take it down within 48 hours of a valid report.

The vote followed the bill's unanimous passage in February in the Senate, where it was coauthored by Sens. Ted Cruz (R-Texas) and Amy Klobuchar (D-Minnesota). There was no debate on the House floor, where it passed under an expedited process called suspension of the rules that requires a two-thirds majority.

The bill had support from an unusually wide swath of the political spectrum, from First Lady Melania Trump to Columbia law professor Tim Wu. And it thrilled advocates who have experienced the nightmare of NCII themselves, among them Elliston Berry, who was 14 when classmates distributed deepfake nudes of her on Snapchat. 'With the passage of the TAKE IT DOWN Act, we can protect future generations from having to experience the pain I went through,' Berry said in a statement.

Still, a contingent of tech policy wonks, including free expression advocates and digital rights groups, was left frustrated. Among them was Mary Anne Franks, president of the Cyber Civil Rights Initiative (CCRI), who has long been a leading advocate of a federal law criminalizing revenge porn. In a statement Monday, CCRI called the criminalization of revenge porn 'long overdue' but said it has 'serious concerns about the constitutionality, efficacy, and potential misuse' of the provision in the Take It Down Act that requires online platforms to remove reported content within 48 hours. Those provisions, the group argued, are 'likely to be selectively and improperly misused for political or ideological purposes that endanger the very communities most affected by image-based sexual abuse.'

- Wikipedia's nonprofit status questioned by D.C. U.S. attorney (Will Oremus and Julian Mark)
- Government customer service shake-ups have the less tech-savvy on edge (Heather Kelly)
- Elon Musk had the government in his grasp. Then it unraveled. (Dan Diamond, Faiz Siddiqui, Trisha Thadani and Jeff Stein)
- Tech tips for Defense Secretary Pete Hegseth (and everyone else) (Heather Kelly)
- Congress passes Take It Down Act to fight deepfake nudes, revenge porn (Will Oremus)
- Critics fear the Trump administration could weaponize the Take It Down Act (The Verge)
- Meta's 'Digital Companions' Will Talk Sex With Users — Even Children (Wall Street Journal)
- Researchers secretly ran a massive, unauthorized AI persuasion experiment on Reddit users (404 Media)
- Wall Street banks sell final slug of Elon Musk's X debt (Wall Street Journal)
- China's Huawei develops new AI chip, seeking to match Nvidia (Wall Street Journal)
- ChatGPT goes shopping with new product-browsing feature (Ars Technica)
- Duolingo will replace contract workers with AI (The Verge)
- These autistic people struggled to make sense of others. Then they found AI. (Andrea Jiménez)
- The group chats that changed America (Semafor)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!


Washington Post · Business · April 22, 2025
AI tools mostly fumble basic financial tasks, study finds
Happy Tuesday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to:

AI tools mostly fumble basic financial tasks, study finds

There's no shortage of tech leaders predicting that AI will replace humans, fulfilling even complex tasks with speed and accuracy.


Washington Post · Health · March 25, 2025
Just being well is not enough at Bryan Johnson's Don't Die Summit
Happy Tuesday! I'm Lizza Dwoskin, stepping in for Will Oremus. I'm still unscrambling my brain cells after a weekend experiencing a highly engineered version of humanity's future. Send news tips to:

Just being well is not enough at Bryan Johnson's Don't Die Summit

Alek Ivanov, president of an HVAC company in Pennsylvania, gives weekly pep talks to his employees — energy drink-chugging air-conditioning repairmen, he calls them — on topics like health and longevity. He has a lot more specific advice to dole out after spending the weekend surrounded by fellow 'biohackers' at the Don't Die Summit.