
AI systems 'ignorant' of sensitive data can be safer, but still smart

Washington Post

9 hours ago

  • Science
  • Washington Post

AI systems 'ignorant' of sensitive data can be safer, but still smart

Happy Tuesday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI via Signal to: nitasha.10

Restricting the information diet of AI software could make it safer. Tech companies including OpenAI and Google have told lawmakers and courts that they must be allowed to grab as much online data as possible to create cutting-edge artificial intelligence systems. New research suggests that screening the information shoved into machine learning algorithms could make it easier to tackle safety concerns about AI. The findings could provide ammunition to regulators who want AI companies to be more transparent and accountable for the choices executives make around the vast troves of data powering generative AI.

The research was a collaboration between the British government's AI Security Institute and the nonprofit lab Eleuther AI. The researchers found that filtering the material used to train an AI system to remove key concepts can reduce its ability to help a user work on biohazards, like a novel bioweapon. And that remedy didn't broadly reduce the system's overall capabilities.

To test their technique, dubbed 'deep ignorance,' the researchers trained multiple versions of open-source AI software for text called Pythia-6.9B, developed by Eleuther. Some were built with copies of a standard dataset of online text that had been filtered to remove potentially hazardous information such as research on enhanced pandemic pathogens, bioterrorism and dual-use virology.

In the tests, versions of the AI software built on filtered data scored better on benchmarks designed to test AI capabilities around biorisks. Further experiments showed this didn't come at the cost of reducing the overall performance of the AI system or its performance on high-school biology questions, although there was a slight reduction in accuracy on college-level biology questions. The researchers say their methods are not overly burdensome and that their filtering required a less than 1 percent increase in the computing power used to create an AI model.

Openly released AI models can be used and modified by anyone, making them hard to monitor or control. But the researchers say their data-filtering technique made it significantly harder to tweak a completed AI model to specialize in bioweapons.

The results suggest policymakers may need to question one of the AI industry's long-established narratives. Major AI companies have consistently argued that because recent breakthroughs in AI that yielded products including ChatGPT came from training algorithms on more data, the datasets are too colossal to fully document or filter, and that removing data will make models less useful. The argument goes that safety efforts have to largely focus on adjusting the behavior of AI systems after they have been created.

'Companies sell their data as unfathomably large and un-documentable,' said Eleuther's executive director, Stella Biderman, who spearheaded the project. 'Questioning the design decisions that go into creating models is heavily discouraged.'

Demonstrating the effects of filtering massive datasets could prompt demands that AI developers use a similar approach to tackle other potential harms of AI, like nonconsensual intimate imagery, Biderman said. She warned that the study's approach probably worked best in domains like nuclear weapons, where specialized data can be removed without touching general information.
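The article doesn't spell out the researchers' filtering pipeline, but the general idea — scan each document in a pretraining corpus and drop those that touch blocklisted topics before any training happens — can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not the study's method: the blocklist patterns (drawn from the topics named above), the file names and the match threshold are illustrative assumptions.

```python
# Minimal sketch of blocklist-based pretraining-data filtering, in the spirit
# of the 'deep ignorance' experiments described above. NOT the researchers'
# actual pipeline: terms, file names and threshold are illustrative.
import json
import re

# Proxy patterns for the domain being removed (biorisk topics named in the
# article); a real effort would use curated term lists or trained classifiers.
BLOCK_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"enhanced pandemic pathogen",
        r"dual[- ]use virology",
        r"bioterrorism",
    )
]
MATCH_THRESHOLD = 2  # drop a document if it matches this many distinct patterns


def is_hazardous(text: str) -> bool:
    """Return True if enough blocklisted patterns appear in the document."""
    hits = sum(1 for pattern in BLOCK_PATTERNS if pattern.search(text))
    return hits >= MATCH_THRESHOLD


def filter_corpus(in_path: str, out_path: str) -> None:
    """Copy a JSONL corpus, keeping only documents that pass the filter."""
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            doc = json.loads(line)
            if is_hazardous(doc.get("text", "")):
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    print(f"kept {kept} documents, dropped {dropped}")


if __name__ == "__main__":
    # Hypothetical file names: the filtered corpus would then feed an
    # otherwise-standard pretraining run.
    filter_corpus("pretraining_corpus.jsonl", "pretraining_corpus_filtered.jsonl")
```

Models trained on the filtered and unfiltered copies of the corpus could then be compared on biorisk benchmarks and on general-capability benchmarks, which is the comparison the researchers describe.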
Some AI companies have said they already filter training data to improve safety. In reports issued by OpenAI last week about the safety of its most recent AI releases, the ChatGPT maker said it filtered some harmful content out of the training data. For its open source model, GPT-OSS, that included removing content related to 'hazardous biosecurity knowledge.' For its flagship GPT-5 release, the company said its efforts included using 'advanced data filtering' to reduce the amount of personal information in its training data.

But the company has not offered details about what that filtering involved or what data it removed, making it difficult for outsiders to check or build on its work. In response to questions, OpenAI cited the two safety testing reports.

Biderman said Eleuther is already starting to explore how to demonstrate safety techniques that are more transparent than existing efforts, which she said are 'not that hard to remove.'

• Trump's chip deal sets new pay-to-play precedent for U.S. exporters (Gerrit De Vynck and Jacob Bogage)
• Nvidia, AMD agree to pay U.S. government 15% of AI chip sales to China (Eva Dou and Grace Moon)
• Intel CEO to visit White House on Monday, source says (Reuters)
• Brazil kept tight rein on Big Tech. Trump's tariffs could change that. (New York Times)
• Top aide to Trump and Musk seeks even greater influence as a podcaster (Tatum Hunter)
• New chatbot on Trump's Truth Social platform keeps contradicting him (Drew Harwell)
• End is near for the landline-based service that got America online in the '90s (Ben Brasch)
• Meta makes conservative activist an AI bias advisor following lawsuit (The Verge)
• GitHub CEO Thomas Dohmke to step down, plans new startup (Reuters)
• Reddit blocks Internet Archive to end sneaky AI scraping (Ars Technica)
• Why A.I. should make parents rethink posting photos of their children online (New York Times)
• Wikipedia loses UK Safety Act challenge, worries it will have to verify user IDs (Ars Technica)
• These workers don't fear artificial intelligence. They're getting degrees in it. (Danielle Abril)
• Labor unions mobilize to challenge advance of algorithms in workplaces (Danielle Abril)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!

An ambitious new project aims to win back the U.S. lead in open-source AI from China

Washington Post

05-08-2025

  • Business
  • Washington Post

An ambitious new project aims to win back the U.S. lead in open-source AI from China

Happy Tuesday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI via Signal to: nitasha.10

An ambitious new project aims to win back the U.S. lead in open-source AI technology from China. Leaders in Silicon Valley and Washington often say the United States must beat China in artificial intelligence to protect its economy and national security. But U.S. companies are falling behind their Chinese counterparts in one part of the multifaceted AI race.

Silicon Valley bet on Trump. It's starting to pay off.

Washington Post

24-07-2025

  • Business
  • Washington Post

Silicon Valley bet on Trump. It's starting to pay off.

Happy Thursday! I'm Margot Amouyal, a news intern at The Washington Post, rounding up this week's top tech news with help from Andrea Jiménez. Don't forget to send news tips to my colleague Will Oremus at

A big week for tech leaders in Washington

As the White House on Wednesday revealed its plan to help the United States lead a global race to develop artificial intelligence, President Donald Trump signed three executive orders intended to boost the American tech sector, our colleagues Cat Zakrzewski and Hannah Natanson report. Together, the actions will facilitate exports of U.S. technologies and boost the build-out of data centers — advancing the agenda of executives and investors seeking to cash in on an AI gold rush.

Trump announced the plan at an event co-hosted by the Hill and Valley Forum, an influential interest group founded by tech leaders, and 'All-In,' a popular Silicon Valley podcast co-hosted by White House AI and crypto czar David Sacks. 'America must once again be a country where innovators are rewarded with a green light, not strangled with red tape,' Trump said to an audience of administration officials and executives, including Nvidia CEO Jensen Huang and tech investor Chamath Palihapitiya. The tech leaders cheered as Trump discussed executive orders intended to combat excessive regulation.

Administration officials later attended an after-party organized by the Hill and Valley Forum's co-founders at the upscale, members-only Ned's Club in D.C., according to an invitation viewed by The Washington Post. OpenAI CEO Sam Altman was slated to speak briefly at the party, according to a person familiar with the plans who spoke on the condition of anonymity to discuss the private event. (The Post has a content partnership with OpenAI.)

Trump has flaunted his administration's connections to the industry as a display of innovation and economic power. But consumer advocates warn that industries should not be able to write their own rules, amid concerns that AI could kill jobs, harm the environment and exacerbate existing social biases.

Meanwhile, more details are emerging about the global breach of Microsoft server software. Washington Post reporters Ellen Nakashima, Joseph Menn and Carolyn Y. Johnson have reported that the National Institutes of Health was among the targets in the breach. An investigation is underway to assess the scope and severity of the attack. The National Nuclear Security Administration, the federal agency responsible for securing the nation's nuclear weapons, including 5,000 warheads, also was targeted. A person familiar with the matter said no classified information was exposed in the breach.

Hackers with connections to the Chinese government are behind at least some of the global Microsoft server breaches, particularly in its SharePoint system, which is used to coordinate work on documents and projects, Menn and Nakashima reported.

• Hegseth Signal messages came from email classified 'SECRET,' watchdog told (Dan Lamothe and John Hudson)
• Two FTC commissioners are turning their firings into a resistance tour (Politico)
• How Trump's war on clean energy is making AI a bigger polluter (The Verge)
• Trade group asks Supreme Court to limit Mississippi's social media law (The Hill)
• YouTube Shorts is adding an image-to-video AI tool, new AI effects (TechCrunch)
• Amazon shuts down Shanghai AI research lab (Financial Times)
• Meta updates safety features for teens. More than 600,000 accounts linked to predatory behavior (CNBC)
• Microsoft poaches top Google DeepMind staff in AI talent war (Financial Times)
• Tesla earnings show ongoing fallout from Musk's breakup with Trump (Trisha Thadani and Faiz Siddiqui)
• U.K. regulator seeks special status for Apple and Google that could mandate changes for Big Tech (Associated Press)
• Uber tests option in the U.S. to match female riders and drivers (Bloomberg)
• Trump administration leans in on memes, AI and MAGA messaging online (NBC News)
• Teens say they are turning to AI for friendship (Associated Press)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!

AI firms say they can't respect copyright. These researchers tried.

Washington Post

05-06-2025

  • Business
  • Washington Post

AI firms say they can't respect copyright. These researchers tried.

Happy Thursday! I'm Nitasha Tiku, The Washington Post's tech culture reporter, filling in for Will Oremus on today's Tech Brief. Send tips about AI to:

AI firms say they can't respect copyright. These researchers tried.

As the policy debate over AI and fair use heats up, a new paper suggests there's a more transparent — if time-consuming — alternative to slurping up web content without permission. Top artificial intelligence companies argue that it's impossible to build today's powerful large language models — the GPT in ChatGPT — unless they can freely scrape copyrighted materials from the internet to train their AI systems. But few AI developers have tried the more ethical route — until now.

A group of more than two dozen AI researchers has found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in the public domain. They tested the dataset's quality by using it to train a 7-billion-parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023. A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate.

The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools.

That could have implications for the policy debate swirling around AI and copyright. The paper itself does not take a position on whether scraping text to train AI is fair use. That debate has reignited in recent weeks with a high-profile lawsuit and dramatic turns around copyright law and enforcement in both the U.S. and U.K.

On Wednesday, Reddit said it was suing Anthropic, alleging that it accessed data from the social media discussion board without a licensing agreement, according to The Wall Street Journal. The same day, the U.K.'s House of Commons offered concessions on a controversial bill that would allow AI companies to train on copyrighted material. These moves follow President Donald Trump's firing last month of the head of the U.S. Copyright Office, Shira Perlmutter. Her ouster brought more attention to the office's recent report on AI, which cast doubt on whether fair use applies to copyrighted works used in generative AI.

AI companies and their investors, meanwhile, have long argued that a better way is not feasible. In April 2023, Sy Damle, a lawyer representing the venture capital firm Andreessen Horowitz, told the U.S. Copyright Office: 'The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data.' Later that year, in comments to the U.K. government, OpenAI said, '[I]t would be impossible to train today's leading AI models without using copyrighted materials.' And in January 2024, Anthropic's expert witness in a copyright trial asserted that 'the hypothetical competitive market for licenses covering data to train cutting-edge LLMs would be impracticable,' court documents show.

While AI policy papers often discuss the need for more open data and experts argue about whether large language models should be trained on licensed data from publishers, there has been little effort to put theory into action, the paper's co-author, Aviya Skowron, head of policy at the nonprofit research institute Eleuther AI, told The Post.
'I would also like those people to get curious about what this task actually entails,' Skowron said.

As it turns out, the task involves a lot of humans. That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website, a daunting prospect when the industry is rife with improperly licensed data. 'This isn't a thing where you can just scale up the resources that you have available' like access to more computer chips and a fancy web scraper, said Stella Biderman, Eleuther AI's executive director. 'We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard.'

Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English-language books in the Library of Congress, which is nearly double the size of the popular books dataset Project Gutenberg. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as FineWeb from Hugging Face, the open-source repository for machine learning.

Eleuther AI pioneered an analogous open-source effort in 2020, creating an often-cited dataset called the Pile. A site that hosted the dataset had to take it down in 2023 after a Digital Millennium Copyright Act request from the Danish anti-piracy group Rights Alliance, which targeted the fact that the Pile contained Books3, a dataset of books that Meta is being sued over.

The new dataset is called Common Pile v0.1, and the model is called Comma v0.1 — a deliberate reference to the group's belief that they will be able to find more text that is openly licensed or in the public domain that can then be used to train bigger models. Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models.

The group of authors represented 14 different institutions, including MIT, CMU and the University of Toronto, as well as other nonprofits such as the Vector Institute and the Allen Institute for Artificial Intelligence. Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on. 'Even partial transparency has a huge amount of social value and a moderate amount of scientific value,' she said.

• Musk rails against Trump tax bill, calling it a 'disgusting abomination' (Jacob Bogage and Theodoric Meyer)
• Federal judge blocks Florida from enforcing social media ban for kids while lawsuit continues (Associated Press)
• Apple and Alibaba's AI rollout in China delayed by Trump trade war (Financial Times)
• Trump renegotiating Biden-era Chips Act grants, Lutnick says (Reuters)
• US removes 'safety' from AI Safety Institute (The Verge)
• 5 AI bots took our tough reading test. One was smartest — and it wasn't ChatGPT (Geoffrey A. Fowler)
• You are hardwired to blindly trust AI. Here's how to fight it. (Shira Ovide)
• Reddit sues Anthropic, alleges unauthorized use of site's data (Wall Street Journal)
• Amazon to invest $10 billion in North Carolina to expand cloud, AI infrastructure (Reuters)
• Germans are buying more electric cars, but not Teslas (New York Times)
• Google warns hackers stealing Salesforce data from companies (Bloomberg)
• Chinese hacked US Telecom a year before known wireless breaches (Bloomberg)
• ChatGPT can now read your Google Drive and Dropbox (The Verge)
• Google DeepMind's CEO thinks AI will make humans less selfish (Wired)
• The creatives and academics rejecting AI — at work and at home (The Guardian)

That's all for today — thank you so much for joining us! Make sure to tell others to subscribe to the Tech Brief. Get in touch with Will (via email or social media) for tips, feedback or greetings!
