
Nvidia Bets Big on Synthetic Data
Mar 19, 2025 11:27 AM

Nvidia has acquired synthetic data startup Gretel to bolster the AI training data used by the chip maker's customers and developers.

Photograph caption: Nvidia CEO Jensen Huang addresses participants at the keynote of CES 2025 in Las Vegas, Nevada.

Nvidia has acquired synthetic data firm Gretel for nine figures, according to two people with direct knowledge of the deal.
The acquisition price exceeds Gretel's most recent valuation of $320 million, the sources say, though the exact terms of the purchase remain unknown. Gretel and its team of approximately 80 employees will be folded into Nvidia, where its technology will be deployed as part of the chip giant's growing suite of cloud-based, generative AI services for developers.
The acquisition comes as Nvidia has been rolling out synthetic data generation tools, so that developers can train their own AI models and fine-tune them for specific apps. In theory, synthetic data could create a near-infinite supply of AI training data and help solve the data scarcity problem that has been looming over the AI industry since ChatGPT went mainstream in 2022—although experts say using synthetic data in generative AI comes with its own risks.
A spokesperson for Nvidia declined to comment.
Gretel was founded in 2019 by Alex Watson, John Myers, and Ali Golshan, who also serves as CEO. The startup offers a synthetic data platform and a suite of APIs to developers who want to build generative AI models but don't have access to enough training data or have privacy concerns around using real people's data. Gretel doesn't build and license its own frontier AI models; instead, it fine-tunes existing open source models to add differential privacy and safety features, then packages those together to sell them. The company raised more than $67 million in venture capital funding prior to the acquisition, according to PitchBook.
Gretel also did not immediately respond to a request for comment from WIRED.
Unlike human-generated or real-world data, synthetic data is computer-generated and designed to mimic real-world data. Proponents say this makes the data generation required to build AI models more scalable, less labor-intensive, and more accessible to smaller or less-resourced AI developers. Privacy protection is another key selling point, making synthetic data an appealing option for health care providers, banks, and government agencies.
Nvidia has already been offering synthetic data tools for developers for years. In 2022 it launched Omniverse Replicator, which gives developers the ability to generate custom, physically accurate, synthetic 3D data to train neural networks. Last June, Nvidia began rolling out a family of open AI models that generate synthetic training data for developers to use in building or fine-tuning LLMs. Called Nemotron-4 340B, these mini-models can be used by developers to drum up synthetic data for their own LLMs across 'health care, finance, manufacturing, retail, and every other industry.'
During his keynote presentation at Nvidia's annual developer conference this Tuesday, Nvidia cofounder and chief executive Jensen Huang spoke about the challenges the industry faces in rapidly scaling AI in a cost-effective way.
'There are three problems that we focus on,' he said. 'One, how do you solve the data problem? How and where do you create the data necessary to train the AI? Two, what's the model architecture? And then three, what are the scaling laws?' Huang went on to describe how the company is now using synthetic data generation in its robotics platforms.
Synthetic data can be used in at least a couple of different ways, says Ana-Maria Cretu, a postdoctoral researcher at the École Polytechnique Fédérale de Lausanne in Switzerland who studies synthetic data privacy. It can take the form of tabular data, such as demographic or medical records, generated to ease data scarcity or to make a dataset more diverse.
Cretu gives an example: If a hospital wants to build an AI model to track a certain type of cancer, but is working with a small data set from 1,000 patients, synthetic data can be used to fill out the data set, eliminate biases, and anonymize data from real humans. 'This also offers some privacy protection, whenever you cannot disclose the real data to a stakeholder or software partner,' Cretu says.
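The hospital scenario can be sketched in a few lines of Python. Everything here is hypothetical (the records, the two-column schema, the independent normal fit per column); real synthetic data platforms model cross-column correlations and add differential-privacy noise, which this toy deliberately skips.

```python
import random
import statistics

# Hypothetical patient records: (age, tumor_size_mm). A real dataset would
# be far richer; these rows exist only to illustrate the idea.
real_patients = [(52, 14.2), (61, 22.7), (47, 9.8), (58, 18.1), (66, 25.4)]

def synthesize(rows, n):
    """Generate n synthetic rows by sampling each column independently
    from a normal distribution fit to the real column. This is a toy:
    it preserves per-column means and spreads, nothing more."""
    cols = list(zip(*rows))
    fits = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
    return [tuple(random.gauss(mu, sd) for mu, sd in fits) for _ in range(n)]

random.seed(1)
synthetic = synthesize(real_patients, 1000)  # grow 5 records into 1,000
```

The synthetic rows share the real columns' statistics without corresponding to any actual patient, which is the core of the privacy pitch.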
But in the world of large language models, Cretu adds, synthetic data has also become something of a catchall phrase for 'How can we just increase the amount of data we have for LLMs over time?'
Experts worry that, in the not-so-distant future, AI companies won't be able to gorge as freely on human-created internet data in order to train their AI models. Last year, a report from MIT's Data Provenance Initiative showed that restrictions around open web content were increasing.
Synthetic data in theory could provide an easy solution. But a July 2024 article in Nature highlighted how AI language models could 'collapse,' or degrade significantly in quality, when they're fine-tuned over and over again with data generated by other models. Put another way, if you feed the machine nothing but its own machine-generated output, it theoretically begins to eat itself, spewing out detritus as a result.
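The feedback loop can be illustrated with a deliberately crude toy, far simpler than the experiments in the Nature paper: a 'model' that learns only the empirical range of its training data and resamples uniformly from it. Because each generation's samples fall inside the previous generation's range, the tails erode round after round.

```python
import random

def train_generation(data):
    # Toy "model": learn the empirical range of the training data, then
    # emit a synthetic dataset of the same size drawn uniformly from it.
    lo, hi = min(data), max(data)
    return [random.uniform(lo, hi) for _ in range(len(data))]

random.seed(0)
real = [random.uniform(-1.0, 1.0) for _ in range(200)]

data = real
for _ in range(50):
    # Each round trains only on the previous round's synthetic output.
    data = train_generation(data)

print(max(real) - min(real))  # close to 2.0
print(max(data) - min(data))  # noticeably smaller: the tails have eroded
```

The mechanism is the same one the paper describes at scale: rare values stop being sampled, so the next model never sees them, and diversity ratchets downward.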
Alexandr Wang, the chief executive of Scale AI—which leans heavily on a human workforce for labeling data used to train models—shared the findings from the Nature article on X, writing, 'While many researchers today view synthetic data as an AI philosopher's stone, there is no free lunch.' Wang said later in the thread that this is why he believes firmly in a hybrid data approach.
One of Gretel's cofounders pushed back on the Nature paper, noting in a blog post that the 'extreme scenario' of repetitive training on purely synthetic data 'is not representative of real-world AI development practices.'
Gary Marcus, a cognitive scientist and researcher who loudly criticizes AI hype, said at the time that he agrees with Wang's 'diagnosis but not his prescription.' The industry will move forward, he believes, by developing new architectures for AI models, rather than focusing on the idiosyncrasies of data sets. In an email to WIRED, Marcus observed that 'systems like [OpenAI's] o1/o3 seem to be better at domains like coding and math where you can generate—and validate—tons of synthetic data. On general purpose reasoning in open-ended domains, they have been less effective.'
Cretu believes the scientific theory around model collapse is sound. But she notes that most researchers and computer scientists are training on a mix of synthetic and real-world data. 'You might possibly be able to get around model collapse by having fresh data with every new round of training,' she says.
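Her suggestion can be sketched with a hypothetical toy model that learns only the empirical range of its data and resamples uniformly from it; trained purely on its own output, that range would contract each round, but mixing a fresh slice of real-world samples into every generation holds the distribution steady.

```python
import random

def learn_and_sample(data, n):
    # Toy "model": learn the empirical range of the training data, then
    # emit n synthetic points drawn uniformly from that range.
    lo, hi = min(data), max(data)
    return [random.uniform(lo, hi) for _ in range(n)]

def real_source(n):
    """Fresh samples from the true distribution (here, uniform on [-1, 1))."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

random.seed(0)
data = real_source(200)
for _ in range(50):
    # Each round: 80% synthetic output, 20% fresh real-world samples.
    data = learn_and_sample(data, 160) + real_source(40)

print(max(data) - min(data))  # stays close to 2.0 rather than contracting
```

The fresh samples keep re-seeding the tails that pure self-training would lose, which is the intuition behind the hybrid approaches described below.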
Concerns about model collapse haven't stopped the AI industry from hopping aboard the synthetic data train, even if they're doing so with caution. At a recent Morgan Stanley tech conference, Sam Altman reportedly touted OpenAI's ability to use its existing AI models to create more data. Anthropic CEO Dario Amodei has said he believes it may be possible to build 'an infinite data-generation engine,' one that would maintain its quality by injecting a small amount of new information during the training process (as Cretu has suggested).
Big Tech has also been turning to synthetic data. Meta has talked about how it trained Llama 3, its state-of-the-art large language model, using synthetic data, some of which was generated from Meta's previous model, Llama 2. Amazon's Bedrock platform lets developers use Anthropic's Claude to generate synthetic data. Microsoft's Phi-3 small language model was trained partly on synthetic data, though the company has warned that 'synthetic data generated by pre-trained large-language models can sometimes reduce accuracy and increase bias on down-stream tasks.' Google's DeepMind has been using synthetic data, too, but again, has highlighted the complexities of developing a pipeline for generating—and maintaining—truly private synthetic data.
'We know that all of the big tech companies are working on some aspect of synthetic data,' says Alex Bestall, the founder of Rightsify, a music licensing startup that also generates AI music and licenses its catalog for AI models. 'But human data is often a contractual requirement in our deals. They might want a dataset that is 60 percent human-generated, and 40 percent synthetic.'