Latest news with #syntheticdata

AI Development Issues Synthetic Data Can Help You Overcome

Forbes

07-07-2025

Business

As AI systems become more sophisticated, the challenges of training them effectively, and responsibly, continue to grow. The use of real-world data often comes with concerns and roadblocks: privacy risks, inconsistent formats, gaps in edge cases and regulatory hurdles can all slow development or skew outcomes. Synthetic data offers a promising alternative, delivering clean, scalable and customizable datasets that can augment, or even replace, traditional data in key use cases.

Below, members of Forbes Technology Council share real-world challenges that come with training AI systems and how synthetic data can help address them. Their insights highlight how developers can overcome data-related barriers while building smarter, safer AI models.

1. Lack Of Edge Case Data

Synthetic data can help address the challenge of edge cases in your real-world data, which, by definition, doesn't have enough examples to create a training set. The real-world data can be used to identify an edge case your AI may encounter, but you leverage synthetic data to create variations of that edge case for machine learning. This hybrid approach is often most effective in terms of cost, time and so on. - Radha Basu, iMerit

2. Inconsistency And Lack Of Control

One of the major challenges is inconsistency and lack of control. Real-world data is messy, biased and often incomplete, making it hard to scale or use reliably in training high-performance models. Synthetic data solves this by offering precision, balance and control at scale. Synthetic data gives AI developers the ability to test, stress and scale models in ways real-world data simply can't match. - Alexandre de Vigan, Nfinite

3. Unpredictable Training Environments

Remember the debate about synthetic versus petroleum-based oil? Similar to how older engines were run with nonsynthetic oil, legacy businesses use messy, sensitive and unpredictable real-world data, resulting in poor AI. For smart, modern businesses, synthetic data, like synthetic oil, allows developers to train models in predictable and controllable environments, ensuring strong performance in the real world. - Robert Clark, Cloverleaf Analytics

4. Privacy And Scale Limitations

Synthetic data avoids sensitive personal information, which reduces privacy concerns, and can be generated efficiently at scale, making it ideal for training large models. Real-world data, like patient records, is often messy and unpredictable. However, training only on synthetic data can limit a model's ability to perform well in complex, real-world scenarios. - Tim O'Connell, emtelligent

5. Incomplete And Biased Datasets

One challenge with real-world data is that it can be biased or incomplete, like teaching someone to drive only on city streets but not highways or country roads. Synthetic data fills these gaps, adding detail and diversity that real-world data might lack. However, synthetic data only leads to smarter, fairer, more reliable AI if it's high-quality and generated with standards to minimize bias. - China Widener, Deloitte

6. Sensitive Industry Restrictions

AI developers face the significant challenge of addressing issues related to data privacy. Obtaining relevant information within sensitive industries presents unique difficulties, especially when dealing with regulated elements. Synthetic data helps alleviate this burden. - Michael Gargiulo

7. 'Class Imbalance'

Real data has bias due to 'class imbalance.' Imagine a scenario where using AI for résumé screening fails since the model has been unintentionally trained on a dominant class (favoring male over female or giving more weight to a certain demographic), as that's what was available as historical data. Synthetic data can overcome this, as long as proper context is given for data generation. - Arjun Srinivasan, Wesco

8. Scarcity Of Rare Scenarios

I believe that AI developers often struggle with obtaining large, diverse datasets that include rare edge cases, such as unusual driving scenarios for autonomous vehicles, which are difficult and costly to capture in the real world. Synthetic data can generate these rare but critical conditions at scale, improving model robustness without compromising user privacy. - Mark Vena, SmartTech Research

9. Privacy And Distribution Barriers

Real-world data often comes with privacy constraints and regulatory friction. Synthetic data solves distribution gaps in training sets by generating edge cases for mission-critical systems. It allows AI developers to simulate realistic, diverse datasets without exposing sensitive information. - Andrey Kalyuzhnyy, 8allocate

10. Rare Event Modeling Needs

For AI models, the rule is often 'the more data, the better the model.' However, if you are trying to model events that rarely happen, such as communications involving insider trading, bribery, harassment and other such events, the only way to get enough data is to create it. By using synthetic data that is then reviewed by subject matter experts, you can have enough examples to create great models. - Vall Herard, Saifr

11. Voice Diversity Challenges

One key challenge with real-world voice data is obtaining sufficient diversity and volume, especially for rare accents, speaking styles or noisy environments. Synthetic voice data overcomes this by generating limitless, tailored examples, including difficult-to-capture scenarios, without privacy concerns. This enables training more robust AI models. - Harshal Shah

12. High Data Acquisition Costs

The simple acquisition of real-world data can be difficult and costly. Synthetic data can help, but it needs to be evaluated carefully for quality and assessed for its potential impact on the training of a model. - Leonard Lee, neXt Curve

13. Privacy-Conscious Experimentation

Real-world data often limits innovation due to privacy and regulatory barriers. Synthetic data helps AI developers simulate edge cases and future scenarios that don't yet exist. This enables safer experimentation, faster iteration and smarter models without compromising sensitive information. - Rishi Kumar, MatchingFit

14. Ignored Edge Users In CX Data

Real-world customer experience data often ignores edge users: the 'silent majority' who never complain; they just leave. Synthetic data enables you to simulate and operationalize a retention strategy before it's too late. It's not just a data problem; it's a CX risk. - April Ho-Nishimura, Infineon Technologies AG

15. Corrupted Or Low-Quality Real Data

The use of corrupted real-world datasets for training can silently compromise AI models and cause unreliable results. Synthetic data eliminates this risk by providing clean, controlled datasets when real-world data quality issues are affecting model performance. - Chongwei Chen, DataNumen, Inc.

16. Simulating Future Scenarios

Real-world data is stuck in yesterday's world: permissioned, fragmented and slow. Synthetic data isn't just a privacy workaround; it's a simulation engine. Developers can now model edge-case chaos, future scenarios or AI-on-AI interactions at scale, long before reality catches up. That's not a patch. That's evolution. - Akhilesh Sharma, A3Logics Inc.

17. Cross-Silo Collaboration Barriers

AI developers attempting to collaborate across silos (such as government agencies) where data sharing is challenging or explicitly forbidden are able to exchange synthetic datasets. This improves model stability and time to release by allowing multiple parties to share in the model evaluation process and reproduce bugs to broaden the troubleshooting audience. - Matthew Peters, CAI

18. Reactive Versus Proactive Modeling

A major challenge with real-world data is its stagnancy. It reflects what has been, not what could be. Synthetic data allows AI developers to generate rich, forward-looking scenarios that model emerging trends, unseen behaviors or disruptive events. It shifts AI from reactive to proactive, enabling systems to anticipate and adapt in a world that evolves faster than yesterday's data. - Sandipan Biswas

19. Inconsistent Labeling

Real-world data is messy. Labels are often inconsistent, even among experts, and that noise quietly limits how far your models can go. Synthetic data gives us clean, perfectly labeled ground truth. We use it to identify annotation errors and train models that handle uncertainty more effectively and surpass accuracy ceilings, without incurring the costs of relabeling. - Gavita Regunath, Advancing Analytics
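
To make the 'class imbalance' point in item 7 concrete, here is a minimal Python sketch of one common way to pad an underrepresented class with synthetic rows: interpolating between pairs of existing minority-class examples (a simplified, SMOTE-style idea). It is an illustration under stated assumptions, not a method described by any of the contributors. The function name `oversample_minority`, the field names and the toy data are all hypothetical, and every non-label feature is assumed to be numeric.

```python
import random


def oversample_minority(rows, label_key, seed=0):
    """Return rows plus synthetic rows so every class reaches the largest class size.

    Each synthetic row is a random point on the line segment between two
    existing rows of the same class (assumes all non-label values are numeric).
    """
    rng = random.Random(seed)

    # Group the original rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_label.values())
    balanced = list(rows)

    for label, group in by_label.items():
        for _ in range(target - len(group)):
            a, b = rng.choice(group), rng.choice(group)
            t = rng.random()  # interpolation weight in [0, 1)
            synthetic = {label_key: label}
            for key in a:
                if key != label_key:
                    synthetic[key] = a[key] + t * (b[key] - a[key])
            balanced.append(synthetic)

    return balanced


# Hypothetical toy data: group "A" has 8 rows, group "B" only 2.
rows = (
    [{"years_experience": 5.0 + i, "skills_score": 60.0 + i, "group": "A"} for i in range(8)]
    + [{"years_experience": 4.0 + i, "skills_score": 62.0 + i, "group": "B"} for i in range(2)]
)

balanced = oversample_minority(rows, "group")

counts = {}
for row in balanced:
    counts[row["group"]] = counts.get(row["group"], 0) + 1
print(counts)  # {'A': 8, 'B': 8}
```

Interpolation of this kind only evens out the class counts. It cannot add information the minority examples never contained, which is consistent with the caveat in item 7 that generation needs proper context and with the calls in items 5 and 12 to evaluate synthetic data for quality.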

Genesis AI launches with $105M seed funding from Eclipse, Khosla to build AI models for robots

TechCrunch

01-07-2025

Business

Genesis AI, a startup that aims to build a foundational model for powering all kinds of robots, has emerged from stealth with a giant $105 million seed round co-led by Eclipse Ventures and Khosla Ventures.

Founded last December by Zhou Xian, a Ph.D. in robotics from Carnegie Mellon University, and Théophile Gervet, a former research scientist with the French AI lab Mistral, the startup wants to build a general-purpose model that will enable robots to automate a wide range of repetitive tasks, from lab work to housekeeping.

Large language models are trained on vast datasets of text, but AI models for robotics must be trained on data on the physical world. However, acquiring that real-world data makes for a costly and time-consuming endeavor. To overcome that, Genesis is turning to synthetic data, which it generates using a proprietary physics engine that, it says, is capable of accurately modelling the physical world.

Genesis' synthetic data engine originated from an academic project that Xian led in collaboration with researchers from 18 universities. Several participants from that project have since joined Genesis, making up its current staff of over 20 researchers who specialize in robotics, machine learning and graphics.

Genesis claims its proprietary simulation engine allows it to develop models faster, a distinct advantage over competitors who rely on NVIDIA's software. Other companies working on developing general-purpose AI models for robots include Physical Intelligence, which raised a $400 million round; and Skild AI, which was valued at $4 billion earlier this year.

'It's a big unknown: Will anybody have a large robotics foundation model that will generalize across tasks? That's a bet we want to go after,' Kanu Gulati, a partner at Khosla Ventures, told TechCrunch. 'Of all the teams we have seen, we like [Genesis's] approach for going after robotics foundation models,' she added.

Genesis is developing its synthetic data and building the foundational model across two offices, in Silicon Valley and Paris. As the next milestone, Genesis plans to release its model to the robotics community by the end of the year.

Forbes

20-06-2025

Business

Great AI Needs Great (Synthetic) Data

Jennifer Chase is Chief Marketing Officer and Executive Vice President at SAS.

Every year, I am asked what marketing innovation I am most excited about, and for 2025, my answer may be surprising. I know you're probably expecting me to say AI agents or AI-created interactive marketing assets, but bear with me as I explain just why I think synthetic data generation should be the most hotly anticipated tech by marketers this year.

As marketers, we are not data poor. However, we are data starved. And by that, I mean marketers are starved of cost-effective, high-quality data that we can use to create hyper-personalized marketing. For AI models to run effectively, the model input data must be complete and of good quality. And too often, our datasets have gaping holes.

Synthetic data generation is a component of generative AI (GenAI), and with this tech, marketers can generate artificial datasets that share the attributes and characteristics of real customer data, but without any liabilities and limitations. According to Gartner, 'By 2026, 75% of businesses will use generative AI to create synthetic customer data, up from less than 5% in 2023.'

Why is this important? Well, for marketers, I believe there are three main reasons:

We need good quality data for the development of AI applications. However, this can be a challenge when privacy considerations and regulations are of utmost importance. Synthetic data can help with data privacy by creating data with the same patterns as real data, but with none of the identifying information. This level of data anonymity can help us safeguard personal data. As communications and marketing leaders, we are the trusted stewards of customer data, and I am excited about the role synthetic data can play in helping us protect it.

Eradicating bias in our datasets should be a paramount consideration for all marketers. Not only is bias unethical, but it also leads to inaccurate analyses that can negatively affect campaign and customer journey effectiveness. The wonder of synthetic data generation is that we can create more representative datasets. For instance, certain groups may be underrepresented, leading to biased model predictions. However, using synthetic data generation, we can create supplementary data for underrepresented groups, ensuring a fair distribution. Additionally, synthetic data can be designed to exclude biases that are often present in datasets.

Organizations spend a lot of time acquiring and preparing data. And it's not a one-time process. Data decays. The generation of synthetic data can help limit some of the associated costs that come with that decay. A great way to improve efficiency using synthetic data in marketing is using it to perform look-alike modeling. Using generated data with the same features, structures and attributes as real-life datasets can help brands identify new audiences quickly and at scale.

Something marketers probably don't spend much time thinking about is the cost of data labeling. This is a hidden cost associated with data analysis. Annotating large datasets is time-consuming and expensive. When using data-generation technology, make sure it's designed to include data labeling automatically.

Synthetic data has tremendous upside, from privacy protection to mitigating bias and reducing costs, all while improving overall marketing effectiveness. However, with this potential comes responsibility. Marketers must establish clear governance within their organization around when to use synthetic data. Beyond this, make sure you have defined guidelines for labeling and identifying the use of synthetic data to avoid misuse and misunderstanding.

As a CMO, I'm always looking for ways to reduce costs while not reducing effectiveness, and synthetic data fits this bill for me. With the myriad ways it can aid marketing, especially in rapid experimentation, I believe synthetic data is going to cement its place in the continued evolution of marketing.

Tahawul Tech

19-06-2025

Health

drug discovery Archives

The synthetic data, which SandboxAQ is releasing publicly, can be used to train AI models that can predict whether a new drug molecule is likely to stick to the protein researchers are targeting.