
Latest news with #TritonInferenceServer

'Burn the boats': To stay at the bleeding edge, AI developers are trashing old tech fast

Business Insider

27-04-2025



It's not uncommon for AI companies to fear that Nvidia will swoop in and make their work redundant. But when it happened to Tuhin Srivastava, he was perfectly calm. "This is the thing about AI: you gotta burn the boats," Srivastava, the cofounder of AI inference platform Baseten, told Business Insider. He hasn't burned his quite yet, but he's bought the kerosene.

The story goes back to when DeepSeek took the AI world by storm at the beginning of this year. Srivastava and his team had been working with the model for weeks, but it was a struggle. The problem was a tangle of AI jargon, but essentially, inference, the computing process that happens when AI generates outputs, needed to be scaled up to quickly run these big, complicated reasoning models. Multiple elements were hitting bottlenecks and slowing delivery of the model's responses, making it far less useful for Baseten's customers, who were clamoring for access.

Srivastava's company had access to Nvidia's H200 chips, the best widely available chip that could handle the advanced model at the time, but Nvidia's inference platform was glitching. A software stack called Triton Inference Server was getting bogged down by all the inference required for DeepSeek's reasoning model R1, Srivastava said. So Baseten built its own, which it still uses now.

Then, in March, Jensen Huang took the stage at Nvidia's massive GTC conference and launched a new inference platform: Dynamo. Dynamo is open-source software that helps Nvidia chips handle the intensive inference used for reasoning models at scale. "It is essentially the operating system of an AI factory," Huang said onstage.

"This was where the puck was going," Srivastava said. Nvidia's arrival wasn't a surprise. When the juggernaut inevitably surpasses Baseten's equivalent platform, the small team will abandon what it built and switch, Srivastava said. He expects the move will take a couple of months at most. "Burn the boats."

And it's not just Nvidia, with its massive team and a research and development budget to match, making tools. Machine learning is constantly evolving. Models get more complex and require more computing power and engineering genius to work at scale, then shrink again when those engineers find new efficiencies and the math changes. Researchers and developers are balancing cost, time, accuracy, and hardware inputs, and every change reshuffles the deck.

"You cannot get married to a particular framework or a way of doing things," said Karl Mozurkewich, principal architect at cloud firm Valdi.

"This is my favorite thing about AI," said Theo Browne, a YouTuber and developer whose company, Ping, builds AI software for other developers. "It takes these things that the industry has historically treated as super valuable and holy and just makes them incredibly cheap and easy to throw away," he told BI.

Browne spent the early years of his career coding for big companies like Twitch. When he saw a reason to start over on a coding project instead of building on top of it, he faced resistance, even when starting over would save time or money. The sunk-cost fallacy reigned. "I had to learn that rather than waiting for them to say, 'No,' do it so fast they don't have the time to block you," Browne said.

That's the mindset of many bleeding-edge builders in AI. It's also often what sets startups apart from large enterprises.
Quinn Slack, CEO of AI coding platform Sourcegraph, frequently explains this to his customers when he meets with Fortune 500 companies that may have built their first round of AI tools on shaky foundations. "I would say 80% of them get there in an hourlong meeting," he said.

The firmer ground is up the stack

Ben Miller, CEO of real estate investment platform Fundrise, is building an AI product for the industry, and he doesn't worry much about having the latest model. If a model works for its purpose, it works, and moving up to the latest innovation is unlikely to be worth the engineering hours. "I'm sticking with what works well enough for as long as I can," he said.

That's partly because Miller runs a large organization, but it's also because he's building farther up the stack. That stack consists of hardware at the bottom, usually Nvidia's GPUs, with layers upon layers of software above it. Baseten sits a few layers up from Nvidia. The AI models, like R1 and GPT-4o, sit a few layers up from Baseten. And Miller is just about at the top, where the consumers are.

"There's no guarantee you're going to grow your customer base or your revenue just because you're releasing the latest bleeding-edge feature," Mozurkewich said. "When you're in front of the end-user, there are diminishing returns to moving fast and breaking things."

Nvidia Dynamo — Next-Gen AI Inference Server For Enterprises

Forbes

25-03-2025



Dynamo Inference Server

At the GTC 2025 conference, Nvidia introduced Dynamo, a new open-source AI inference server designed to serve the latest generation of large AI models at scale. Dynamo is the successor to Nvidia's widely used Triton Inference Server and represents a strategic leap in Nvidia's AI stack. It is built to orchestrate AI model inference across massive GPU fleets with high efficiency, enabling what Nvidia calls AI factories to generate insights and responses faster and at lower cost. This article provides a technical overview of Dynamo's architecture, its features, and the value it offers enterprises.

At its core, Dynamo is a high-throughput, low-latency inference-serving framework for deploying generative AI and reasoning models in distributed environments. It integrates into Nvidia's full-stack AI platform as the operating system of AI factories, connecting advanced GPUs, networking, and software to maximize inference performance. Nvidia CEO Jensen Huang underscored Dynamo's significance by comparing it to the dynamos of the Industrial Revolution, catalysts that converted one form of energy into another; here, raw GPU compute is converted into valuable AI model outputs at unparalleled scale.

Dynamo aligns with Nvidia's strategy of providing end-to-end AI infrastructure and is built to complement Nvidia's new Blackwell GPU architecture and AI data center solutions. For example, Blackwell Ultra systems provide the immense compute and memory for AI reasoning, while Dynamo provides the intelligence to use those resources efficiently.

Dynamo is fully open source, continuing Nvidia's open approach to AI software. It supports popular AI frameworks and inference engines, including PyTorch, SGLang, Nvidia's TensorRT-LLM, and vLLM, so enterprises and startups can adopt Dynamo without rebuilding their models from scratch; it integrates with existing AI workflows. Major cloud and technology providers, including AWS, Google Cloud, Microsoft Azure, Dell, and Meta, are already planning to integrate or support Dynamo, underscoring its strategic importance across the industry.

Dynamo is designed from the ground up to serve the latest reasoning models, such as DeepSeek R1. Serving large LLMs and highly capable reasoning models efficiently requires approaches beyond what earlier inference servers provided, and Dynamo introduces several key architectural innovations to meet these needs:

  • Dynamic GPU Planner: Adds or removes GPU workers based on real-time demand, preventing over-provisioning and underutilization of hardware. In practice, if user requests spike, Dynamo can temporarily allocate more GPUs to handle the load, then scale back, optimizing utilization and cost.

  • LLM-Aware Smart Router: Routes incoming AI requests across a large GPU cluster to avoid redundant computation. It tracks what each GPU holds in its knowledge cache (the part of memory storing recent model context) and sends each query to the node best primed to handle it. This context-aware routing avoids repeatedly re-computing the same content and frees capacity for new requests (a toy routing sketch appears at the end of this article).

  • Low-Latency Communication Library (NIXL): Provides state-of-the-art, accelerated GPU-to-GPU data transfer and messaging, abstracting away the complexity of moving data across thousands of nodes. By reducing communication overhead and latency, this layer ensures that splitting work across many GPUs doesn't become a bottleneck. It works across different interconnects and networking setups, so enterprises benefit whether they use ultra-fast NVLink, InfiniBand, or Ethernet clusters.

  • Distributed Memory (KV) Manager: Offloads and reloads inference data (particularly the key-value cache produced during prior token generation) to lower-cost memory or storage tiers when appropriate. Less critical data can reside in system memory or even on disk, cutting expensive GPU memory usage, yet be retrieved quickly when needed. The result is higher throughput and lower cost without hurting the user experience (a tiered-cache sketch appears at the end of this article).

  • Disaggregated Serving: Traditional LLM serving performs all inference steps, from processing the prompt to generating the response, on the same GPU or node, which often underutilizes resources. Dynamo instead splits these stages into a prefill stage that interprets the input and a decode stage that produces the output tokens, and the two can run on different sets of GPUs (sketched at the end of this article).

As AI reasoning models become mainstream, Dynamo represents a critical infrastructure layer for enterprises looking to deploy these capabilities efficiently. By improving speed, scalability, and affordability, it changes the economics of inference, allowing organizations to deliver advanced AI experiences without a proportional rise in infrastructure costs. For CXOs prioritizing AI initiatives, Dynamo offers a path to both immediate operational efficiencies and longer-term strategic advantage in an increasingly AI-driven competitive landscape.
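To make the smart router's cache awareness concrete, here is a minimal Python sketch of the idea. It is an illustration only, not Dynamo's actual API: the class name, the block-hashing scheme, and the scoring rule are all hypothetical stand-ins.

# Toy sketch of cache-aware routing in the spirit of Dynamo's
# LLM-aware smart router. All names are hypothetical; this is not
# Dynamo's API.
from collections import defaultdict

class CacheAwareRouter:
    def __init__(self, workers):
        self.workers = workers              # e.g. ["gpu-0", "gpu-1"]
        self.cached = defaultdict(set)      # worker -> prefix-block hashes
        self.load = defaultdict(int)        # worker -> in-flight requests

    def _prefix_hashes(self, tokens, block=64):
        # Hash the prompt in fixed-size blocks (as paged KV caches do),
        # so prompts sharing a prefix can reuse the same cache entries.
        return {hash(tuple(tokens[:i + block]))
                for i in range(0, len(tokens), block)}

    def route(self, prompt_tokens):
        hashes = self._prefix_hashes(prompt_tokens)
        # Prefer the worker holding the most cached prefix blocks;
        # break ties by picking the least-loaded worker.
        best = max(self.workers,
                   key=lambda w: (len(hashes & self.cached[w]), -self.load[w]))
        self.cached[best] |= hashes         # that worker now holds this context
        self.load[best] += 1
        return best

router = CacheAwareRouter(["gpu-0", "gpu-1"])
first = router.route(list(range(200)))
assert router.route(list(range(200)) + [7]) == first  # shared prefix, same node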
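The KV manager's offloading can be pictured as a tiered cache. The sketch below, again with invented names, keeps a small "GPU" tier and spills least-recently-used entries to a cheaper "host" tier instead of discarding them, so a later hit avoids a full recompute.

# Toy sketch of tiered KV-cache offload, loosely modeling Dynamo's
# distributed memory manager. Names and capacities are illustrative.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, gpu_slots=2):
        self.gpu = OrderedDict()    # scarce, fast tier (GPU memory)
        self.host = {}              # cheap, slower tier (system RAM or disk)
        self.gpu_slots = gpu_slots

    def put(self, key, kv_blocks):
        self.gpu[key] = kv_blocks
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_slots:
            cold, blocks = self.gpu.popitem(last=False)  # evict the LRU entry...
            self.host[cold] = blocks                     # ...but offload, don't discard

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)           # refresh recency
            return self.gpu[key]
        if key in self.host:
            self.put(key, self.host.pop(key))   # reload into the fast tier
            return self.gpu[key]
        return None                             # true miss: needs a fresh prefill

cache = TieredKVCache()
for session in ("a", "b", "c"):
    cache.put(session, f"kv-for-{session}")
assert cache.get("a") == "kv-for-a"             # "a" was offloaded, not lost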
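Finally, disaggregated serving separates prompt processing from token generation. The toy pipeline below shows the shape of that handoff; in a real deployment the two stages run on different GPU pools and the KV state moves over NIXL rather than a Python queue, and every name here is made up for illustration.

# Toy sketch of disaggregated serving: a prefill stage and a decode
# stage connected by an explicit handoff, standing in for separate
# GPU pools. Purely illustrative, not Dynamo's implementation.
import queue

prefill_jobs = queue.Queue()   # prompts waiting for the prefill pool
decode_jobs = queue.Queue()    # (prompt, kv) pairs waiting for decode

def prefill_step():
    # Run the full prompt through the model once to build its KV cache.
    prompt = prefill_jobs.get()
    kv = f"kv({prompt})"              # stand-in for the real KV tensors
    decode_jobs.put((prompt, kv))     # hand off to the decode pool

def decode_step():
    # Generate output tokens, reusing the KV cache built during prefill.
    prompt, kv = decode_jobs.get()
    return f"tokens for '{prompt}' generated from {kv}"

prefill_jobs.put("What is a dynamo?")
prefill_step()        # would run on prefill-optimized GPUs
print(decode_step())  # would run on a separate, decode-optimized pool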
