Chaos Engineering Pioneer Gremlin Launches Reliability Intelligence

In a digital landscape increasingly shaped by rapid deployment and AI-assisted development, maintaining system reliability is becoming both more critical and more complex. Gremlin, a longtime leader in Chaos Engineering, is stepping into this challenge with the launch of Reliability Intelligence—a new AI-powered solution aimed at helping organizations proactively identify, analyze, and resolve reliability risks in real time.
The new product, announced today, combines automated fault injection, continuous resilience analysis, and integration with large language models (LLMs) through a proprietary Model Context Protocol (MCP) server. The result is a deeply integrated system that allows businesses to reduce downtime and improve performance across increasingly dynamic software stacks.
"The Gremlin team has been managing complex online systems for decades," said Kolton Andrus, CEO of Gremlin. "We know that you can't just throw LLMs at the hard engineering problems involved with building and maintaining business-critical systems. Reliability Intelligence will provide actionable recommendations based on a deep understanding of your systems architecture and its dependencies across various cloud providers and third-party services." AI-Powered Reliability, Grounded in Real Engineering
Gremlin's move comes as companies accelerate software deployment cycles with the help of AI. According to the latest DORA (DevOps Research and Assessment) report, teams are now shipping code to production 70% faster thanks to AI coding assistants. But with that speed comes risk: AI-generated code is often error-prone and difficult to debug, increasing the potential for outages.
Traditionally, practices like Chaos Engineering have offered a solution—but they require specialized expertise that's still relatively rare. Gremlin's answer is to lower the barrier to entry, making proactive reliability more accessible and automated.
Recent features like Reliability Scoring, Intelligent Health Checks, Dependency Discovery, and Executive Reporting have already moved the platform in this direction. With the addition of Reliability Intelligence, Gremlin is aiming to make proactive reliability a default, rather than an elite practice.

Key Capabilities in the New Release

  • Experiment Analysis: Automatically compares test outcomes to expected behavior using LLMs. It can detect anomalies, understand test context, and determine pass/fail status—previously a manual task.
  • Recommended Remediation: After identifying a failure, the system offers engineers specific, actionable fixes drawn from a library of best practices and millions of past test results.
  • MCP Server: Enables LLMs to query telemetry and trace data directly. Users can generate insights or build dashboards using plain language—bringing powerful observability tools to a wider set of users.
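To make the experiment-analysis idea above concrete, here is a minimal sketch of automated pass/fail classification: observed metrics from a fault-injection run are compared against a steady-state baseline, and any metric that drifts beyond a tolerance is flagged as an anomaly. The function name, metric names, and threshold are illustrative assumptions, not Gremlin's actual implementation (which the article says also uses LLMs to understand test context).

```python
# Hedged sketch: classify a chaos experiment as pass/fail by comparing
# observed metrics against a steady-state baseline. All names and the
# 20% tolerance are illustrative, not Gremlin's real logic.

def analyze_experiment(baseline: dict, observed: dict, tolerance: float = 0.20) -> dict:
    """Flag any metric that drifted more than `tolerance` from its baseline."""
    anomalies = {}
    for metric, expected in baseline.items():
        actual = observed.get(metric)
        if actual is None:
            anomalies[metric] = "missing from observations"
            continue
        drift = abs(actual - expected) / expected
        if drift > tolerance:
            anomalies[metric] = f"drifted {drift:.0%} (expected ~{expected}, got {actual})"
    return {"status": "fail" if anomalies else "pass", "anomalies": anomalies}

# Example: p99 latency spiked during a simulated dependency outage.
baseline = {"p99_latency_ms": 120, "error_rate": 0.01}
observed = {"p99_latency_ms": 480, "error_rate": 0.01}
result = analyze_experiment(baseline, observed)
print(result["status"])  # prints "fail": latency drifted far beyond tolerance
```

In a real system the baseline would come from historical telemetry rather than hard-coded values, and the anomaly descriptions would feed the remediation step described next.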
"In high-velocity environments, reliability can't be an afterthought," said Arul Martin, Director of Performance Engineering at Sephora. "Reliability Intelligence equips SRE and performance teams with deep, real-time insights from telemetry and trace data — enabling early detection of reliability regressions, faster root cause isolation, and proactive remediation without disrupting release velocity." A New Era of Reliability Engineering
As businesses increasingly rely on AI to accelerate development, the challenges associated with maintaining the health and performance of online systems have never been greater. Gremlin is positioning Reliability Intelligence as a critical piece of the modern SRE toolset, blending helpful AI guidance with the rigor of battle-tested engineering.
For modern teams navigating complex environments, the ability to test, understand, and improve system resilience continuously is no longer a luxury—it's a necessity for staying accountable and keeping the guardrails on.

Related Articles

Chaos Engineering Pioneer Gremlin Launches Reliability Intelligence
Int'l Business Times, 11-08-2025

4 Modern IT Startups on the Rise in 2025
Int'l Business Times, 30-06-2025

Several sectors are seeing a surge in promising startups in 2025, including AI, healthcare, fintech, and sustainable energy. In the world of software, we're seeing coding copilots like Cursor and trends like vibe coding take off and increase code velocity by upwards of 70%, according to the latest DORA report. So what does this mean for modern engineering teams that are responsible for the uptime and performance of these AI-driven applications? For managing and maintaining their costs? AI is moving at speeds no one could have anticipated, leaving platform and engineering teams in a reactive position. Below, we highlight 4 startups that are growing fast, improving reliability, providing AI guardrails, and keeping costs in check:

Gremlin

We anticipate 2025 to be a big year for Gremlin. They burst onto the scene in late 2017, pioneering the cutting-edge discipline of Chaos Engineering, which involves running attacks and experiments on your own online systems in order to identify weaknesses. As a practice, Chaos Engineering was embraced internally at places like Netflix and Amazon, so it's no surprise that Gremlin CEO and Founder Kolton Andrus is an alumnus of both companies. Being early to market, Gremlin decided to hyper-focus on product over the past few years, building out world-class enterprise features that enable modern teams to get the most out of their proactive reliability efforts. They've built out reliability scoring, intelligent health checks, dependency discovery, and executive reporting to ensure enterprises can run experiments safely and validate their efforts. With AI on the rise and off the leash, it's essential that modern engineering teams have the tools to safely run experiments and ensure that AI-driven development doesn't mean a sacrifice in reliability and performance.
Finout

Finout has quickly become a preferred financial operations management solution for modern teams looking to track and optimize their spending. The Finout platform consolidates all of your cloud expenses across the major cloud providers and 3rd-party SaaS services into one centralized dashboard. Whether your company runs on AWS or Azure, whether it uses Datadog or Snowflake, the Finout platform offers native integrations to quickly consolidate and visualize your spending data, then analyze it with their unique virtual tagging. Finout is also the only solution on this list that does not charge users for its cost optimization service. This means that any money saved on AWS, generated by Finout's solution, is money that stays with the customer.

Causely

Causely is a new player on the observability scene. The main problem their platform addresses is that modern teams are drowning in too many alerts and too much data coming from multiple observability solutions across open-source and 3rd-party vendors. Their causal reasoning platform automatically pinpoints root causes in the endless sea of observability data, helping engineers avoid unnecessary manual effort. The company recently announced support for Grafana, so that engineers can instantly see the "why" behind performance issues in the context of their services, significantly cutting resolution time when there's an alert that needs to be addressed. Causely also plugs into Grafana Alertmanager, enriching existing alerts with real-time, continuously updated root-cause intelligence. This AI-powered capability goes beyond sending alerts when something is wrong, digging into where the problem originated and what to do next within the incident response workflow. The company has teased a future that removes humans from the loop and involves more automation in IT management.
It's promising to see startups on the operations side coming up with creative and efficient solutions that keep pace with AI-driven development.

Espresso AI

Espresso AI launched in the summer of last year with $11 million in funding and a straightforward message: users can save up to 70% on their Snowflake bill with some help from AI. The CEO, Ben Lerner, worked at Google DeepMind, which is made up of scientists, engineers, ethicists, and more working to build the next generation of AI systems safely and responsibly. Their solution leverages large language models (LLMs) and machine learning algorithms to optimize code and reduce cloud compute costs automatically. According to the company, Espresso AI is like Kubernetes for Snowflake: it can intelligently route queries across warehouses to increase utilization and cut costs. "Snowflake alone has $2 billion in annual revenue. If you look across data warehousing broadly, it's certainly hundreds of millions of dollars in revenue for us, and billions in potential savings for customers," said Lerner when the company launched in 2024. The software vendors that perform best in this market are the ones that avoid selling hype and provide real value for their customers. In a sea of software, it can be hard for businesses to know which solutions to leverage and how to keep up. We believe these are four solutions worth considering seriously.

What a new AI protocol means for journalists
DW, 25-06-2025

Coding agents and the Model Context Protocol are reshaping journalism's digital toolkit, enabling small newsrooms to build capable tools, but also raising new questions about responsibility. There is a "rupture in journalism around AI," as media researcher David Caswell puts it. And this rift runs similarly through other sectors where people work with knowledge and words. Yet beyond the often heated debates about the usefulness and drawbacks of generative AI, which allow little room for nuance, there is bustling activity: besides the large tech corporations, countless IT companies, startups, and individual enthusiasts are working on building an infrastructure for artificial intelligence (AI)—which currently primarily means large language models (LLMs). Through this activity, two areas have emerged in recent months that carry significant implications for journalistic work. Methods and approaches that were previously only possible with great effort, steep learning curves, or high costs are now becoming accessible: programming and operating complex software applications. Through LLMs, journalists' digital toolkits are now directly connected to an entire hardware store with its huge variety of equipment.

Coding agents

Using a coding agent means software is semi-autonomously writing other software itself. This approach is also called "vibe coding"—presumably because it involves a rather fluid, iterative, and dialogic way of working. The user engages with the oscillations of large language models; with LLMs, there's always an element of chance involved. These tools enable something that previously posed difficulties for people who couldn't program themselves: implementing their ideas in practice. Yes, you still need a basic understanding of digital technologies and software coding.
But until recently, digital projects, whether designing a new website or creating small tools, usually required a human software developer: for example, to collect information via crowdsourcing or to automate the processing of recurring datasets. And for more complex projects, designers were also needed for the interface (user interface, UI) and functionality (user experience, UX). Coding agents now take on this work: from developing the UI and the structure, to setting up a database, to publication (deployment). The resulting code "belongs" to you; everything is based on open-source software. This means prototypes and ideas can be further developed elsewhere. The tools are particularly suitable for web applications (web apps). For instance, I have used coding agents to retrieve real-time data from German railways via their official OpenData interface (API) and build a tool for collecting, visualizing, and analyzing delay data; or to quickly craft a streamlined interface that allowed me to easily read and search through an extensive archive of messenger data. Finally, what started as the idea of a simple tool for quickly transcribing long radio pieces became DIVER, a tool developed entirely by coding agents without human developers: it summarizes podcasts and newsletters, analyzes them, and provides users with overview reports and recommendations for podcast episodes and newsletter issues. It should be clear: data-sensitive applications intended for use in critical areas should still not be developed without the involvement of professional human programmers. However, it is also clear that the increase in performance of these tools will likely continue for some time. This means that even larger software projects will soon be implementable quickly and cost-effectively. Meanwhile, there are about two dozen providers whose products differ in nuances, approaches, and focus areas — a selection can be found at the end of the article.
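The core of a delay-data collector like the one described above is small. The sketch below shows the filtering step only; the field names and record shape are hypothetical stand-ins, not the actual railway OpenData schema, and a real tool would fetch these records from the API rather than hard-code them.

```python
# Hedged sketch of the delay-data tool's core: filter departure records down
# to notable delays. Field names ("train_id", "delay_minutes", etc.) are
# assumptions for illustration, not the real OpenData API schema.

def extract_delays(records: list, min_delay_min: int = 5) -> list:
    """Return departures delayed at least `min_delay_min` minutes, worst first."""
    rows = [
        {
            "train": r.get("train_id", "?"),
            "station": r.get("station", "?"),
            "delay_minutes": r.get("delay_minutes", 0),
        }
        for r in records
        if r.get("delay_minutes", 0) >= min_delay_min
    ]
    return sorted(rows, key=lambda r: r["delay_minutes"], reverse=True)

# Sample records shaped the way such an API response might look:
sample = [
    {"train_id": "ICE 123", "station": "Berlin Hbf", "delay_minutes": 12},
    {"train_id": "RE 7", "station": "Koeln Hbf", "delay_minutes": 2},
    {"train_id": "IC 2045", "station": "Hamburg Hbf", "delay_minutes": 25},
]
for row in extract_delays(sample):
    print(f"{row['train']} at {row['station']}: +{row['delay_minutes']} min")
```

The point of the coding-agent workflow is that the surrounding plumbing (API calls, storage, charts, a web UI) is exactly the part the agent generates for you.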
Almost all offer free entry or even a daily free budget. Typical prices for the first payment tier range from 15 to 20 euros per month. Google recently entered the ring with its "Firebase Studio"; LLM maker Anthropic agreed to work with Apple on a coding agent for Xcode, Apple's development environment. And OpenAI recently purchased the development environment "Windsurf" for $3 billion, which can also be used as a coding agent.

Model Context Protocol

In spring 2025, a crucial component of the AI infrastructure took shape: the Model Context Protocol (MCP). The open protocol was introduced by Anthropic, known for its Claude chatbot, in late 2024. OpenAI, Microsoft, and others have since joined. So what does it do? "Think of MCP like a USB-C port for AI applications," says the opening line on the standard's specification website. Or to illustrate it differently: those who have seen the film The Matrix might remember this scene. Trinity and Neo stand within the digital world of the Matrix in front of a helicopter. Neo asks her: can you fly this? She answers: not yet. A few seconds later, she has downloaded the corresponding skill and can pilot the helicopter. MCP reads the handbook to steer complicated software for you. MCP positions itself as an intermediate layer between existing software applications and data sources on one hand, and large language models on the other. This means programs like Photoshop, 3D software like Blender and Cinema 4D, and music programs like Ableton, which come with complex user interfaces and steep learning curves, can now be operated via language: users describe via chat, spoken or written, what they need or want to change. The MCP server then controls the software; it has essentially read the manual and knows how to operate it. The result of its activity can then be observed immediately and, if necessary, tested.
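For the technically curious, MCP messages are ordinary JSON-RPC 2.0 under the hood, and `tools/call` is the standard method a client uses to invoke a capability a server exposes. The sketch below builds one such request; the tool name and arguments are hypothetical (a telemetry-query tool of the kind the Gremlin article describes), not any real server's API.

```python
# Minimal sketch of an MCP tool invocation on the wire. The envelope
# (JSON-RPC 2.0, method "tools/call") follows the MCP specification;
# the tool name "query_telemetry" and its arguments are hypothetical.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",          # standard MCP method for invoking a tool
    "params": {
        "name": "query_telemetry",   # hypothetical tool exposed by a server
        "arguments": {
            "service": "checkout",
            "metric": "p99_latency_ms",
            "window": "15m",
        },
    },
}
print(json.dumps(request, indent=2))
```

The "USB-C port" analogy comes from this uniformity: every application wrapped by an MCP server accepts the same envelope, so any MCP-capable LLM client can drive it.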
In this respect, MCP enables something comparable to the WYSIWYG approach (what you see is what you get) in graphics programs. This should be particularly interesting for data journalists. There is already an MCP solution for RStudio, an application for using "R" for statistical analysis. It can be controlled via prompts like "Load the dataset and create a scatterplot of X vs. Y with a trend line." It is only a matter of time before visualization tools like Datawrapper can be controlled via chat through MCP to create map and data visualizations.

What does this mean for journalists?

As often happens, such technological change presents a double-edged sword. On one hand, it directly empowers individuals and resource-poor newsrooms with working methods that were previously unaffordable: extensive research, persistent monitoring, complex analyses, and sophisticated digital formats. Used skillfully and knowledgeably, this can significantly increase the quality of journalism. On the other hand, it's also clear that these new tools will be used to obscure and complicate research: for example, to create credible forgeries of digital content. And the major AI corporations see the future in digital employees and digital twins — digital representatives of a person. It's still unclear exactly how this will take shape. Amazon, for example, recently showed how an LLM agent can operate directly on webpages in a browser. And Google introduced the open A2A protocol, which is intended to regulate the communication and coordination of AI agents with each other, regardless of who manufactured them. What kind of relationship should journalism maintain with digital representatives? How are their "actions" to be evaluated? How can responsibility be clarified and assigned when the digital activity of a digital twin has concrete effects on people?
These are all open questions. The new possibilities offered by coding agents and the simplification of operating complex software applications will help find answers to them.

List of coding agents

  • Magic Patterns (user interface)
  • V0 (for beginners, but still powerful)
  • Lovable (beginners and advanced users)
  • Replit Agent (beginners and advanced users)
  • Windsurf (professional users)

Lorenz Matzat is a journalist and software producer working on LLM tools for journalism and knowledge retention. In 2017, he co-founded the NGO AlgorithmWatch, where he led its Research & Development efforts until 2022.
