Latest news with #SiteReliabilityEngineering


Korea Herald
16-05-2025
- Business
Huawei Cloud Credence Forum Singapore 2025: Enhancing Enterprise Quality and Efficiency through Cloud Resilience
SINGAPORE, May 16, 2025 /PRNewswire/ -- On May 16, 2025, the Huawei Cloud Credence Forum Singapore 2025 brought together more than 30 global industry leaders and technology experts. The forum provided a platform for in-depth discussions on building and operating resilient cloud services, fostering innovation, and improving both quality and operational efficiency in the digital era.

Industry Perspective: Navigating Enterprise Cloud Security in Multi-Cloud Environments

Mr. Maxi Wang, Chief Executive Officer of Huawei International, emphasised: "With the rapid development of technologies, traditional operations models must evolve to meet the dynamic needs of modern enterprises. Therefore, we must continuously innovate and explore together to adapt effectively to the new requirements of cloud infrastructure, which is exactly today's topic: accelerating innovation in cloud resilience. Earlier this year, the rise of DeepSeek ignited the adoption of enterprise intelligence across diverse industries, presenting new challenges to cloud infrastructure resilience."

Huawei Cloud's Deterministic Operations: Ensuring Security, Stability, and High Quality for Global Customers

Alex An, Director of the SRE Department, highlighted: "Huawei Cloud achieved zero major incidents in 2024, demonstrating industry-leading quality standards. Huawei Cloud has localized the Site Reliability Engineering (SRE) concept, creating a deterministic operations system. This comprehensive system integrates quality culture, high-availability architectural design, dynamic risk management, and intelligent operations, enabling cloud customers to consistently achieve predictable business outcomes. Utilizing global networks, risk management, and digital twin technologies, Huawei Cloud Security Services swiftly restore and enhance customer experiences. The launch of the Huawei Cloud Credence Forum Singapore marks a significant step in its global expansion, promoting local exploration and application of intelligent operations."

Huawei Cloud's Intelligence Service Practices: Delivering Business Certainty and High Availability

Jayson Zheng, SRE Director of Huawei Cloud Application Services, stated: "Huawei Cloud's enterprise intelligence services now support over 600 commercial customers. Our high-performance computing platforms enable efficient job scheduling for large-scale clusters, ensuring over 90% MFU and linear scalability while meeting high-availability needs. We provide high availability for cloud-based intelligence services, supporting process-level recovery of training tasks and PD separation of inference tasks. The operations and maintenance platform ensures an SLO exceeding 99.95%, MFU over 50%, and MTTR under 30 minutes, effectively helping customers optimise models and improve system throughput."

Panel Discussion: Leveraging Cloud Resilience to Boost Development Agility and Operational Efficiency

A dynamic panel session was moderated by Mr. Evan Cheng, Senior Vice President of Huawei Cloud Continuous Operation & Delivery. It featured prominent industry experts including Mr. Jim Lim, Vice President of the Cloud Security Alliance; Mr. Gan XingPing, CIO of NatSteel; Ms. Ariel Lin, Founder and Director of Flex-Solver; and Dr. Zhang Xi, Huawei Cloud AI Expert. The panelists shared insights on how modern cloud resilience strategies can enhance enterprise agility and operational efficiency. Discussions covered critical topics such as cloud security standards, seamless integration of intelligence into business processes to drive enterprise modernization, and real-world applications of intelligent operations and maintenance (O&M). The conversation emphasised the pivotal role of intelligent O&M in boosting system reliability in today's digital era.
The session concluded with strong alignment on these themes, reinforcing a shared commitment to supporting sustainable growth and digital transformation initiatives.

Official Launch of Huawei Cloud Credence Club Singapore: Promoting Industry Innovation and Development

Huawei Cloud also officially launched the Huawei Cloud Credence Club in Singapore, marking a major step in its commitment to digital transformation and operational excellence across industries. The launch was attended by the initial cohort of Credence members and key business leaders from the region. The Huawei Cloud Credence Club will join forces with leading local enterprises to explore emerging technology trends, drive innovation with operational technologies, share best practices, and foster progress in industry transformation and technological advancement.

Huawei Cloud: Accelerating Innovations in Cloud Resilience

Ms Gigi Hu, Managing Director of Huawei Cloud Singapore, concluded: "Leveraging extensive global experience and professional expertise, Huawei Cloud has established a deterministic operations and maintenance system to ensure customer businesses operate safely and stably amid complex and changing environments. The system significantly enhances their business quality and operational efficiency, empowering them to achieve their goals effectively."

Looking ahead, Huawei Cloud aims to expand the Credence Club's role as a catalyst for innovation and collaboration. Engaging global experts and industry leaders, the club will focus on overcoming technological barriers, scaling intelligent operations, and elevating cloud productivity. The ultimate objective is to achieve reliable, intelligent operations, resource efficiency, and business agility, transforming O&M into a key enabler of an intelligent world.


India.com
16-05-2025
- Business
Keeping the Cloud Steady: How Muthuraman Saminathan Masters the Art of Site Reliability Engineering
Site Reliability Engineering, or SRE, started at Google back in 2003. Not many people talked about it at the time; it was kind of under the radar. Things changed when cloud computing took off. Suddenly, every company needed its systems to stay online all the time. Now even a small issue, like one misstep in the setup or a delay no one notices, can cause major problems and cost a lot of money. SRE folks deal with exactly that kind of stuff. It's not just about writing code; it's also about keeping things from breaking in the real world. Today there are so many services, different cloud platforms, and regulations to follow, and the job has only gotten tougher. That's why having someone like Muthuraman Saminathan, who has seen and handled a lot, really helps.

An Engineer at the Heart of Reliability

Muthuraman Saminathan's route to SRE authority began in data-intensive financial services and moved through high-performance computing at global energy-technology platforms. A master's degree in engineering gave him the algorithmic grounding. A career that hops confidently between Java, Go, Kubernetes, and three hyperscale clouds provided the battlefield testing. At Equifax, he migrated high-throughput pipelines to Google Cloud, trimming ingestion times by 15 percent while threading every request through GDPR and CCPA guardrails. 'When you own an employment-data feed measured in terabytes, there's no boutique outage; there's only a headline outage,' he observes.

Today, at a leading energy-technology provider with clusters on Azure and GCP, Saminathan leads a 24×7 follow-the-sun SRE team. He designed the multitenant control plane for the firm's HPC platform and monitors hundreds of Kubernetes nodes. He also publishes monthly cost-consumption dossiers that have already shaved five percent from the global bill. 'I view every dashboard as an executive résumé of the system. It must speak plainly about saturation, error budgets, and spend, even to someone who has never written a line of code,' he says.
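Saminathan's dashboards report on error budgets. The Huawei Cloud item above cites an SLO exceeding 99.95% alongside an MTTR target. The sketch below is a rough illustration of how such numbers relate. It computes an availability error budget as (1 - SLO) × window length and compares that budget with observed outage minutes.

It rests on several assumptions:
- a 30-day window;
- each outage counts as full downtime;
- MTTR is the simple mean outage duration.

The function and field names are hypothetical and are not taken from either company's tooling.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class BudgetReport:
    budget_minutes: float     # total downtime the SLO allows in the window
    consumed_minutes: float   # downtime actually observed
    remaining_minutes: float  # budget left (negative means the SLO was missed)
    mttr_minutes: float       # mean time to restore across incidents


def error_budget_report(slo: float, window_days: int,
                        outages_minutes: list[float]) -> BudgetReport:
    """Summarise an availability SLO over a window (hypothetical helper).

    Assumption: every outage is full downtime, so its duration is both the
    budget it consumes and the time it took to restore service.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.9995")
    # Error budget = allowed failure fraction times the window length.
    budget = (1.0 - slo) * window_days * MINUTES_PER_DAY
    consumed = sum(outages_minutes)
    # MTTR as the mean outage duration; zero when there were no incidents.
    mttr = consumed / len(outages_minutes) if outages_minutes else 0.0
    return BudgetReport(budget, consumed, budget - consumed, mttr)


if __name__ == "__main__":
    # Hypothetical month with two incidents lasting 8 and 10 minutes.
    report = error_budget_report(0.9995, 30, [8.0, 10.0])
    print(f"budget={report.budget_minutes:.1f} min, "
          f"remaining={report.remaining_minutes:.1f} min, "
          f"MTTR={report.mttr_minutes:.1f} min")
```

For a 99.95% SLO over 30 days, the budget is 0.0005 × 43,200 minutes, or about 21.6 minutes. The two hypothetical incidents consume 18 of those minutes, which leaves roughly 3.6 minutes and gives an MTTR of 9 minutes.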
His breadth, from Kafka to JanusGraph and Spring Boot to Spark, means he can trace a performance glitch from the API gateway down to storage IOPS, then automate the remedy in Terraform.

Lessons from the Front Line

Saminathan's approach rests on three principles. First, observability before optimization: 'You cannot improve what you cannot see; I refuse to patch performance until I have end-to-end traces, logs, and metrics proving the real culprit,' he explains. This rigor paid off when he uncovered idle rules burning compute credits overnight, a quiet leak masked by normal weekday traffic patterns.

Second, automation with empathy. His team scripts every repeatable fix, yet he insists on post-incident reviews that surface human factors: handover gaps, alert fatigue, and ambiguous runbooks. 'The pipeline isn't just YAML and Bash; it's people interpreting symptoms at 2 a.m. Empathy tightens the feedback loop faster than any cron job,' he cautions.

Third, cloud-agnostic design. Having deployed on AWS, GCP, and Azure, he treats vendor APIs as interchangeable adapters, not permanent dependencies. During his Equifax tenure he abstracted authentication layers across OAuth2, IAM roles, and service principals, enabling seamless failover between regions and providers. The same pattern now underpins his multicloud HPC fabric, where workloads shift toward the cheapest GPU hour without breaking compliance attestations or audit trails.

Colleagues point to his knack for translating reliability math into business value. A redesigned ingestion pipeline might sound esoteric. But when it lifts data-freshness guarantees from six hours to near real time, it unlocks new credit-scoring products and faster fraud-investigation workflows. Likewise, a five-percent infrastructure saving funds the next round of feature experiments.
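The cloud-agnostic principle described above can be sketched as two parts: a provider-neutral adapter interface, and a scheduler that picks the cheapest compliant GPU capacity. This is a minimal illustration under stated assumptions, not Saminathan's actual design.

- `CloudAdapter`, `StaticAdapter`, `UnreachableAdapter`, `GpuOffer`, and `cheapest_compliant_offer` are hypothetical names.
- The adapters return hard-coded offers instead of calling real cloud APIs.
- Compliance is reduced to a single boolean per offer.
- All prices are invented.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOffer:
    provider: str
    region: str
    usd_per_gpu_hour: float
    compliant: bool  # True if this region satisfies the workload's compliance rules


class CloudAdapter(ABC):
    """Hypothetical provider-neutral interface: one adapter per cloud."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def authenticate(self) -> str:
        """Obtain a credential using the provider's own mechanism
        (e.g. an IAM role, a service principal, or an OAuth2 token)."""

    @abstractmethod
    def gpu_offers(self) -> list[GpuOffer]:
        """Return the GPU capacity and price this provider currently offers."""


class StaticAdapter(CloudAdapter):
    """Illustrative adapter with hard-coded offers; makes no real cloud calls."""

    def __init__(self, name: str, offers: list[GpuOffer]) -> None:
        super().__init__(name)
        self._offers = list(offers)

    def authenticate(self) -> str:
        return f"token-for-{self.name}"

    def gpu_offers(self) -> list[GpuOffer]:
        return list(self._offers)


class UnreachableAdapter(StaticAdapter):
    """Simulates a provider outage by failing authentication."""

    def authenticate(self) -> str:
        raise ConnectionError(f"{self.name} is unreachable")


def cheapest_compliant_offer(adapters: list[CloudAdapter]) -> GpuOffer:
    """Pick the lowest-priced GPU offer that stays inside the compliance boundary.

    Providers that fail to respond are skipped, so placement fails over to the rest.
    """
    candidates: list[GpuOffer] = []
    for adapter in adapters:
        try:
            adapter.authenticate()
            offers = adapter.gpu_offers()
        except ConnectionError:
            continue  # fail over: ignore this provider for this scheduling round
        candidates.extend(o for o in offers if o.compliant)
    if not candidates:
        raise RuntimeError("no compliant GPU capacity available")
    return min(candidates, key=lambda o: o.usd_per_gpu_hour)


if __name__ == "__main__":
    # Hypothetical prices, for illustration only.
    fleet = [
        StaticAdapter("aws", [GpuOffer("aws", "eu-west-1", 3.10, True)]),
        StaticAdapter("gcp", [GpuOffer("gcp", "europe-west4", 2.90, True),
                              GpuOffer("gcp", "us-central1", 2.40, False)]),
        UnreachableAdapter("azure", [GpuOffer("azure", "westeurope", 2.50, True)]),
    ]
    best = cheapest_compliant_offer(fleet)
    print(f"schedule on {best.provider}/{best.region} at ${best.usd_per_gpu_hour:.2f}/GPU-hour")
```

Each provider hides its own authentication behind `authenticate()`, so the scheduler never branches on vendor, and an unreachable provider simply drops out of the candidate list. In the usage example, the cheaper `us-central1` offer is excluded as non-compliant and the failing Azure adapter is skipped. The `europe-west4` offer is therefore selected.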
His product-centric mindset stems from an MBA-style certificate in Product Strategy from Kellogg. The credential is unusual among SREs, and it nudges him to present error-budget policy in the language of revenue protection.

India's Moment in Global Reliability

For engineers in India, SRE has always been a field where what you know matters more than how big your team is. Cloud consumption has been rising steeply across the country, yet experienced SREs remain few and far between. Muthuraman Saminathan's journey points to a clear path. First, get the backend-systems foundations down. Next, take on the hardest challenges in highly regulated industries. Then get comfortable with multi-cloud. The reward is influence that cuts across development and operations and is highly portable. That influence is sought after by everyone from Bengaluru fintech start-ups to Fortune 100 giants. As he puts it, 'India's technologists already run some of the world's largest payment rails; honing SRE discipline will let them run the world's most reliable ones.'

When outages make primetime news and regulators start talking about uptime mandates, SRE stops being an afterthought and truly becomes the nervous system of digital business. Engineers like Muthuraman Saminathan become the neurosurgeons working in the background, ensuring every heartbeat reaches the cloud and comes back stronger.