The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

Fexingo 73 Episodes Jul 4, 2026

Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept — error budgets, toil automation, postmortem culture, capacity planning — and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers — latency percentiles, MTTR trends, SLO burn rates — while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your s

Episodes

How SRE Teams Use Error Budgets to Balance Reliability and Velocity Jul 4, 2026 11:49 In this episode, Lucas and Luna dive into the concept of error budgets—a cornerstone of Site Reliability Engineering that defines how much unreliability a team can tolerate while still meeting their Service Level Objectives. They explore how error budgets help SRE teams make data-driven trade-offs between shipping new features and maintaining system stability. Using examples from Google's original
How SRE Teams Use Incident Metrics to Improve Response Jul 3, 2026 9:41 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the world of incident metrics — not just DORA or SLOs, but the specific numbers that help SRE teams get faster and better at incident response. They discuss mean time to acknowledge, mean time to resolve, and the controversial metric of mean time between failures, using real examples from a major cloud provider's 2023 outage
How SRE Teams Use Cost Optimization to Reduce Cloud Waste Jul 3, 2026 8:36 Episode 88 of The Site Reliability Podcast with Fexingo dives into how SRE teams can cut cloud costs without sacrificing reliability. Lucas and Luna discuss the rise of FinOps, the hidden waste in over-provisioned resources, and how Google, Netflix, and Airbnb use committed use discounts, spot instances, and right-sizing to save millions. Learn the concrete metrics—like cost per transaction and id
How SRE Teams Use Toil Budgets to Protect Engineering Time Jul 2, 2026 11:31 Episode 87 of The Site Reliability Podcast explores toil budgets — a practice Google SRE pioneered to cap repetitive, non-valuable operational work. Lucas and Luna break down why Google set a 50% toil limit, how to measure toil versus engineering, and why companies like Etsy and Netflix use toil budgets to protect innovation time. They also discuss common pitfalls: treating all toil equally and fo
How SRE Teams Use Structured Fails to Learn Faster Jul 2, 2026 10:56 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams deliberately inject small, controlled failures into production not to break things but to build collective learning. They dissect the approach used by a major payments company that runs weekly 'structured fail' exercises where engineers intentionally trigger a known-category incident (latency spike, partial data
How SRE Teams Use Post-Incident Reviews for System Improvements Jul 1, 2026 8:47 In Episode 85 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams turn post-incident reviews into actionable system improvements. They focus on a real-world case: a major streaming service's 2023 outage caused by a cascading failure in their content delivery network. The hosts break down the review process, from timeline reconstruction to root cause analysis to implementing preve
How SRE Teams Use Capacity Planning to Prevent Outages Jul 1, 2026 7:39 In this episode of The Site Reliability Podcast, Lucas and Luna dive into capacity planning for SRE teams — the proactive discipline that keeps systems running when traffic spikes. Using the example of a major streaming platform's 2024 holiday season surge, they break down how capacity planning differs from simple scaling, why it's part of reliability engineering, and how teams use traffic forecas
How SRE Teams Use Chaos Engineering to Build Resilient Systems Jun 30, 2026 11:45 Lucas and Luna dive into chaos engineering, using Netflix's Chaos Monkey and the Simian Army as the prime example. Lucas explains how Netflix intentionally broke its own systems in production to uncover weaknesses before they caused real outages, citing the tool's origin story from 2011 and its evolution into a formal discipline. Luna challenges the notion that chaos experiments are too risky for
How SRE Teams Use Cost of Delay to Prioritize Reliability Work Jun 30, 2026 12:48 Episode 82 of The Site Reliability Podcast examines how cost of delay — a concept borrowed from product development — helps SRE teams decide which reliability projects to tackle first. Lucas and Luna walk through a real example from a mid-sized fintech company that used cost of delay to justify migrating from a legacy database to a distributed SQL solution. The episode explains how to calculate co
How SRE Teams Use Latency Budgets to Meet Performance SLOs Jun 29, 2026 9:06 Lucas and Luna dive into latency budgets — a less-discussed SRE tool that maps acceptable delay across each microservice in a user request chain. They use the example of a social media app's photo upload feature: if the overall latency SLO is 500 milliseconds, the team allocates 50 ms to the auth service, 200 ms to the image processing service, and so on. Lucas explains how Google's internal SRE t
How SRE Teams Use Runbooks to Streamline Incident Response Jun 29, 2026 13:37 In episode 80 of The Site Reliability Podcast, Lucas and Luna dive into the practical world of runbooks — the step-by-step guides that SRE teams use to respond to incidents faster and more consistently. They explore how runbooks reduce cognitive load during high-stress outages, why documenting the 'why' behind each step prevents dangerous cargo-culting, and how a major streaming service cut its me
How SRE Teams Use Observability to Reduce Mean Time to Detect Jun 28, 2026 8:56 Episode 79 of The Site Reliability Podcast looks at how modern SRE teams are using observability tools to shrink mean time to detect — the gap between a system failure and the team knowing about it. Hosts Lucas and Luna break down why observability goes beyond traditional monitoring, using real-world examples like a major e-commerce platform that cut MTTD from 12 minutes to under 90 seconds by shi
