The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering

How SRE Teams Use Error Budgets to Balance Reliability and Velocity Jul 4, 2026 11:49 In this episode, Lucas and Luna dive into the concept of error budgets—a cornerstone of Site Reliability Engineering that defines how much unreliability a team can tolerate while still meeting their Service Level Objectives. They explore how error budgets help SRE teams make data-driven trade-offs between shipping new features and maintaining system stability. Using examples from Google's original

How SRE Teams Use Incident Metrics to Improve Response Jul 3, 2026 9:41 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the world of incident metrics — not just DORA or SLOs, but the specific numbers that help SRE teams get faster and better at incident response. They discuss mean time to acknowledge, mean time to resolve, and the controversial metric of mean time between failures, using real examples from a major cloud provider's 2023 outage

How SRE Teams Use Cost Optimization to Reduce Cloud Waste Jul 3, 2026 8:36 Episode 88 of The Site Reliability Podcast with Fexingo dives into how SRE teams can cut cloud costs without sacrificing reliability. Lucas and Luna discuss the rise of FinOps, the hidden waste in over-provisioned resources, and how Google, Netflix, and Airbnb use committed use discounts, spot instances, and right-sizing to save millions. Learn the concrete metrics—like cost per transaction and id

How SRE Teams Use Toil Budgets to Protect Engineering Time Jul 2, 2026 11:31 Episode 87 of The Site Reliability Podcast explores toil budgets — a practice Google SRE pioneered to cap repetitive, non-valuable operational work. Lucas and Luna break down why Google set a 50% toil limit, how to measure toil versus engineering, and why companies like Etsy and Netflix use toil budgets to protect innovation time. They also discuss common pitfalls: treating all toil equally and fo

How SRE Teams Use Structured Fails to Learn Faster Jul 2, 2026 10:56 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams deliberately inject small, controlled failures into production not to break things but to build collective learning. They dissect the approach used by a major payments company that runs weekly 'structured fail' exercises where engineers intentionally trigger a known-category incident (latency spike, partial data

How SRE Teams Use Post-Incident Reviews for System Improvements Jul 1, 2026 8:47 In Episode 85 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams turn post-incident reviews into actionable system improvements. They focus on a real-world case: a major streaming service's 2023 outage caused by a cascading failure in their content delivery network. The hosts break down the review process, from timeline reconstruction to root cause analysis to implementing preve

How SRE Teams Use Capacity Planning to Prevent Outages Jul 1, 2026 7:39 In this episode of The Site Reliability Podcast, Lucas and Luna dive into capacity planning for SRE teams — the proactive discipline that keeps systems running when traffic spikes. Using the example of a major streaming platform's 2024 holiday season surge, they break down how capacity planning differs from simple scaling, why it's part of reliability engineering, and how teams use traffic forecas

How SRE Teams Use Chaos Engineering to Build Resilient Systems Jun 30, 2026 11:45 Lucas and Luna dive into chaos engineering, using Netflix's Chaos Monkey and the Simian Army as the prime example. Lucas explains how Netflix intentionally broke its own systems in production to uncover weaknesses before they caused real outages, citing the tool's origin story from 2011 and its evolution into a formal discipline. Luna challenges the notion that chaos experiments are too risky for

How SRE Teams Use Cost of Delay to Prioritize Reliability Work Jun 30, 2026 12:48 Episode 82 of The Site Reliability Podcast examines how cost of delay — a concept borrowed from product development — helps SRE teams decide which reliability projects to tackle first. Lucas and Luna walk through a real example from a mid-sized fintech company that used cost of delay to justify migrating from a legacy database to a distributed SQL solution. The episode explains how to calculate co

How SRE Teams Use Latency Budgets to Meet Performance SLOs Jun 29, 2026 9:06 Lucas and Luna dive into latency budgets — a less-discussed SRE tool that maps acceptable delay across each microservice in a user request chain. They use the example of a social media app's photo upload feature: if the overall latency SLO is 500 milliseconds, the team allocates 50 ms to the auth service, 200 ms to the image processing service, and so on. Lucas explains how Google's internal SRE t

How SRE Teams Use Runbooks to Streamline Incident Response Jun 29, 2026 13:37 In episode 80 of The Site Reliability Podcast, Lucas and Luna dive into the practical world of runbooks — the step-by-step guides that SRE teams use to respond to incidents faster and more consistently. They explore how runbooks reduce cognitive load during high-stress outages, why documenting the 'why' behind each step prevents dangerous cargo-culting, and how a major streaming service cut its me

How SRE Teams Use Observability to Reduce Mean Time to Detect Jun 28, 2026 8:56 Episode 79 of The Site Reliability Podcast looks at how modern SRE teams are using observability tools to shrink mean time to detect — the gap between a system failure and the team knowing about it. Hosts Lucas and Luna break down why observability goes beyond traditional monitoring, using real-world examples like a major e-commerce platform that cut MTTD from 12 minutes to under 90 seconds by shi

How SRE Teams Use Service Level Agreements to Set Expectations Jun 28, 2026 8:33 Lucas and Luna dive into the often-overlooked difference between Service Level Agreements (SLAs) and Service Level Objectives (SLOs) in site reliability engineering. They explore how SLAs are not just legal documents but critical tools for managing stakeholder expectations, using a real-world case from a major cloud provider. The episode explains the 99.9% vs 99.99% uptime debate, the cost implica

How SRE Teams Use Canary Deployments to Reduce Risk Jun 27, 2026 10:50 Episode 77 of The Site Reliability Podcast dives into canary deployments: rolling out code changes gradually to a small subset of users before a full release. Lucas and Luna explain how companies like Netflix and Etsy use canary analysis to catch regressions early, using real traffic and metrics. They walk through the mechanics: routing a fraction of traffic, comparing key SLOs like latency and er

How SRE Teams Use DORA Metrics to Measure DevOps Performance Jun 27, 2026 10:23 In this episode of The Site Reliability Podcast, Lucas and Luna dive into DORA metrics — the four key DevOps Research and Assessment measures that elite SRE teams use to quantify software delivery and operational performance. They break down each metric: deployment frequency, lead time for changes, mean time to restore (MTTR), and change failure rate. The hosts explain how Google's 2019 Accelerate

How SRE Teams Use Service Level Objectives to Drive Reliability Jun 26, 2026 10:53 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical use of Service Level Objectives (SLOs) in site reliability engineering. They discuss how a major European bank reduced pager fatigue by 40% by shifting from alert-based monitoring to SLO-based error budgets. Lucas explains the difference between SLIs, SLOs, and SLAs, and why measuring user-facing latency is mor

How SRE Teams Use Blameless Culture to Improve Incident Response Jun 26, 2026 8:26 In this episode of The Site Reliability Podcast, Lucas and Luna dive into how a blameless culture can actually improve incident response times and reduce recurrence. They explore a real case from a mid-size SaaS company that cut its mean time to resolution by 40 percent after adopting blameless postmortems. Lucas breaks down the psychological safety factors that make engineers more willing to shar

How SRE Teams Use Blameless Postmortems to Build Trust Jun 25, 2026 8:26 In Episode 73 of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems transform incident response culture. Using examples from a major e-commerce platform's 2024 database outage, they break down the difference between blame and accountability, explain why 'human error' is a shallow root cause, and share how one team cut repeat incidents by 40% just by rewiring their post-

How SRE Teams Use Fault Tree Analysis to Prevent Root Causes Jun 25, 2026 11:49 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams apply fault tree analysis (FTA) from aerospace and nuclear engineering to reduce incident recurrence. Using a real 2025 outage at a major streaming platform where a cascading DNS failure took down services for 47 minutes, they break down the top-down logic of FTA, how it differs from postmortem 5 whys, and why te

How SRE Teams Use AI for Incident Triage and Root Cause Analysis Jun 24, 2026 11:02 Episode 71 of The Site Reliability Podcast with Fexingo dives into how SRE teams are applying large language models and AI assistants to accelerate incident triage and root cause analysis. Lucas and Luna examine a real case from a mid-sized e-commerce platform: after a database connection pool exhaustion caused a 14-minute partial outage, the on-call engineer used a locally-run AI tool to correlat

How SRE Teams Use Game Days to Test Incident Response Jun 24, 2026 6:55 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practice of game days — structured simulations where SRE teams deliberately inject failures to test their incident response and on-call processes. They discuss a real example from a major streaming platform that runs quarterly game days to validate runbooks and reduce mean time to resolve from 45 minutes to under 15. The

How SRE Teams Use Error Budgets to Balance Reliability and Velocity Jun 23, 2026 9:00 In this episode of The Site Reliability Podcast, Lucas and Luna explore how error budgets help SRE teams make data-driven trade-offs between reliability and feature velocity. Using Google’s original framework and a real-world example from a major e-commerce platform, they explain how setting a 99.9% SLO with a 0.1% error budget per quarter creates explicit room for innovation without risking catas

How SRE Teams Use Infrastructure as Code to Prevent Configuration Drift Jun 23, 2026 11:03 In Episode 68 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use infrastructure as code (IaC) to prevent configuration drift — the silent killer of production reliability. They break down a real incident at a mid-sized fintech company where a manual SSH change caused a partial outage, and how the team rebuilt their entire environment with Terraform and automated compliance c

How SRE Teams Use Incident Response Playbooks Jun 22, 2026 7:54 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use incident response playbooks to standardize their reaction to common outages. They break down what makes a good playbook—specific, testable, and owned by a single team—using concrete examples like a Redis cluster failover and a database connection pool exhaustion. Lucas explains the difference between a playbo

How SRE Teams Use Readiness Checks to Prevent Bad Deployments Jun 22, 2026 8:07 Site reliability teams spend huge effort on monitoring and alerting—but some of the worst outages start the moment a deployment goes live. In this episode, Lucas and Luna break down how readiness checks, or health probes, act as the first line of defense against bad code reaching production. Using the example of a major Kubernetes rollout gone wrong at a large e-commerce company, they explain the

How SRE Teams Use Cost Attribution to Prioritize Reliability Work Jun 21, 2026 8:54 Episode 65 of The Site Reliability Podcast digs into a practical framework SRE teams use to tie infrastructure costs to specific services and teams. Lucas and Luna break down how cost attribution works, why it helps prioritize reliability investments, and a real example from a major streaming platform that saved millions by charging back observability costs to feature teams. Learn how to move from

How SRE Teams Use Toil Budgets to Automate Smarter Jun 21, 2026 7:57 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams are using toil budgets to prioritize automation and reduce operational overhead. They dive into Google's original SRE definition of toil—manual, repetitive, non-value-added work—and explain how teams set toil budgets as a percentage of total engineering time, typically around 50 percent. The hosts discuss a real-

How SRE Teams Use Capacity Planning to Prevent Outages Jun 20, 2026 9:21 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams are shifting from reactive scaling to proactive capacity planning. They dive into the story of a major streaming service that averted a disastrous holiday outage by using predictive models based on historical traffic patterns and user growth data. The hosts break down the key metrics—like requests per second, mem

How SRE Teams Use Capacity Planning to Prevent Outages Jun 20, 2026 9:21 In this episode of The Site Reliability Podcast, Lucas and Luna explore the art and science of capacity planning in SRE. They walk through a concrete case: how a major streaming platform used predictive modeling to avoid a holiday-season outage after underestimating user growth in a new market. Lucas breaks down the two main approaches — reactive vs. proactive planning — and explains why the best

SRE Teams Are Using Chaos Engineering to Test Resilience Jun 19, 2026 10:56 In Episode 61 of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into chaos engineering—the disciplined practice of intentionally injecting failures into production systems to uncover weaknesses before they cause real outages. They explore the origins at Netflix and the emergence of tools like Chaos Monkey, Litmus, and Gremlin. The hosts discuss how SRE teams at companies like Amazo

How SRE Teams Use Postmortem Action Items to Prevent Recurrence Jun 19, 2026 8:15 In Episode 60, Lucas and Luna dive into the most overlooked part of incident response: the postmortem action items that actually prevent the same outage from happening twice. They unpack a 2025 study from Google's SRE team that found 67% of postmortem action items are never completed, and explore why. Using concrete examples from a major AWS S3 outage and a Stripe payment-processing incident, they

How SRE Teams Use Incident Severity Classification to Prioritize Response Jun 18, 2026 9:15 Episode 59 of The Site Reliability Podcast explores how SRE teams classify incidents by severity to decide how fast to respond and who to page. Lucas and Luna break down real-world classification frameworks — from SEV-1 (service down, all hands on deck) to SEV-4 (minor hiccup, fix in the next sprint). They discuss why vague severity definitions lead to alert fatigue and slow response times, and ho

How SRE Teams Use Post-Incident Reviews as Learning Tools Jun 18, 2026 9:25 Episode 58 of The Site Reliability Podcast with Fexingo digs into post-incident reviews — not as blame sessions or compliance checkboxes, but as structured learning mechanisms. Lucas and Luna examine Google's seminal 2016 Titan key outage to illustrate how root cause analysis misses the point if teams don't ask 'why' five times. They discuss the difference between finding a 'root cause' and unders

How SRE Teams Use Cost of Delay to Prioritize Reliability Work Jun 17, 2026 9:43 Lucas and Luna explore how SRE teams at companies like Spotify and Etsy use 'cost of delay' — a concept borrowed from product management — to quantify the business impact of reliability work. Lucas explains the math behind deferring a reliability project, using a real-world example: a payment-processing team deciding whether to fix a latency issue or build a new feature. Luna pushes back on the di

How SRE Teams Reduce Incident Noise with Intelligent Alert Routing Jun 17, 2026 9:11 Episode 56 of The Site Reliability Podcast explores how SRE teams at companies like Airbnb and Etsy use intelligent alert routing to slash incident noise by over 60 percent. Lucas and Luna break down the evolution from on-call pagers to modern event-driven routing, explain how machine learning models classify alerts by severity and team ownership, and discuss the trade-off between routing accuracy

How SRE Teams Use Incident Cost Analysis to Prioritize Reliability Investments Jun 16, 2026 9:07 Episode 55 of The Site Reliability Podcast with Fexingo dives into incident cost analysis — a growing practice at companies like Google and Stripe where SRE teams assign a dollar value to every outage minute. Lucas and Luna break down the methodology: how to quantify direct revenue loss, reputational damage, and opportunity cost from incidents, and how that data helps teams justify automation spen

How SRE Teams Use On-Call Compensation to Prevent Burnout Jun 16, 2026 8:37 Most SRE teams talk about incident response and automation, but fewer talk about the human side of on-call: how to pay people fairly for the disruption. Lucas and Luna dig into a 2025 survey of 500 SREs that found 62% feel on-call pay does not match the cognitive load. They compare models — flat stipend versus per-incident pay — and discuss how companies like Honeycomb and PagerDuty structure thei

SRE Teams Use SLO Burn Rate Alerts to Detect Incidents Faster Jun 15, 2026 9:09 Site reliability engineering has a well-known failure mode: your pager goes off at 2 AM for a minor blip, or worse, you don't get paged until a full-blown outage has already hit users. This episode explains SLO burn rate alerts — a concept that Google's SRE team refined in their 2016 book and which is now baked into tools like Google Cloud Monitoring, Datadog, and Grafana. Lucas and Luna walk thro

How SRE Teams Use Software Bill of Materials for Supply Chain Security Jun 15, 2026 9:43 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the growing importance of the Software Bill of Materials (SBOM) for securing software supply chains. They use the 2024 XZ Utils backdoor as a concrete case study to explain how a single maintainer's burnout led to a critical vulnerability that an SBOM could have caught earlier. Lucas breaks down what an SBOM is, how it works w

How SRE Teams Use Feature Flags to Reduce Deployment Risk Jun 14, 2026 9:37 In Episode 51 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use feature flags—not just for canary releases, but as a core tool to decouple deployment from release, reduce blast radius, and enable instant rollback without redeploying. They walk through a real incident at a major streaming company where a misconfigured flag caused a 47-minute partial outage, and how the team

How SRE Teams Use Stress Testing to Simulate Real Workloads Jun 14, 2026 11:19 Lucas and Luna explore how production stress testing goes beyond standard load testing to simulate realistic user behavior, with a deep dive into how a major streaming platform used session replay and gradual ramp-up to validate infrastructure before a global event. They unpack why stress testing must replicate authentication flows, API call patterns, and edge case traffic shapes — not just raw re

How SRE Teams Use Game Days to Build Incident Muscle Memory Jun 13, 2026 8:46 Lucas and Luna explore how site reliability engineering teams use game days — structured, simulated incident exercises — to prepare for real outages. They break down the approach used by a major fintech company that runs quarterly game days for its entire on-call rotation, with concrete scenarios like a simulated database failover and a DNS misconfiguration. The episode covers how game days differ

How SRE Teams Use Error Budgets to Align Risk and Velocity Jun 13, 2026 8:48 In episode 48 of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into error budgets — the SRE concept that turns reliability into a business decision rather than a purely technical one. They break down how Google originally defined error budgets via the Service Level Indicator (SLI) / Service Level Objective (SLO) / error budget framework, then explore how teams at companies like Sh

How SRE Teams Use SLIs to Define Reliability Jun 12, 2026 7:12 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the often-overlooked first step of SRE practice: defining Service Level Indicators (SLIs). They explore how vague uptime percentages fail to capture user experience and walk through a concrete example from a major streaming platform that shifted from a 'five nines' target to a more granular SLI based on video start latency.

How SRE Teams Use Cognitive Load Management to Prevent Burnout Jun 12, 2026 9:47 Episode 46 of The Site Reliability Podcast with Fexingo dives into how SRE teams are applying cognitive load theory to reduce burnout and improve incident response. Lucas and Luna explore the concept of 'cognitive load' — the mental effort required to operate complex systems — and how teams at companies like Google and Netflix use techniques like toil reduction, documentation, and team topologies

How SRE Teams Use Observability to Find Unknown Unknowns Jun 11, 2026 10:06 Episode 45 of The Site Reliability Podcast digs into observability—how modern SRE teams go beyond monitoring to discover the 'unknown unknowns' that cause the worst outages. Lucas and Luna break down the difference between watching known metrics (CPU, memory) and exploring unknown failure modes with structured events and high-cardinality data. They walk through a real example: a major e-commerce p

How SRE Teams Use Dependency Graphs to Predict Outages Jun 11, 2026 7:50 In this episode of The Site Reliability Podcast, hosts Lucas and Luna explore how SRE teams at major tech companies build and maintain dependency graphs to predict cascading failures before they happen. Using concrete examples from cloud infrastructure and microservices architectures, they explain how graph-based service maps help teams identify single points of failure, model blast radius, and pr

How SRE Teams Use Toil Budgets to Prioritize Automation Jun 10, 2026 9:03 In Episode 43 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams are adopting 'toil budgets' — a concept inspired by error budgets — to cap the amount of manual, repetitive work engineers do each sprint. They break down Google's internal definition of toil (hands-on work with no enduring value), how a toil budget works alongside an error budget, and a concrete case from a mid-sized

How SRE Teams Use Service Level Objectives to Drive Daily Decisions Jun 10, 2026 8:54 This episode explores how Site Reliability Engineering teams use Service Level Objectives (SLOs) not just as a quarterly dashboard metric, but as a real-time decision-making tool that shapes pager rotations, deployment gating, and incident prioritization. Lucas walks through how Shopify's SRE team used a 99.95% availability SLO to flag a critical degradation before it became a full outage in 2025

How SRE Teams Use Canary Deployments to Reduce Release Risk Jun 9, 2026 8:32 Lucas and Luna dive into canary deployments: the practice of routing a small percentage of production traffic to a new version before rolling it out broadly. Lucas explains why Netflix's 'canary clusters' and Etsy's 'feature flipping' approach revolutionized how SRE teams think about release risk, and contrasts it with the old all-at-once deploys that caused major incidents. They discuss specific

How SRE Teams Use Chaos Engineering to Test Resilience Jun 9, 2026 10:50 In episode 40 of The Site Reliability Podcast, Lucas and Luna dive into chaos engineering — the practice of intentionally breaking systems to find weaknesses before real incidents strike. They explore how Netflix pioneered the approach with Chaos Monkey, the lessons SRE teams can learn from controlled failure experiments, and how to start small with simple game days that simulate a database partit

How SRE Teams Use Capacity Planning to Prevent Outages Jun 8, 2026 10:19 Episode 39 of The Site Reliability Podcast with Fexingo dives into capacity planning as a proactive SRE practice. Lucas and Luna explore how teams at companies like Google and Netflix use trend analysis, load testing, and headroom budgeting to avoid capacity-related outages. They discuss a real-world case from 2025 where a major streaming service averted a Super Bowl crash by scaling capacity week

How SRE Teams Use Immutable Infrastructure to Eliminate Configuration Drift Jun 8, 2026 9:18 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use immutable infrastructure to eliminate configuration drift and improve reliability. They dive into a real case from Google's Borg paper, explaining how replacing mutable servers with golden images reduces incident rates and recovery times. The hosts break down the trade-offs with mutable servers, the role of i

How SRE Teams Use Auto-Remediation to Resolve Incidents Without Humans Jun 7, 2026 12:29 In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna explore how SRE teams are using auto-remediation to resolve incidents without human intervention. They break down the anatomy of an auto-remediation pipeline — from monitoring alerts to automated runbook execution — using real-world examples like a major streaming service that reduced pager fatigue by 40 per

How SRE Teams Use Incident Command Systems to Coordinate Response Jun 7, 2026 9:34 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the incident command system (ICS) model that large-scale SRE teams borrow from emergency services to manage complex outages. They walk through a real example: a major payment processing incident at a fintech company where a database migration triggered a cascading failure affecting three million users. Lucas explains the fou

How SRE Teams Use Blameless Postmortems to Build Better Systems Jun 6, 2026 8:58 In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and trac

How SRE Teams Use Postmortems That Actually Change Behavior Jun 6, 2026 8:17 In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and

How SRE Teams Use Runbook Automation to Reduce Human Error Jun 5, 2026 8:14 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical side of runbook automation — moving beyond static documentation to executable, automated responses. They explore how companies like Google and Netflix use runbook automation to reduce mean time to repair by up to 60%, and discuss the common pitfalls: over-automation, stale runbooks, and the tension between spee

How SRE Teams Use Cost Optimization to Balance Performance and Budget Jun 5, 2026 6:48 In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into the often-overlooked intersection of site reliability engineering and cloud cost optimization. They explore how SRE teams at companies like Uber and Airbnb use techniques such as right-sizing instances, leveraging spot instances, and implementing autoscaling policies to reduce infrastructure spend without sacrif

How SRE Teams Use Load Shedding to Survive Traffic Spikes Jun 4, 2026 9:51 When a massive traffic spike hits, every millisecond of latency can cost thousands of dollars. In this episode, Lucas and Luna explore load shedding — the SRE technique of intentionally dropping non-critical requests to keep core systems running. They walk through how Google SREs used load shedding during the 2020 YouTube outage, how Stripe applies graceful degradation during payment surges, and w

How SRE Teams Use Feature Flags to Reduce Incident Risk Jun 4, 2026 11:00 Feature flags are a powerful tool for SREs, but they come with their own operational risks. In this episode, Lucas and Luna explore how companies like Etsy, Netflix, and LaunchDarkly use feature flags to decouple deployment from release, enabling canary rollouts, instant kill switches, and safer experimentation. They break down the difference between boolean flags, multivariate flags, and experime

How SRE Teams Use Incident Metrics to Reduce Mean Time to Resolve Jun 3, 2026 6:38 In episode 29 of The Site Reliability Podcast, Lucas and Luna dive into the specific metrics SRE teams use to reduce mean time to resolve (MTTR) during incidents. They break down the difference between mean time to acknowledge (MTTA) and MTTR, using real-world examples from companies like Google and Etsy. Lucas explains the concept of a 'rescue time' target—a hard limit on how long an incident can

How Cloud SREs Use Circuit Breakers to Prevent Cascading Failures Jun 3, 2026 14:03 When a single service fails, the whole system shouldn't collapse. In this episode, Lucas and Luna dive into the circuit breaker pattern — a critical resilience tool in site reliability engineering. They break down how Netflix's Hystrix inspired modern implementations, how companies like Amazon and Lyft use circuit breakers to isolate failures, and why a poorly tuned breaker can make an outage wors

How SREs Use Error Budgets to Balance Reliability and Velocity Jun 2, 2026 8:56 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical mechanics of error budgets — the SRE tool that lets teams trade reliability for feature velocity without breaking trust. They walk through a real example: a team running a service with a 99.9% SLO that has 0.1% error budget per month, and what happens when they burn through it by week two. Lucas explains how Go

How SRE Teams Use Game Days to Build Muscle Memory for Incidents Jun 2, 2026 8:13 In Episode 26 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams run 'game days' — simulated incident exercises — to build muscle memory and reduce panic during real outages. They break down how Etsy, a pioneer in game days, structures its exercises using realistic scenarios, mini-game design, and post-mortem debriefs without blame. The hosts discuss the difference between chaos

How SRE Teams Use Error Budgets to Balance Reliability and Velocity Jun 1, 2026 8:07 In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use error budgets to make smart trade-offs between reliability and feature velocity. They break down the concept with concrete examples from Google's original SRE model, showing how a 99.99% uptime target translates to 52.6 minutes of allowed downtime per year. The hosts discuss how error budgets empower teams to

SRE Runbooks That Actually Get Followed Jun 1, 2026 11:02 Most SRE teams have runbooks. Few have runbooks that engineers actually use in the middle of an incident. Lucas and Luna dive into why the typical runbook fails — too long, too vague, or written for the person who already knows the system. They break down what Google's internal SRE teams do differently: five-sentence maximum per procedure, explicit decision trees, and a 'runbook owners' workflow t

How SRE Teams Use Observability to Reduce Mean Time to Acknowledge May 31, 2026 8:30 Mean time to acknowledge (MTTA) is the clock that starts when an alert fires and stops when an engineer clicks 'ack'. For most teams, that gap is the single biggest waste of incident response time. In this episode, Lucas and Luna examine how Airbnb's SRE team cut their MTTA from 12 minutes to under 90 seconds by redesigning alert routing and escalation policies. They walk through the three-tier sy

How SRE Teams Use Synthetic Monitoring to Catch Outages First May 31, 2026 11:02 Episode 22 of The Site Reliability Podcast explores synthetic monitoring — proactive testing that catches outages before real users feel them. Lucas and Luna break down how companies like Etsy and Twilio simulate user journeys from multiple locations every minute, generating tens of thousands of transactions daily to validate critical flows. They discuss the difference between synthetic and real-u

How SRE Teams Use Traffic Shadowing for Safe Testing May 30, 2026 11:11 In this episode of The Site Reliability Podcast, Lucas and Luna explore traffic shadowing: a technique that lets SRE teams test new services with live production traffic without affecting real users. They break down how GitHub used shadowing to validate a new caching layer without risking customer data, and how Stripe employs it to test payment processing changes safely. Lucas explains the differe

How SRE Teams Use Canary Deployments to Reduce Blast Radius May 30, 2026 10:33 In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practice of canary deployments—a key strategy for reducing blast radius in production. They break down how teams like Etsy and Netflix use phased rollouts to catch issues early, with specific numbers: Etsy's Deployinator halved deployment failures after adopting canaries, and Netflix's Spinnaker pipeline automatically ro

How SRE Teams Use Data to Predict Incidents Before They Happen May 29, 2026 7:49 Most incident response is reactive—you get paged, you triage, you fix. But a growing number of SRE teams are flipping the model: using historical data, machine learning, and anomaly detection to predict incidents before they actually impact users. In this episode, Lucas and Luna explore how companies like Google, Datadog, and a major European bank are deploying predictive SRE. They break down the

How SRE Teams Use Capacity Planning to Prevent Black Friday Outages May 29, 2026 8:45 In this episode, Lucas and Luna explore how site reliability engineering teams use capacity planning to avoid catastrophic outages during peak traffic events like Black Friday and Cyber Monday. They break down the specific methodology used by major e-commerce platforms, including the concept of 'headroom targets' and 'traffic shaping' — techniques that go beyond simple auto-scaling. Lucas explains

How SRE Teams Use Service Level Objectives to Drive Business Decisions May 28, 2026 10:46 Lucas and Luna explore how service level objectives (SLOs) have evolved from a technical metric into a strategic business tool. Using examples from Google, Etsy, and a mid-size fintech startup, they show how SLOs help SRE teams align with product managers, trade reliability for feature velocity, and communicate risk in terms executives understand. The episode drills into the concept of 'SLO-based

How SRE Teams Use Toil Budgets to Prioritise Automation May 28, 2026 6:57 Episode 16 of The Site Reliability Podcast explores toil budgets: the SRE practice of capping manual, repetitive work so teams have time for automation. Lucas and Luna break down how Google defined toil in its SRE book, how a mid-size fintech used a 50% toil budget to reduce incident response time, and why tracking toil by hand feels ironic. They discuss a concrete case where one team freed up 30

How SRE Teams Handle On-Call Burnout Without Burning Out May 27, 2026 13:04 Episode 15 of The Site Reliability Podcast with Fexingo dives into the human side of site reliability engineering: on-call burnout. Lucas and Luna explore how teams at companies like Etsy and Honeycomb use structured rotations, incident-free shifts, and proactive 'time to recover' metrics to keep engineers fresh. They break down specific data—like the effect of 12-hour versus 7-day rotations on al

How SRE Teams Use Chaos Engineering for Non-Netflix Systems May 27, 2026 8:33 Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical database replication bug that would have caused a 45-minute outage during Black Friday. Lucas explains the difference between literal chaos—randomly killing s

How Microsoft SREs Automate Capacity Planning at Cloud Scale May 26, 2026 10:59 Episode 13 of The Site Reliability Podcast explores how Microsoft's SRE teams automate capacity planning to keep Azure running smoothly despite unpredictable demand. Lucas and Luna break down the three-layer approach — demand forecasting, headroom management, and autoscaling — and walk through a real case where a retail giant's Black Friday traffic spike was absorbed without a single incident. The

How GitHub SREs Run Postmortems Without Blame May 26, 2026 9:07 Episode 12 of The Site Reliability Podcast with Fexingo digs into GitHub's postmortem culture — specifically how their SRE team runs incident reviews that actually prevent recurrence without destroying psychological safety. Lucas and Luna walk through the five-part structure GitHub uses, the distinction between 'blame' and 'accountability,' and why writing a timeline before identifying causes chan

How Cloudflare Handles 46 Million Requests Per Second With SRE May 25, 2026 7:25 In this episode of The Site Reliability Podcast, Lucas and Luna dive into how Cloudflare's SRE team manages to process over 46 million HTTP requests per second across its global edge network. They explore the concept of 'edge of network' infrastructure, the role of anycast routing in distributing load, and how the team uses automated canary deployments to catch failures before they impact customer
