The Site Reliability Podcast with Fexingo: SRE, Uptime, and Production Engineering
Fexingo · 73 Episodes · Jul 4, 2026
Lucas and Luna cut through the noise around site reliability engineering to examine how real-world SRE teams balance uptime, incident response, and production change. Each episode takes a single concept — error budgets, toil automation, postmortem culture, capacity planning — and grounds it in a specific case: how a major streaming service reduced paging noise, how a payments platform rebuilt its incident command structure, or how a cloud provider manages multi-region failover. Lucas brings the numbers — latency percentiles, MTTR trends, SLO burn rates — while Luna pushes on the human and organizational trade-offs: What does a junior SRE need to know about on-call? How do you measure reliability without crushing innovation? Why do some blameless postmortems actually work? Together they treat SRE not as a certification topic but as a living practice, citing real outages, open-source tools, and engineering blogs. This show is for engineers, ops leads, and platform teams who already know the basics and want to debate the hard edges: Is 99.999% uptime always worth the cost? When should you deliberately degrade service to improve reliability? How do you design for resilience when your s
Episodes
How SRE Teams Use Error Budgets to Balance Reliability and Velocity · Jul 4, 2026 · 11:49 · In this episode, Lucas and Luna dive into the concept of error budgets, a cornerstone of Site Reliability Engineering that defines how much unreliability a team can tolerate while still meeting their Service Level Objectives. They explore how error budgets help SRE teams make data-driven trade-offs between shipping new features and maintaining system stability. Using examples from Google's original
How SRE Teams Use Incident Metrics to Improve Response · Jul 3, 2026 · 9:41 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the world of incident metrics: not just DORA or SLOs, but the specific numbers that help SRE teams get faster and better at incident response. They discuss mean time to acknowledge, mean time to resolve, and the controversial metric of mean time between failures, using real examples from a major cloud provider's 2023 outage
How SRE Teams Use Cost Optimization to Reduce Cloud Waste · Jul 3, 2026 · 8:36 · Episode 88 of The Site Reliability Podcast with Fexingo dives into how SRE teams can cut cloud costs without sacrificing reliability. Lucas and Luna discuss the rise of FinOps, the hidden waste in over-provisioned resources, and how Google, Netflix, and Airbnb use committed use discounts, spot instances, and right-sizing to save millions. Learn the concrete metrics, like cost per transaction and id
How SRE Teams Use Toil Budgets to Protect Engineering Time · Jul 2, 2026 · 11:31 · Episode 87 of The Site Reliability Podcast explores toil budgets, a practice Google SRE pioneered to cap repetitive, non-valuable operational work. Lucas and Luna break down why Google set a 50% toil limit, how to measure toil versus engineering, and why companies like Etsy and Netflix use toil budgets to protect innovation time. They also discuss common pitfalls: treating all toil equally and fo
How SRE Teams Use Structured Fails to Learn Faster · Jul 2, 2026 · 10:56 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams deliberately inject small, controlled failures into production not to break things but to build collective learning. They dissect the approach used by a major payments company that runs weekly 'structured fail' exercises where engineers intentionally trigger a known-category incident (latency spike, partial data
How SRE Teams Use Post-Incident Reviews for System Improvements · Jul 1, 2026 · 8:47 · In Episode 85 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams turn post-incident reviews into actionable system improvements. They focus on a real-world case: a major streaming service's 2023 outage caused by a cascading failure in their content delivery network. The hosts break down the review process, from timeline reconstruction to root cause analysis to implementing preve
How SRE Teams Use Capacity Planning to Prevent Outages · Jul 1, 2026 · 7:39 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into capacity planning for SRE teams, the proactive discipline that keeps systems running when traffic spikes. Using the example of a major streaming platform's 2024 holiday season surge, they break down how capacity planning differs from simple scaling, why it's part of reliability engineering, and how teams use traffic forecas
How SRE Teams Use Chaos Engineering to Build Resilient Systems · Jun 30, 2026 · 11:45 · Lucas and Luna dive into chaos engineering, using Netflix's Chaos Monkey and the Simian Army as the prime example. Lucas explains how Netflix intentionally broke its own systems in production to uncover weaknesses before they caused real outages, citing the tool's origin story from 2011 and its evolution into a formal discipline. Luna challenges the notion that chaos experiments are too risky for
How SRE Teams Use Cost of Delay to Prioritize Reliability Work · Jun 30, 2026 · 12:48 · Episode 82 of The Site Reliability Podcast examines how cost of delay, a concept borrowed from product development, helps SRE teams decide which reliability projects to tackle first. Lucas and Luna walk through a real example from a mid-sized fintech company that used cost of delay to justify migrating from a legacy database to a distributed SQL solution. The episode explains how to calculate co
How SRE Teams Use Latency Budgets to Meet Performance SLOs · Jun 29, 2026 · 9:06 · Lucas and Luna dive into latency budgets, a less-discussed SRE tool that maps acceptable delay across each microservice in a user request chain. They use the example of a social media app's photo upload feature: if the overall latency SLO is 500 milliseconds, the team allocates 50 ms to the auth service, 200 ms to the image processing service, and so on. Lucas explains how Google's internal SRE t
How SRE Teams Use Runbooks to Streamline Incident Response · Jun 29, 2026 · 13:37 · In episode 80 of The Site Reliability Podcast, Lucas and Luna dive into the practical world of runbooks, the step-by-step guides that SRE teams use to respond to incidents faster and more consistently. They explore how runbooks reduce cognitive load during high-stress outages, why documenting the 'why' behind each step prevents dangerous cargo-culting, and how a major streaming service cut its me
How SRE Teams Use Observability to Reduce Mean Time to Detect · Jun 28, 2026 · 8:56 · Episode 79 of The Site Reliability Podcast looks at how modern SRE teams are using observability tools to shrink mean time to detect, the gap between a system failure and the team knowing about it. Hosts Lucas and Luna break down why observability goes beyond traditional monitoring, using real-world examples like a major e-commerce platform that cut MTTD from 12 minutes to under 90 seconds by shi
How SRE Teams Use Service Level Agreements to Set Expectations · Jun 28, 2026 · 8:33 · Lucas and Luna dive into the often-overlooked difference between Service Level Agreements (SLAs) and Service Level Objectives (SLOs) in site reliability engineering. They explore how SLAs are not just legal documents but critical tools for managing stakeholder expectations, using a real-world case from a major cloud provider. The episode explains the 99.9% vs 99.99% uptime debate, the cost implica
How SRE Teams Use Canary Deployments to Reduce Risk · Jun 27, 2026 · 10:50 · Episode 77 of The Site Reliability Podcast dives into canary deployments: rolling out code changes gradually to a small subset of users before a full release. Lucas and Luna explain how companies like Netflix and Etsy use canary analysis to catch regressions early, using real traffic and metrics. They walk through the mechanics: routing a fraction of traffic, comparing key SLOs like latency and er
How SRE Teams Use DORA Metrics to Measure DevOps Performance · Jun 27, 2026 · 10:23 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into DORA metrics, the four key DevOps Research and Assessment measures that elite SRE teams use to quantify software delivery and operational performance. They break down each metric: deployment frequency, lead time for changes, mean time to restore (MTTR), and change failure rate. The hosts explain how Google's 2019 Accelerate
How SRE Teams Use Service Level Objectives to Drive Reliability · Jun 26, 2026 · 10:53 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical use of Service Level Objectives (SLOs) in site reliability engineering. They discuss how a major European bank reduced pager fatigue by 40% by shifting from alert-based monitoring to SLO-based error budgets. Lucas explains the difference between SLIs, SLOs, and SLAs, and why measuring user-facing latency is mor
How SRE Teams Use Blameless Culture to Improve Incident Response · Jun 26, 2026 · 8:26 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into how a blameless culture can actually improve incident response times and reduce recurrence. They explore a real case from a mid-size SaaS company that cut its mean time to resolution by 40 percent after adopting blameless postmortems. Lucas breaks down the psychological safety factors that make engineers more willing to shar
How SRE Teams Use Blameless Postmortems to Build Trust · Jun 25, 2026 · 8:26 · In Episode 73 of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems transform incident response culture. Using examples from a major e-commerce platform's 2024 database outage, they break down the difference between blame and accountability, explain why 'human error' is a shallow root cause, and share how one team cut repeat incidents by 40% just by rewiring their post-
How SRE Teams Use Fault Tree Analysis to Prevent Root Causes · Jun 25, 2026 · 11:49 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams apply fault tree analysis (FTA) from aerospace and nuclear engineering to reduce incident recurrence. Using a real 2025 outage at a major streaming platform where a cascading DNS failure took down services for 47 minutes, they break down the top-down logic of FTA, how it differs from postmortem 5 whys, and why te
How SRE Teams Use AI for Incident Triage and Root Cause Analysis · Jun 24, 2026 · 11:02 · Episode 71 of The Site Reliability Podcast with Fexingo dives into how SRE teams are applying large language models and AI assistants to accelerate incident triage and root cause analysis. Lucas and Luna examine a real case from a mid-sized e-commerce platform: after a database connection pool exhaustion caused a 14-minute partial outage, the on-call engineer used a locally-run AI tool to correlat
How SRE Teams Use Game Days to Test Incident Response · Jun 24, 2026 · 6:55 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practice of game days: structured simulations where SRE teams deliberately inject failures to test their incident response and on-call processes. They discuss a real example from a major streaming platform that runs quarterly game days to validate runbooks and reduce mean time to resolve from 45 minutes to under 15. The
How SRE Teams Use Error Budgets to Balance Reliability and Velocity · Jun 23, 2026 · 9:00 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how error budgets help SRE teams make data-driven trade-offs between reliability and feature velocity. Using Google’s original framework and a real-world example from a major e-commerce platform, they explain how setting a 99.9% SLO with a 0.1% error budget per quarter creates explicit room for innovation without risking catas
How SRE Teams Use Infrastructure as Code to Prevent Configuration Drift · Jun 23, 2026 · 11:03 · In Episode 68 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use infrastructure as code (IaC) to prevent configuration drift, the silent killer of production reliability. They break down a real incident at a mid-sized fintech company where a manual SSH change caused a partial outage, and how the team rebuilt their entire environment with Terraform and automated compliance c
How SRE Teams Use Incident Response Playbooks · Jun 22, 2026 · 7:54 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use incident response playbooks to standardize their reaction to common outages. They break down what makes a good playbook (specific, testable, and owned by a single team) using concrete examples like a Redis cluster failover and a database connection pool exhaustion. Lucas explains the difference between a playbo
How SRE Teams Use Readiness Checks to Prevent Bad Deployments · Jun 22, 2026 · 8:07 · Site reliability teams spend huge effort on monitoring and alerting, but some of the worst outages start the moment a deployment goes live. In this episode, Lucas and Luna break down how readiness checks, or health probes, act as the first line of defense against bad code reaching production. Using the example of a major Kubernetes rollout gone wrong at a large e-commerce company, they explain the
How SRE Teams Use Cost Attribution to Prioritize Reliability Work · Jun 21, 2026 · 8:54 · Episode 65 of The Site Reliability Podcast digs into a practical framework SRE teams use to tie infrastructure costs to specific services and teams. Lucas and Luna break down how cost attribution works, why it helps prioritize reliability investments, and a real example from a major streaming platform that saved millions by charging back observability costs to feature teams. Learn how to move from
How SRE Teams Use Toil Budgets to Automate Smarter · Jun 21, 2026 · 7:57 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams are using toil budgets to prioritize automation and reduce operational overhead. They dive into Google's original SRE definition of toil (manual, repetitive, non-value-added work) and explain how teams set toil budgets as a percentage of total engineering time, typically around 50 percent. The hosts discuss a real-
How SRE Teams Use Capacity Planning to Prevent Outages · Jun 20, 2026 · 9:21 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams are shifting from reactive scaling to proactive capacity planning. They dive into the story of a major streaming service that averted a disastrous holiday outage by using predictive models based on historical traffic patterns and user growth data. The hosts break down the key metrics, like requests per second, mem
How SRE Teams Use Capacity Planning to Prevent Outages · Jun 20, 2026 · 9:21 · In this episode of The Site Reliability Podcast, Lucas and Luna explore the art and science of capacity planning in SRE. They walk through a concrete case: how a major streaming platform used predictive modeling to avoid a holiday-season outage after underestimating user growth in a new market. Lucas breaks down the two main approaches (reactive vs. proactive planning) and explains why the best
SRE Teams Are Using Chaos Engineering to Test Resilience · Jun 19, 2026 · 10:56 · In Episode 61 of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into chaos engineering, the disciplined practice of intentionally injecting failures into production systems to uncover weaknesses before they cause real outages. They explore the origins at Netflix and the emergence of tools like Chaos Monkey, Litmus, and Gremlin. The hosts discuss how SRE teams at companies like Amazo
How SRE Teams Use Postmortem Action Items to Prevent Recurrence · Jun 19, 2026 · 8:15 · In Episode 60, Lucas and Luna dive into the most overlooked part of incident response: the postmortem action items that actually prevent the same outage from happening twice. They unpack a 2025 study from Google's SRE team that found 67% of postmortem action items are never completed, and explore why. Using concrete examples from a major AWS S3 outage and a Stripe payment-processing incident, they
How SRE Teams Use Incident Severity Classification to Prioritize Response · Jun 18, 2026 · 9:15 · Episode 59 of The Site Reliability Podcast explores how SRE teams classify incidents by severity to decide how fast to respond and who to page. Lucas and Luna break down real-world classification frameworks, from SEV-1 (service down, all hands on deck) to SEV-4 (minor hiccup, fix in the next sprint). They discuss why vague severity definitions lead to alert fatigue and slow response times, and ho
How SRE Teams Use Post-Incident Reviews as Learning Tools · Jun 18, 2026 · 9:25 · Episode 58 of The Site Reliability Podcast with Fexingo digs into post-incident reviews: not as blame sessions or compliance checkboxes, but as structured learning mechanisms. Lucas and Luna examine Google's seminal 2016 Titan key outage to illustrate how root cause analysis misses the point if teams don't ask 'why' five times. They discuss the difference between finding a 'root cause' and unders
How SRE Teams Use Cost of Delay to Prioritize Reliability Work · Jun 17, 2026 · 9:43 · Lucas and Luna explore how SRE teams at companies like Spotify and Etsy use 'cost of delay', a concept borrowed from product management, to quantify the business impact of reliability work. Lucas explains the math behind deferring a reliability project, using a real-world example: a payment-processing team deciding whether to fix a latency issue or build a new feature. Luna pushes back on the di
How SRE Teams Reduce Incident Noise with Intelligent Alert Routing · Jun 17, 2026 · 9:11 · Episode 56 of The Site Reliability Podcast explores how SRE teams at companies like Airbnb and Etsy use intelligent alert routing to slash incident noise by over 60 percent. Lucas and Luna break down the evolution from on-call pagers to modern event-driven routing, explain how machine learning models classify alerts by severity and team ownership, and discuss the trade-off between routing accuracy
How SRE Teams Use Incident Cost Analysis to Prioritize Reliability Investments · Jun 16, 2026 · 9:07 · Episode 55 of The Site Reliability Podcast with Fexingo dives into incident cost analysis, a growing practice at companies like Google and Stripe where SRE teams assign a dollar value to every outage minute. Lucas and Luna break down the methodology: how to quantify direct revenue loss, reputational damage, and opportunity cost from incidents, and how that data helps teams justify automation spen
How SRE Teams Use On-Call Compensation to Prevent Burnout · Jun 16, 2026 · 8:37 · Most SRE teams talk about incident response and automation, but fewer talk about the human side of on-call: how to pay people fairly for the disruption. Lucas and Luna dig into a 2025 survey of 500 SREs that found 62% feel on-call pay does not match the cognitive load. They compare models (flat stipend versus per-incident pay) and discuss how companies like Honeycomb and PagerDuty structure thei
SRE Teams Use SLO Burn Rate Alerts to Detect Incidents Faster · Jun 15, 2026 · 9:09 · Site reliability engineering has a well-known failure mode: your pager goes off at 2 AM for a minor blip, or worse, you don't get paged until a full-blown outage has already hit users. This episode explains SLO burn rate alerts, a concept that Google's SRE team refined in their 2016 book and which is now baked into tools like Google Cloud Monitoring, Datadog, and Grafana. Lucas and Luna walk thro
How SRE Teams Use Software Bill of Materials for Supply Chain Security · Jun 15, 2026 · 9:43 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the growing importance of the Software Bill of Materials (SBOM) for securing software supply chains. They use the 2024 XZ Utils backdoor as a concrete case study to explain how a single maintainer burnout led to a critical vulnerability that an SBOM could have caught earlier. Lucas breaks down what an SBOM is, how it works w
How SRE Teams Use Feature Flags to Reduce Deployment Risk · Jun 14, 2026 · 9:37 · In Episode 51 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use feature flags, not just for canary releases, but as a core tool to decouple deployment from release, reduce blast radius, and enable instant rollback without redeploying. They walk through a real incident at a major streaming company where a misconfigured flag caused a 47-minute partial outage, and how the team
How SRE Teams Use Stress Testing to Simulate Real Workloads · Jun 14, 2026 · 11:19 · Lucas and Luna explore how production stress testing goes beyond standard load testing to simulate realistic user behavior, with a deep dive into how a major streaming platform used session replay and gradual ramp-up to validate infrastructure before a global event. They unpack why stress testing must replicate authentication flows, API call patterns, and edge case traffic shapes, not just raw re
How SRE Teams Use Game Days to Build Incident Muscle Memory · Jun 13, 2026 · 8:46 · Lucas and Luna explore how site reliability engineering teams use game days (structured, simulated incident exercises) to prepare for real outages. They break down the approach used by a major fintech company that runs quarterly game days for its entire on-call rotation, with concrete scenarios like a simulated database failover and a DNS misconfiguration. The episode covers how game days differ
How SRE Teams Use Error Budgets to Align Risk and Velocity · Jun 13, 2026 · 8:48 · In episode 48 of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into error budgets, the SRE concept that turns reliability into a business decision rather than a purely technical one. They break down how Google originally defined error budgets via the Service Level Indicator (SLI) / Service Level Objective (SLO) / error budget framework, then explore how teams at companies like Sh
How SRE Teams Use SLIs to Define Reliability · Jun 12, 2026 · 7:12 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the often-overlooked first step of SRE practice: defining Service Level Indicators (SLIs). They explore how vague uptime percentages fail to capture user experience and walk through a concrete example from a major streaming platform that shifted from a 'five nines' target to a more granular SLI based on video start latency.
How SRE Teams Use Cognitive Load Management to Prevent Burnout · Jun 12, 2026 · 9:47 · Episode 46 of The Site Reliability Podcast with Fexingo dives into how SRE teams are applying cognitive load theory to reduce burnout and improve incident response. Lucas and Luna explore the concept of 'cognitive load' (the mental effort required to operate complex systems) and how teams at companies like Google and Netflix use techniques like toil reduction, documentation, and team topologies
How SRE Teams Use Observability to Find Unknown Unknowns · Jun 11, 2026 · 10:06 · Episode 45 of The Site Reliability Podcast digs into observability: how modern SRE teams go beyond monitoring to discover the 'unknown unknowns' that cause the worst outages. Lucas and Luna break down the difference between watching known metrics (CPU, memory) and exploring unknown failure modes with structured events and high-cardinality data. They walk through a real example: a major e-commerce p
How SRE Teams Use Dependency Graphs to Predict Outages · Jun 11, 2026 · 7:50 · In this episode of The Site Reliability Podcast, hosts Lucas and Luna explore how SRE teams at major tech companies build and maintain dependency graphs to predict cascading failures before they happen. Using concrete examples from cloud infrastructure and microservices architectures, they explain how graph-based service maps help teams identify single points of failure, model blast radius, and pr
How SRE Teams Use Toil Budgets to Prioritize Automation · Jun 10, 2026 · 9:03 · Episode 43 of The Site Reliability Podcast. Lucas and Luna explore how SRE teams are adopting 'toil budgets' (a concept inspired by error budgets) to cap the amount of manual, repetitive work engineers do each sprint. They break down Google's internal definition of toil (hands-on work with no enduring value), how a toil budget works alongside an error budget, and a concrete case from a mid-sized
How SRE Teams Use Service Level Objectives to Drive Daily Decisions · Jun 10, 2026 · 8:54 · This episode explores how Site Reliability Engineering teams use Service Level Objectives (SLOs) not just as a quarterly dashboard metric, but as a real-time decision-making tool that shapes pager rotations, deployment gating, and incident prioritization. Lucas walks through how Shopify's SRE team used a 99.95% availability SLO to flag a critical degradation before it became a full outage in 2025
How SRE Teams Use Canary Deployments to Reduce Release Risk · Jun 9, 2026 · 8:32 · Lucas and Luna dive into canary deployments: the practice of routing a small percentage of production traffic to a new version before rolling it out broadly. Lucas explains why Netflix's 'canary clusters' and Etsy's 'feature flipping' approach revolutionized how SRE teams think about release risk, and contrasts it with the old all-at-once deploys that caused major incidents. They discuss specific
How SRE Teams Use Chaos Engineering to Test Resilience · Jun 9, 2026 · 10:50 · In episode 40 of The Site Reliability Podcast, Lucas and Luna dive into chaos engineering, the practice of intentionally breaking systems to find weaknesses before real incidents strike. They explore how Netflix pioneered the approach with Chaos Monkey, the lessons SRE teams can learn from controlled failure experiments, and how to start small with simple game days that simulate a database partit
How SRE Teams Use Capacity Planning to Prevent Outages · Jun 8, 2026 · 10:19 · Episode 39 of The Site Reliability Podcast with Fexingo dives into capacity planning as a proactive SRE practice. Lucas and Luna explore how teams at companies like Google and Netflix use trend analysis, load testing, and headroom budgeting to avoid capacity-related outages. They discuss a real-world case from 2025 where a major streaming service averted a Super Bowl crash by scaling capacity week
How SRE Teams Use Immutable Infrastructure to Eliminate Configuration Drift · Jun 8, 2026 · 9:18 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use immutable infrastructure to eliminate configuration drift and improve reliability. They dive into a real case from Google's Borg paper, explaining how replacing mutable servers with golden images reduces incident rates and recovery times. The hosts break down the trade-offs with mutable servers, the role of i
How SRE Teams Use Auto-Remediation to Resolve Incidents Without Humans · Jun 7, 2026 · 12:29 · In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna explore how SRE teams are using auto-remediation to automatically resolve incidents without human intervention. They break down the anatomy of an auto-remediation pipeline (from monitoring alerts to automated runbook execution) using real-world examples like a major streaming service that reduced pager fatigue by 40 per
How SRE Teams Use Incident Command Systems to Coordinate Response · Jun 7, 2026 · 9:34 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the incident command system (ICS) model that large-scale SRE teams borrow from emergency services to manage complex outages. They walk through a real example: a major payment processing incident at a fintech company where a database migration triggered a cascading failure affecting three million users. Lucas explains the fou
How SRE Teams Use Blameless Postmortems to Build Better Systems · Jun 6, 2026 · 8:58 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how blameless postmortems go beyond simple incident analysis to drive real systemic improvements. Using the example of a major payment processor incident in early 2026, they break down the anatomy of an effective blameless postmortem: separating human error from system design flaws, writing actionable recommendations, and trac
How SRE Teams Use Postmortems That Actually Change Behavior · Jun 6, 2026 · 8:17 · In this episode of The Site Reliability Podcast, Lucas and Luna dig into the one incident-documentation practice most teams get wrong: the postmortem. Most postmortems are filed and forgotten. Lucas walks through how Google's SRE team shifted from blame-free to action-oriented postmortems, using a concrete example from their own 2017 Gmail outage. He breaks down the difference between a cause and
How SRE Teams Use Runbook Automation to Reduce Human Error · Jun 5, 2026 · 8:14 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical side of runbook automation, moving beyond static documentation to executable, automated responses. They explore how companies like Google and Netflix use runbook automation to reduce mean time to repair by up to 60%, and discuss the common pitfalls: over-automation, stale runbooks, and the tension between spee
How SRE Teams Use Cost Optimization to Balance Performance and Budget · Jun 5, 2026 · 6:48 · In this episode of The Site Reliability Podcast with Fexingo, Lucas and Luna dive into the often-overlooked intersection of site reliability engineering and cloud cost optimization. They explore how SRE teams at companies like Uber and Airbnb use techniques such as right-sizing instances, leveraging spot instances, and implementing autoscaling policies to reduce infrastructure spend without sacrif
How SRE Teams Use Load Shedding to Survive Traffic Spikes · Jun 4, 2026 · 9:51 · When a massive traffic spike hits, every millisecond of latency can cost thousands of dollars. In this episode, Lucas and Luna explore load shedding, the SRE technique of intentionally dropping non-critical requests to keep core systems running. They walk through how Google SREs used load shedding during the 2020 YouTube outage, how Stripe applies graceful degradation during payment surges, and w
How SRE Teams Use Feature Flags to Reduce Incident RiskJun 4, 202611:00Feature flags are a powerful tool for SREs, but they come with their own operational risks. In this episode, Lucas and Luna explore how companies like Etsy, Netflix, and LaunchDarkly use feature flags to decouple deployment from release, enabling canary rollouts, instant kill switches, and safer experimentation. They break down the difference between boolean flags, multivariate flags, and experime
How SRE Teams Use Incident Metrics to Reduce Mean Time to ResolveJun 3, 20266:38In episode 29 of The Site Reliability Podcast, Lucas and Luna dive into the specific metrics SRE teams use to reduce mean time to resolve (MTTR) during incidents. They break down the difference between mean time to acknowledge (MTTA) and MTTR, using real-world examples from companies like Google and Etsy. Lucas explains the concept of a 'rescue time' target—a hard limit on how long an incident can
How Cloud SREs Use Circuit Breakers to Prevent Cascading FailuresJun 3, 202614:03When a single service fails, the whole system shouldn't collapse. In this episode, Lucas and Luna dive into the circuit breaker pattern — a critical resilience tool in site reliability engineering. They break down how Netflix's Hystrix inspired modern implementations, how companies like Amazon and Lyft use circuit breakers to isolate failures, and why a poorly tuned breaker can make an outage wors
How SREs Use Error Budgets to Balance Reliability and VelocityJun 2, 20268:56In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practical mechanics of error budgets — the SRE tool that lets teams trade reliability for feature velocity without breaking trust. They walk through a real example: a team running a service with a 99.9% SLO that has 0.1% error budget per month, and what happens when they burn through it by week two. Lucas explains how Go
How SRE Teams Use Game Days to Build Muscle Memory for IncidentsJun 2, 20268:13In Episode 26 of The Site Reliability Podcast, Lucas and Luna explore how SRE teams run 'game days' — simulated incident exercises — to build muscle memory and reduce panic during real outages. They break down how Etsy, a pioneer in game days, structures its exercises using realistic scenarios, mini-game design, and post-mortem debriefs without blame. The hosts discuss the difference between chaos
How SRE Teams Use Error Budgets to Balance Reliability and Velocity · Jun 1, 2026 · 8:07 · In this episode of The Site Reliability Podcast, Lucas and Luna explore how SRE teams use error budgets to make smart trade-offs between reliability and feature velocity. They break down the concept with concrete examples from Google's original SRE model, showing how a 99.99% uptime target translates to 52.6 minutes of allowed downtime per year. The hosts discuss how error budgets empower teams to
SRE Runbooks That Actually Get Followed · Jun 1, 2026 · 11:02 · Most SRE teams have runbooks. Few have runbooks that engineers actually use in the middle of an incident. Lucas and Luna dive into why the typical runbook fails: too long, too vague, or written for the person who already knows the system. They break down what Google's internal SRE teams do differently: five-sentence maximum per procedure, explicit decision trees, and a 'runbook owners' workflow t
How SRE Teams Use Observability to Reduce Mean Time to Acknowledge · May 31, 2026 · 8:30 · Mean time to acknowledge (MTTA) is the clock that starts when an alert fires and stops when an engineer clicks 'ack'. For most teams, that gap is the single biggest waste of incident response time. In this episode, Lucas and Luna examine how Airbnb's SRE team cut their MTTA from 12 minutes to under 90 seconds by redesigning alert routing and escalation policies. They walk through the three-tier sy
How SRE Teams Use Synthetic Monitoring to Catch Outages First · May 31, 2026 · 11:02 · Episode 22 of The Site Reliability Podcast explores synthetic monitoring, proactive testing that catches outages before real users feel them. Lucas and Luna break down how companies like Etsy and Twilio simulate user journeys from multiple locations every minute, generating tens of thousands of transactions daily to validate critical flows. They discuss the difference between synthetic and real-u
How SRE Teams Use Traffic Shadowing for Safe Testing · May 30, 2026 · 11:11 · In this episode of The Site Reliability Podcast, Lucas and Luna explore traffic shadowing: a technique that lets SRE teams test new services with live production traffic without affecting real users. They break down how GitHub used shadowing to validate a new caching layer without risking customer data, and how Stripe employs it to test payment processing changes safely. Lucas explains the differe
How SRE Teams Use Canary Deployments to Reduce Blast Radius · May 30, 2026 · 10:33 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into the practice of canary deployments, a key strategy for reducing blast radius in production. They break down how teams like Etsy and Netflix use phased rollouts to catch issues early, with specific numbers: Etsy's Deployinator halved deployment failures after adopting canaries, and Netflix's Spinnaker pipeline automatically ro
How SRE Teams Use Data to Predict Incidents Before They Happen · May 29, 2026 · 7:49 · Most incident response is reactive: you get paged, you triage, you fix. But a growing number of SRE teams are flipping the model: using historical data, machine learning, and anomaly detection to predict incidents before they actually impact users. In this episode, Lucas and Luna explore how companies like Google, Datadog, and a major European bank are deploying predictive SRE. They break down the
How SRE Teams Use Capacity Planning to Prevent Black Friday Outages · May 29, 2026 · 8:45 · In this episode, Lucas and Luna explore how site reliability engineering teams use capacity planning to avoid catastrophic outages during peak traffic events like Black Friday and Cyber Monday. They break down the specific methodology used by major e-commerce platforms, including the concept of 'headroom targets' and 'traffic shaping', techniques that go beyond simple auto-scaling. Lucas explains
How SRE Teams Use Service Level Objectives to Drive Business Decisions · May 28, 2026 · 10:46 · Lucas and Luna explore how service level objectives (SLOs) have evolved from a technical metric into a strategic business tool. Using examples from Google, Etsy, and a mid-size fintech startup, they show how SLOs help SRE teams align with product managers, trade reliability for feature velocity, and communicate risk in terms executives understand. The episode drills into the concept of 'SLO-based
How SRE Teams Use Toil Budgets to Prioritise Automation · May 28, 2026 · 6:57 · Episode 16 of The Site Reliability Podcast explores toil budgets: the SRE practice of capping manual, repetitive work so teams have time for automation. Lucas and Luna break down how Google defined toil in its SRE book, how a mid-size fintech used a 50% toil budget to reduce incident response time, and why tracking toil by hand feels ironic. They discuss a concrete case where one team freed up 30
How SRE Teams Handle On-Call Burnout Without Burning Out · May 27, 2026 · 13:04 · Episode 15 of The Site Reliability Podcast with Fexingo dives into the human side of site reliability engineering: on-call burnout. Lucas and Luna explore how teams at companies like Etsy and Honeycomb use structured rotations, incident-free shifts, and proactive 'time to recover' metrics to keep engineers fresh. They break down specific data, like the effect of 12-hour versus 7-day rotations on al
How SRE Teams Use Chaos Engineering for Non-Netflix Systems · May 27, 2026 · 8:33 · Lucas and Luna explore how site reliability engineers adapt chaos engineering beyond Netflix's famous Simian Army. The episode focuses on a mid-size e-commerce company, BlinkMart, which used controlled failure injection to uncover a critical database replication bug that would have caused a 45-minute outage during Black Friday. Lucas explains the difference between literal chaos, randomly killing s
How Microsoft SREs Automate Capacity Planning at Cloud Scale · May 26, 2026 · 10:59 · Episode 13 of The Site Reliability Podcast explores how Microsoft's SRE teams automate capacity planning to keep Azure running smoothly despite unpredictable demand. Lucas and Luna break down the three-layer approach (demand forecasting, headroom management, and autoscaling) and walk through a real case where a retail giant's Black Friday traffic spike was absorbed without a single incident. The
How GitHub SREs Run Postmortems Without Blame · May 26, 2026 · 9:07 · Episode 12 of The Site Reliability Podcast with Fexingo digs into GitHub's postmortem culture, specifically how their SRE team runs incident reviews that actually prevent recurrence without destroying psychological safety. Lucas and Luna walk through the five-part structure GitHub uses, the distinction between 'blame' and 'accountability,' and why writing a timeline before identifying causes chan
How Cloudflare Handles 46 Million Requests Per Second With SRE · May 25, 2026 · 7:25 · In this episode of The Site Reliability Podcast, Lucas and Luna dive into how Cloudflare's SRE team manages to process over 46 million HTTP requests per second across its global edge network. They explore the concept of 'edge of network' infrastructure, the role of anycast routing in distributing load, and how the team uses automated canary deployments to catch failures before they impact customer