SRE Blameless Post-Mortem Exam
Cloud Service Outage: Post-Mortem Scenario

Incident Overview: Major Cloud Service Outage

On September 15, 2023, from approximately 14:00 UTC to 17:00 UTC, a critical user-facing cloud service experienced a severe outage, rendering it largely unavailable to users globally. The incident began with users reporting slow responses and intermittent 503 Service Unavailable errors. Within 15 minutes, the service became completely unresponsive for a significant percentage of users. Internal alerts were triggered for high error rates and increased latency across multiple dependent services.

Simulated Monitoring Data Analysis (14:00 UTC - 17:30 UTC):

  • Service Latency (User-facing and Internal API calls): At 14:00 UTC, latency for user-facing requests and internal API calls to the affected service began to climb sharply from an average of 50ms to over 5,000ms within 10 minutes. By 14:15 UTC, latency graphs flatlined for many endpoints, indicating complete unresponsiveness or timeouts. Post-restoration (around 17:00 UTC), latency returned to normal baselines gradually over 30 minutes.
  • Error Rates (HTTP 5xx errors): A dramatic surge in HTTP 503 and 504 errors was observed starting at 14:00 UTC, peaking at nearly 100% of all requests by 14:10 UTC. Other 5xx errors (e.g., 500) saw a minor, secondary increase but were not the primary error code. Error rates dropped sharply after 17:00 UTC.
  • Application Pod CPU/Memory Utilization: Prior to 14:00 UTC, CPU and memory utilization for the affected service's application pods were stable and within normal operating ranges. At 14:00 UTC, approximately 30% of pods in the primary deployment (specifically, my-app-api) began exhibiting erratic CPU spikes to 100%, followed by rapid restarts. Concurrently, the number of 'Ready' pods for this deployment declined from 50 to fewer than 5 within 20 minutes, with corresponding increases in the 'Pending' and 'Terminating' states. The remaining 70% of pods showed increasing memory pressure but no consistent CPU spikes, likely due to request queuing.
  • Kubernetes Control Plane Metrics: At 14:00 UTC, a significant increase in kube-apiserver latency was observed, particularly for GET /pods and PUT /deployments requests, which jumped from under 50ms to over 500ms. The kube-scheduler queue depth for my-app-api pods increased sharply, indicating delays in scheduling new pods. kube-controller-manager reconciliation errors for the my-app-api deployment were also elevated, and newly created pods frequently reported ImagePullBackOff or CrashLoopBackOff despite the images being present and accessible. Overall, kubelets reported a large number of my-app-api pods as NotReady. (An illustrative readiness check follows this list.)
  • Database Connections/Query Latency: Database connection counts for the primary application database remained stable throughout the incident. Query latency showed a slight, secondary increase between 14:15 UTC and 16:00 UTC, but no critical saturation or direct failure was observed at the database layer. Database metrics returned to baseline as the application recovered.
  • DNS Query Latency: DNS resolution times for internal and external services remained consistently stable throughout the incident, with no anomalous spikes or failures reported by DNS monitoring systems.
  • Network I/O: While aggregate network traffic to the affected service dropped as users were unable to connect, core network infrastructure metrics (e.g., router CPU, port errors, packet loss within the VPC) remained stable and within normal operating parameters, giving no indication that network congestion or failure was the primary cause.
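
For grounding, the sketch below shows one way an SRE might verify the pod-level symptoms described above (the falling 'Ready' count and the CrashLoopBackOff / ImagePullBackOff waiting reasons) using the official Kubernetes Python client. This is a minimal sketch, not part of the scenario data: the "production" namespace and the app=my-app-api label selector are assumptions.

    # Minimal sketch: count Ready pods and surface container waiting reasons
    # for the my-app-api workload. Namespace and label selector are assumed.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()

    pods = core.list_namespaced_pod(
        namespace="production",           # hypothetical namespace
        label_selector="app=my-app-api",  # hypothetical label
    )

    ready = 0
    for pod in pods.items:
        conditions = pod.status.conditions or []
        if any(c.type == "Ready" and c.status == "True" for c in conditions):
            ready += 1
        for cs in pod.status.container_statuses or []:
            if cs.state and cs.state.waiting:
                # During the incident this would print CrashLoopBackOff /
                # ImagePullBackOff reasons for the failing pods.
                print(f"{pod.metadata.name}: waiting ({cs.state.waiting.reason})")

    print(f"Ready pods: {ready}/{len(pods.items)}")

Run periodically or fed into a dashboard, a check like this would have shown the Ready count falling from 50 toward 5 during the 14:00-14:20 UTC window.
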
Exam Questions:

1. Based on the provided simulated monitoring data, summarize the key observations and anomalous behaviors you identify during the outage timeframe. (10 points)

2. Given the incident scenario, your analysis of the monitoring data, and the following potential culprits: (a) a Kubernetes misconfiguration, (b) database overload, and (c) a DNS issue, which is the most likely root cause of this three-hour outage? Justify your choice thoroughly, explicitly referencing the simulated monitoring data. Explain why the other two options are less likely to be the primary root cause. (20 points)

3. Describe the specific type of Kubernetes misconfiguration you hypothesize led to the outage. Provide a plausible sequence of events (e.g., a specific deployment change, a resource limit misconfiguration, or a network policy error) that could have produced the observed symptoms and monitoring data. (20 points)
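
(Illustrative aid, not part of the required answer.) One way to test a resource-limit or rollout-related hypothesis is to compare the live Deployment spec against its recent revisions. The sketch below uses the Kubernetes Python client; the namespace, label selector, and object names are hypothetical.

    # Minimal sketch: compare container resource limits across recent
    # Deployment revisions (each revision is backed by a ReplicaSet).
    # Namespace, label selector, and names are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    dep = apps.read_namespaced_deployment(name="my-app-api", namespace="production")
    for c in dep.spec.template.spec.containers:
        print(f"current {c.name}: requests={c.resources.requests} limits={c.resources.limits}")

    replica_sets = apps.list_namespaced_replica_set(
        namespace="production", label_selector="app=my-app-api"
    ).items
    for rs in sorted(
        replica_sets,
        key=lambda r: int((r.metadata.annotations or {}).get(
            "deployment.kubernetes.io/revision", "0")),
    ):
        rev = (rs.metadata.annotations or {}).get("deployment.kubernetes.io/revision")
        limits = [c.resources.limits for c in rs.spec.template.spec.containers]
        print(f"revision {rev}: limits={limits}")

A sudden change in limits (or a new image tag) between the last two revisions would be strong evidence for the deployment-change hypothesis.
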

4. Assuming your identified root cause is correct, what immediate mitigation steps would your SRE team likely have taken during the outage to restore service and minimize impact? (10 points)
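
(Illustrative aid, not part of the required answer.) If the chosen mitigation were rolling my-app-api back to the last known-good image, one way to express that with the Kubernetes Python client is sketched below; in practice most teams would use their deployment tooling or kubectl rollout undo. The image reference, container name, and namespace are hypothetical.

    # Minimal sketch: strategic-merge patch rolling the Deployment back to a
    # last-known-good image. Image, container name, and namespace are assumed.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    apps.patch_namespaced_deployment(
        name="my-app-api",
        namespace="production",
        body={
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "my-app-api",  # merge key: matches the existing container
                                "image": "registry.example.com/my-app-api:known-good",
                            }
                        ]
                    }
                }
            }
        },
    )
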

5. Propose 3-5 actionable short-term remediation items to prevent a recurrence of this specific issue. (5 points)

6. Propose 2-3 actionable long-term remediation items that systemically improve overall service resilience and observability. (5 points)

7. How would you, as the SRE Lead, ensure this post-mortem process adheres to blameless principles? What steps would you take to foster a culture of learning and systemic improvement rather than individual blame? (10 points)
