When a distributed system starts throwing errors, the pressure to react fast can lead you straight into two classic traps: chasing symptoms instead of causes, and treating every incident as a one-off. This guide from Northpoint's Distributed Systems Triage blog breaks down why these traps are so seductive, how they manifest in real-world outages, and—most importantly—how to route around them.
We walk through the 'firefighting loop' that keeps teams stuck, the difference between monitoring and observability, and why your runbooks might be making things worse. You'll learn concrete triage patterns that reduce mean time to resolution without burning out your on-call engineers, plus anti-patterns that look like progress but aren't. The article also covers when triage-driven fixes are the wrong call, maintenance drift that erodes your response playbooks, and a set of open questions to help your team audit its incident response maturity.
Whether you're a site reliability engineer, a platform lead, or a developer who gets paged at 3 AM, this field guide gives you a repeatable approach to cut through the noise and fix what actually broke.
1. The Firefighting Loop: Why Triage Traps Feel Productive
Imagine you're on call for a microservices platform. Alerts start piling up: payment latency is spiking, user sessions are dropping, and a downstream cache cluster is reporting errors. Your instinct is to jump on the most visible symptom—maybe restart the cache, scale up the payment service, or roll back a recent deploy. That fixes the immediate pain, but an hour later a new alert fires in a different service. Sound familiar?
This is the firefighting loop: a cycle of reactive fixes that address surface-level issues without resolving underlying faults. The loop feels productive because each action produces a short-term improvement. But over weeks and months, the system accumulates technical debt, on-call fatigue rises, and the same incidents recur with different labels. The two most common triage traps—symptom chasing and treating incidents as isolated events—are both symptoms of this loop. Let's look at each in detail.
Trap 1: Symptom Chasing
Symptom chasing happens when you fix what's noisy rather than what's broken. For example, a high error rate in service A might be caused by a misconfigured timeout in service B, but restarting service A reduces the error rate temporarily. The real problem—the timeout—remains, and it will manifest again under load. Teams fall into this trap because symptoms are easier to see than causes. A dashboard full of red metrics demands action, and restarting a container feels decisive.
The cost is insidious: each symptom fix adds operational overhead (new runbook steps, temporary config changes) that makes the system harder to understand. Over time, the gap between what you observe and what's actually happening widens. Northpoint's approach is to resist the urge to act on the first alert. Instead, we pause to ask: "What else changed in the last 30 minutes?" Often, the root cause is a recent deploy, a config push, or a dependency degradation—not the service that's currently on fire.
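One way to make the "what else changed?" question concrete is to filter a change log by time window. The sketch below is illustrative only: the ChangeEvent record, its fields, and the sample events are assumptions, not a description of any specific Northpoint tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    """A deploy, config push, or dependency change (hypothetical record shape)."""
    at: datetime
    kind: str     # e.g. "deploy", "config", "dependency"
    target: str   # service or component that changed
    summary: str


def recent_changes(events: List[ChangeEvent], alert_time: datetime,
                   window: timedelta = timedelta(minutes=30)) -> List[ChangeEvent]:
    """Return changes that landed in the window before the alert, newest first."""
    start = alert_time - window
    hits = [e for e in events if start <= e.at <= alert_time]
    return sorted(hits, key=lambda e: e.at, reverse=True)


# Sample data, invented for illustration.
alert_time = datetime(2024, 1, 1, 3, 0)
events = [
    ChangeEvent(datetime(2024, 1, 1, 1, 15), "deploy", "cart", "routine rollout"),
    ChangeEvent(datetime(2024, 1, 1, 2, 40), "config", "service-b", "timeout lowered"),
    ChangeEvent(datetime(2024, 1, 1, 2, 55), "deploy", "service-a", "new release"),
]

suspects = recent_changes(events, alert_time)
for e in suspects:
    print(f"{e.at:%H:%M} {e.kind:<8} {e.target}: {e.summary}")
```

The list is a set of candidates to check, not a verdict: a change inside the window is only correlated with the alert until a trace or log confirms the link.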
Trap 2: Treating Every Incident as a One-Off
The second trap is the belief that each incident is unique. When you treat every page as a novel event, you start each investigation from scratch. You don't update runbooks, you don't look for patterns across incidents, and you don't build automation to prevent recurrence. This approach is exhausting and slow. It also leads to "tribal knowledge"—the only person who remembers how to fix a recurring issue is the engineer who happened to be on call the last three times.
A classic example: a team I read about hit database connection pool exhaustion every two weeks. Each time, the on-call engineer restarted the application servers, which freed connections and resolved the alert. But nobody asked why the pool was exhausting. After six months, an SRE finally traced it to a batch job that opened connections without closing them. The fix was a one-line code change. Treating each exhaustion as a one-off cost the team roughly a dozen late-night pages and many hours of repeated investigation.
To route around this trap, Northpoint recommends a "post-incident pattern review" every two weeks. Look at the last 10 incidents and ask: Do any share a common cause? Are there alerts that always precede others? This shifts your team from reactive to pattern-aware.
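A pattern review can start as a simple tally. The sketch below assumes each incident has been given a short cause label during its review (None when the cause was never found); the incident IDs and labels are invented for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (incident id, cause label assigned at review time, or None if unknown)
incidents: List[Tuple[str, Optional[str]]] = [
    ("INC-101", "connection pool exhaustion"),
    ("INC-102", "bad deploy"),
    ("INC-103", None),
    ("INC-104", "connection pool exhaustion"),
    ("INC-105", "expired certificate"),
    ("INC-106", "bad deploy"),
    ("INC-107", None),
    ("INC-108", "connection pool exhaustion"),
    ("INC-109", "dependency timeout"),
    ("INC-110", None),
]


def recurring_causes(items, min_count: int = 2):
    """Return (cause, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(cause for _, cause in items if cause is not None)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


def unknown_count(items) -> int:
    """Incidents closed without an identified cause."""
    return sum(1 for _, cause in items if cause is None)


for cause, n in recurring_causes(incidents):
    print(f"{n}x {cause}")
print(f"{unknown_count(incidents)} incidents closed without a known cause")
```

The count of unknown causes is worth tracking alongside the repeats, since those are the incidents most likely to come back under a different label.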
2. Foundations Readers Confuse: Monitoring vs. Observability
One reason teams fall into triage traps is confusion about what their tooling actually provides. Monitoring tells you something is wrong; observability lets you understand why. Many teams think they have observability because they have dashboards and alerts, but they're really just monitoring at scale. The distinction matters because it shapes your triage strategy.
Monitoring is about known unknowns: you set thresholds for CPU, memory, latency, error rate, and you get alerts when those thresholds are breached. That's useful, but it only covers what you thought to measure. In distributed systems, the most damaging failures are often emergent—they result from interactions between components that no single metric captures. Observability, by contrast, is about unknown unknowns: it means you can ask arbitrary questions about your system's state without having predicted the question in advance. This requires structured logging, distributed tracing, and high-cardinality metrics.
Why Teams Mistake Monitoring for Observability
A typical scenario: a team invests in a monitoring platform, configures dashboards for all their services, and sets up alerts for p99 latency and error rate. They feel prepared. Then a new feature deploys that changes the data flow between services. The dashboards still show green metrics for each individual service, but end-to-end latency doubles. The team has no way to trace the request path, so they can't isolate the bottleneck. They restart services randomly, hoping something sticks. This is monitoring without observability.
The fix is to instrument your system with distributed tracing from the start. Tools like OpenTelemetry let you propagate trace context across service boundaries, so you can follow a single request through the entire chain. When an alert fires, you open a trace view and see exactly which hop added the most latency. That turns triage from guesswork into directed investigation.
Practical Steps to Bridge the Gap
If you're currently relying on monitoring alone, here are three steps to move toward observability:
- Add trace context to all inter-service calls. This is mostly an up-front engineering investment, and it often pays for itself the first time you debug a cross-service incident.
- Define service-level objectives (SLOs) based on user-facing behavior, not infrastructure metrics. An SLO like "99% of checkout requests complete in under 2 seconds" is more actionable than "CPU < 80%".
- Practice "exploratory debugging" during low-severity incidents. Use the same tools you'd use in a crisis—traces, logs, metrics—to understand what happened. This builds muscle memory for when it matters.
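To illustrate the SLO step above, here is a minimal sketch that checks the example objective ("99% of checkout requests complete in under 2 seconds") against a list of request durations. The function names and the sample durations are assumptions made for the example; a real check would read from your metrics store.

```python
from typing import Sequence


def slo_compliance(durations_s: Sequence[float], threshold_s: float = 2.0) -> float:
    """Fraction of requests that completed in under threshold_s seconds."""
    if not durations_s:
        raise ValueError("no requests in the evaluation window")
    good = sum(1 for d in durations_s if d < threshold_s)
    return good / len(durations_s)


def meets_slo(durations_s: Sequence[float], target: float = 0.99,
              threshold_s: float = 2.0) -> bool:
    """True when the observed compliance is at or above the target."""
    return slo_compliance(durations_s, threshold_s) >= target


# Invented sample: 197 fast checkouts and 3 slow ones.
sample = [0.4] * 197 + [3.5] * 3

print(f"compliance: {slo_compliance(sample):.3f}")
print("SLO met" if meets_slo(sample) else "SLO missed")
```

Note that every host in this scenario could still report CPU below 80%, which is why the user-facing objective is the more actionable signal.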
3. Patterns That Usually Work: Structured Triage in Practice
Once you've avoided the two traps and built a solid observability foundation, you need a repeatable triage process. The following patterns have proven effective across many distributed systems teams.
The Triage Tree
A triage tree is a decision tree that guides an on-call engineer from alert to root cause. It starts with broad categories (e.g., "Is the issue in our code, our infrastructure, or a dependency?") and narrows down based on observable signals. For example:
- Is latency high for all users, or only a subset? (If subset, look for regional or tenant-specific causes.)
- Did the issue start after a recent deploy? (If yes, roll back and investigate.)
- Are related services also degraded? (If yes, look for a shared dependency like a database or message queue.)
The key is that the tree is written down and maintained. It lives in your runbook repository and gets updated after every incident. Over time, the tree becomes a map of your system's failure modes.
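A triage tree kept in a repository can be plain data plus a small walker. The sketch below encodes the three example questions; the signal names, the ordering of the questions, and the final "escalate" leaf are assumptions added so the example is complete.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Action:
    text: str


@dataclass
class Question:
    text: str
    signal: str   # key in the observed-signals dict (hypothetical names)
    yes: "Node"
    no: "Node"


Node = Union[Question, Action]

TREE = Question(
    "Did the issue start after a recent deploy?", "recent_deploy",
    yes=Action("Roll back, then investigate the deploy."),
    no=Question(
        "Is latency high for only a subset of users?", "subset_only",
        yes=Action("Look for regional or tenant-specific causes."),
        no=Question(
            "Are related services also degraded?", "related_degraded",
            yes=Action("Check shared dependencies (database, message queue)."),
            no=Action("No known pattern matched: escalate and extend the tree."),
        ),
    ),
)


def walk(node: Node, signals: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Follow yes/no answers from the root to an action; return it with the path taken."""
    path = []
    while isinstance(node, Question):
        answer = signals[node.signal]
        path.append(f"{node.text} -> {'yes' if answer else 'no'}")
        node = node.yes if answer else node.no
    return node.text, path


action, path = walk(TREE, {"recent_deploy": False, "subset_only": False,
                           "related_degraded": True})
print("\n".join(path))
print("Next step:", action)
```

Returning the path as well as the action gives the post-incident review a record of which branch was taken, which makes it easier to see where the tree needs a new question.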
The 5-Second Rule
When an alert fires, the first thing to do is nothing for five seconds. Use that time to read the alert description, check if it's a known pattern, and decide whether to escalate. This simple pause prevents the knee-jerk restart that often masks the real problem. Northpoint enforces this rule by requiring on-call engineers to acknowledge the alert in the chat channel before taking any action. The acknowledgment message includes a one-sentence hypothesis: "I suspect the payment service is slow because the database is under load." This forces a moment of thought.
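The acknowledgment step can be sketched as a small record that refuses an empty hypothesis and later stores whether the hypothesis held up. The Ack type, its fields, and the sample alert IDs are hypothetical; in practice this would be wired into whatever chat or paging tool the team uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ack:
    alert_id: str
    engineer: str
    hypothesis: str
    matched_root_cause: Optional[bool] = None  # filled in after the review


def acknowledge(alert_id: str, engineer: str, hypothesis: str) -> Ack:
    """Record an acknowledgment; reject it if no hypothesis was written."""
    if not hypothesis.strip():
        raise ValueError("write a one-sentence hypothesis before acting")
    return Ack(alert_id, engineer, hypothesis.strip())


def hit_rate(acks: List[Ack]) -> Optional[float]:
    """Share of reviewed acknowledgments whose hypothesis matched the root cause."""
    reviewed = [a for a in acks if a.matched_root_cause is not None]
    if not reviewed:
        return None
    return sum(1 for a in reviewed if a.matched_root_cause) / len(reviewed)


first = acknowledge("ALERT-1", "sam",
                    "I suspect the payment service is slow because the database is under load.")
second = acknowledge("ALERT-2", "sam", "I suspect the cache cluster lost a node.")
first.matched_root_cause = True
second.matched_root_cause = False

print(f"hypothesis hit rate: {hit_rate([first, second]):.0%}")
```

A low hit rate is not a failure of the engineer; it is a hint about which parts of the system are hardest to reason about from the alerts alone.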
Blameless Post-Incident Reviews
After an incident is resolved, hold a blameless review within 48 hours. Focus on what the system did, not who did what. The goal is to identify contributing factors—changes, configs, missing tests—and create action items to prevent recurrence. Blameless reviews are not about assigning fault; they're about improving the system. Teams that skip this step are more likely to repeat the same mistakes.
4. Anti-Patterns and Why Teams Revert
Even with good intentions, teams often slip back into reactive triage. Here are the most common anti-patterns and why they're so tempting.
The "Hero" On-Call Engineer
Some teams rely on a single engineer who knows the system inside out. When that person is on call, incidents are resolved quickly. When they're not, resolution takes far longer. This pattern is unsustainable: the hero engineer burns out, and the rest of the team never develops deep system knowledge. The fix is to rotate on-call duties and require runbook updates for every new failure mode.
Runbooks That Are Never Updated
A runbook that's written once and never revised becomes a liability. It may contain outdated commands, wrong IP addresses, or steps that no longer apply. On-call engineers quickly learn to ignore the runbook and rely on tribal knowledge. To prevent this, Northpoint recommends a "runbook health check" every quarter: review each runbook for accuracy, test the steps in a staging environment, and remove any that are no longer relevant.
Alert Fatigue
When every minor metric deviation triggers an alert, engineers start ignoring them. The signal-to-noise ratio drops, and real incidents get buried. The solution is to tune alert thresholds based on historical data and to suppress alerts that don't require human action. Use automated remediation for the 80% of alerts that are routine (e.g., restart a failed process) and page humans only for the 20% that need judgment.
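The split between automated remediation and paging can be expressed as an explicit allowlist. This is a minimal sketch: the alert names and the second remediation entry are invented, and anything not on the list is deliberately routed to a human.

```python
from typing import Dict, Tuple

# Hypothetical allowlist: only well-understood, routine alerts get an automated action.
AUTO_REMEDIATIONS: Dict[str, str] = {
    "process_down": "restart process",
    "temp_disk_full": "clear temp files",
}


def route(alert_name: str) -> Tuple[str, str]:
    """Return ("auto", action) for allowlisted alerts, otherwise ("page", reason)."""
    action = AUTO_REMEDIATIONS.get(alert_name)
    if action is not None:
        return ("auto", action)
    return ("page", "needs human judgment")


for name in ["process_down", "checkout_latency_high"]:
    decision, detail = route(name)
    print(f"{name}: {decision} ({detail})")
```

Defaulting to "page" keeps new or unfamiliar alerts in front of a person until someone decides they are routine enough to add to the allowlist.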
Why Teams Revert
Teams revert to anti-patterns because they're easy. Writing a runbook update takes time; restarting a container takes seconds. Holding a blameless review requires emotional energy; declaring the incident closed requires none. The key to sustained improvement is to make the right thing the easy thing: automate runbook updates, schedule regular reviews, and celebrate pattern detection over quick fixes.
5. Maintenance, Drift, or Long-Term Costs
Even with a solid triage process, systems drift over time. New services are added, dependencies change, and the original assumptions behind your runbooks and dashboards become outdated. This drift has a cost: it increases mean time to resolution (MTTR) and erodes trust in your monitoring.
The Cost of Runbook Decay
Consider a runbook written a year ago for a database failover. At the time, the database was version 12, and the failover script was in /opt/scripts/. Today, the database is version 15, and the script has been moved to a new path. The on-call engineer who follows the runbook will waste 10 minutes searching for the script before realizing it's not there. That 10 minutes adds up across every incident, and it's completely avoidable.
Northpoint's approach is to treat runbooks as code: they live in a repository, undergo peer review, and have version history. When a change is made to the infrastructure, the corresponding runbook must be updated as part of the same ticket. This prevents drift at the source.
The Cost of Dashboard Sprawl
Dashboards multiply like rabbits. Each team creates its own, and soon you have hundreds of dashboards, many of which are redundant or broken. Engineers waste time searching for the right dashboard or trying to interpret conflicting metrics. The solution is to enforce a dashboard lifecycle: every dashboard must have an owner, a documented purpose, and a quarterly review date. If a dashboard hasn't been viewed in 90 days, it's archived.
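The lifecycle rules can be checked mechanically if dashboard metadata is available. The sketch below assumes a hypothetical Dashboard record with owner, purpose, and last-viewed date, and leaves out the quarterly review date for brevity; most dashboard tools expose similar fields, but the names here are not taken from any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class Dashboard:
    name: str
    owner: Optional[str]
    purpose: Optional[str]
    last_viewed: date


def lifecycle_findings(dashboards: List[Dashboard], today: date,
                       max_idle: timedelta = timedelta(days=90)) -> Dict[str, List[str]]:
    """Map each non-compliant dashboard to the list of rules it breaks."""
    findings: Dict[str, List[str]] = {}
    for d in dashboards:
        problems = []
        if not d.owner:
            problems.append("no owner")
        if not d.purpose:
            problems.append("no documented purpose")
        if today - d.last_viewed > max_idle:
            problems.append(f"not viewed in {max_idle.days} days: archive")
        if problems:
            findings[d.name] = problems
    return findings


# Invented sample data.
today = date(2024, 6, 1)
dashboards = [
    Dashboard("checkout-overview", "payments-team", "User-facing checkout SLOs",
              date(2024, 5, 20)),
    Dashboard("cache-debug-old", None, None, date(2024, 1, 10)),
]

findings = lifecycle_findings(dashboards, today)
for name, problems in findings.items():
    print(f"{name}: {', '.join(problems)}")
```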
The Hidden Cost of Incident Debt
Every incident that is resolved without a root cause fix adds to "incident debt"—the accumulated risk of unknown failure modes. This debt manifests as unpredictable outages that require longer and longer to debug. The only way to pay it down is to allocate time for proactive investigation, separate from on-call duties. Northpoint recommends that each team spend 20% of its capacity on "operational improvement"—tasks like reducing alert noise, improving runbooks, and building automation.
6. When Not to Use This Approach
The triage patterns described here are designed for systems where reliability matters and where you have the engineering resources to invest in observability and process. But not every system needs this level of rigor.
When Triage Overhead Exceeds Benefit
If your system is a prototype or an internal tool with few users, the cost of building distributed tracing, maintaining runbooks, and holding post-incident reviews may outweigh the benefit. In such cases, a simpler approach—like restarting services on failure—may be sufficient. The key is to be intentional: choose the simpler path because it's appropriate, not because you haven't thought about alternatives.
When the Failure Mode Is Known and Rare
If you have a single known failure mode that occurs once a year (e.g., a database that fills up on a specific date), a dedicated runbook page is enough. You don't need a full triage tree. The trap is over-engineering your response for scenarios that almost never happen.
When the Team Is Too Small
A two-person team can't sustain a formal on-call rotation with post-incident reviews every week. In that case, focus on the highest-impact patterns: automate the most common remediation steps and keep a shared document of known issues. You can adopt more formal processes as the team grows.
In all cases, the principle is the same: match your triage investment to the value of the system's uptime. A system that loses $1M for every hour of downtime justifies a full observability stack and a dedicated SRE team. A weekend hobby project does not.
7. Open Questions / FAQ
This section addresses common questions teams have when adopting structured triage.
How do we convince leadership to invest in observability?
Frame it in terms of cost avoidance. Take the average MTTR of your last 10 incidents, convert it to hours, and multiply by your cost of downtime per hour to get an average cost per incident. Then estimate how much observability tooling would reduce that MTTR. Even a rough calculation often makes the case. If you want outside evidence, cite only sources you have verified; absent a specific study, it is fair to say that many engineering blogs report significant MTTR reductions after implementing tracing.
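A back-of-the-envelope version of that calculation might look like the sketch below. Every number in it is a placeholder: the MTTR values, the hourly downtime cost, and especially the assumed MTTR reduction are inputs you would replace with your own figures.

```python
from typing import Dict, Sequence


def downtime_estimate(mttr_minutes: Sequence[float], cost_per_hour: float,
                      assumed_reduction: float) -> Dict[str, float]:
    """Estimate downtime cost across incidents and savings for an assumed MTTR reduction."""
    if not mttr_minutes:
        raise ValueError("need at least one incident")
    avg_hours = sum(mttr_minutes) / len(mttr_minutes) / 60
    cost_per_incident = avg_hours * cost_per_hour
    total_cost = cost_per_incident * len(mttr_minutes)
    return {
        "avg_mttr_hours": avg_hours,
        "cost_per_incident": cost_per_incident,
        "total_cost": total_cost,
        "estimated_savings": total_cost * assumed_reduction,
    }


# Placeholder inputs: ten incidents, $20,000 per hour of downtime,
# and a guessed 25% MTTR reduction from better tooling.
mttrs = [30, 45, 60, 90, 120, 15, 75, 60, 45, 60]
estimate = downtime_estimate(mttrs, cost_per_hour=20_000, assumed_reduction=0.25)

for key, value in estimate.items():
    print(f"{key}: {value:,.2f}")
```

Presenting the assumed reduction as an explicit input lets leadership challenge the guess without dismissing the rest of the arithmetic.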
What if our system is a monolith, not microservices?
The same principles apply, but the implementation is simpler. You still need structured logging and the ability to trace requests through the monolith's internal modules. The triage tree can be shallower because there are fewer service boundaries.
How often should we update runbooks?
After every incident that involves a new or changed step. Additionally, do a full review quarterly. Runbooks that haven't been touched in six months are likely out of date.
What's the best way to start if we have no observability today?
Start with structured logging. Add a correlation ID to every request and log it in all relevant services. That alone will make debugging much easier. Then add distributed tracing for the most critical user journeys. Don't try to instrument everything at once—pick a single transaction (e.g., checkout or login) and trace it end-to-end. Learn from that experience before expanding.
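As a starting point, a correlation ID can be attached to every log line with the standard library alone. This is a sketch under assumptions: the service names and messages are invented, both "services" run in one process for brevity, and in a real system the ID would travel between services in a request header rather than a shared variable.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, always including the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })


def get_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    if not logger.handlers:  # configure each logger only once
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger


def handle_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID if one was passed in; otherwise start a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        get_logger("checkout").info("order received")
        get_logger("payment").info("charging card")
        return correlation_id.get()
    finally:
        correlation_id.reset(token)


request_id = handle_request()
```

Searching the logs for one correlation_id value then returns every line for that request across services, which is the property distributed tracing later builds on.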
Should we automate all remediation?
No. Automate only the steps that are well-understood and unlikely to change. For example, restarting a failed process is safe to automate; diagnosing a new type of error is not. Leave the judgment calls to humans. Over-automation can mask problems and lead to the same symptom-chasing trap we discussed earlier.
8. Summary + Next Experiments
The two most common triage traps—chasing symptoms and treating incidents as one-offs—are easy to fall into and hard to escape. But with a structured approach built on observability, decision trees, and blameless reviews, you can route around them. The key is to invest in process and tooling before the next crisis, not during it.
Here are three experiments to try in your team over the next month:
- Implement the 5-second rule. Require on-call engineers to write a one-sentence hypothesis before taking action. Track how often the hypothesis matches the root cause.
- Run a pattern review. Gather the last 10 incidents and look for common causes. If you find any, create a runbook entry for that pattern and set up a targeted alert.
- Audit one runbook. Pick the most critical runbook (e.g., database failover) and verify every step in staging. Fix any inaccuracies and schedule a quarterly review.
These experiments won't fix everything overnight, but they will start shifting your team from reactive firefighting to proactive triage. Over time, you'll see fewer repeat incidents, lower MTTR, and a calmer on-call rotation. That's the Northpoint way.