An Incident Isn't an Event. It's a Definition.

A backend engineer's first instinct when asked to build a log analysis tool is to build the ingestion endpoint.
You add a database table. You wire up a controller. You hit the endpoint with Postman or curl, see the row appear in Postgres, and feel like you've made progress. Two evenings of work, and ingestion feels done.
I know this because that’s what I did when I started TraceRoot.
Two evenings on ingestion. The next three weeks went into the part I hadn’t planned for: deciding what to do with the logs after they arrived.
Logs are events. Incidents are stateful objects with lifecycles. Going from a stream of POST /logs calls to something that becomes a useful incident is the kind of work that usually doesn’t get written about.
It’s also the work that determines whether the tool produces something engineers actually want to look at, or just a slow events table with extra steps.
This post walks through the decisions TraceRoot makes about what counts as an incident. None of them are technically hard. All of them are opinions encoded as rules. Each one has a wrong default that many systems quietly inherit, and a tradeoff the right choice forces you to accept.
These are mine. They’re not the only valid answers. They’re the ones the system enforces.
What makes two logs the same incident?
Once logs are flowing into Postgres, the next question comes quickly: which of these are the same problem?
This was the first real design decision to make. Two NullPointerException logs from inventory-service are obviously related. But what about a NullPointerException and a TimeoutException from the same endpoint?
Are they the same underlying bug or different ones? What about the same exception type from two different services? What about logs that share a trace ID but happen ten seconds apart?
Every one of these has a defensible answer. The problem is that the system has to pick one and apply it consistently to millions of logs without thinking about it.
TraceRoot uses four fields:
public String buildPatternKey(LogRecord record) {
    String exceptionType = normalizeExceptionType(
        record.getExceptionType()
    );
    String endpoint = normalizeEndpoint(
        record.getEndpoint()
    );
    return record.getServiceName() + "|" +
        record.getLevel() + "|" +
        exceptionType + "|" +
        endpoint;
}
That's the fingerprint. Same four fields, same incident. Different on any one of them and it’s a different incident.
The reason these four specific fields, and not others, comes down to what each one captures that the others can't.
ServiceName: Two services failing in similar ways are different incidents. A NullPointerException in payment-service and the same exception in inventory-service might share a root cause, but they have different on-call paths, different rollback decisions, and different blast radii.
Level: ERROR belongs in a fingerprint. WARN is often informational. INFO is usually noise. Mixing them produces meaningless groupings, where a routine warning gets folded into a real incident because it matched the other three fields.
ExceptionType: This is where TimeoutException and NullPointerException get separated even when they happen at the same endpoint. They are different bugs and have different fixes. Grouping them by endpoint alone would mask both.
Endpoint: Two endpoints in the same service throwing the same exception type might point to a shared library bug, or might be two unrelated bugs. Splitting by endpoint preserves that distinction and merging would lose it.
The fields left out of the fingerprint matter just as much as the ones included. Three are tempting to add and would each break the model in a different way:
Message text: Messages drift on every occurrence. "User John Doe timed out after 4827ms" and "User Jane Doe timed out after 5102ms" are the same incident, but their messages differ every time. Including message in the fingerprint would create thousands of one-off incidents that should be one.
Timestamp: Timestamp is what we're trying to move beyond when we go from logs to incidents. Including it would defeat the entire purpose.
Trace ID: A trace is one request. An incident might span thousands of traces. Including traceId would mean every retry storm produces dozens of "incidents" instead of one.
The four fields (serviceName, level, exceptionType, and endpoint) are the smallest set that captures real differences without splitting on noise. That's the whole rule.
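The fingerprint leans on normalizeExceptionType and normalizeEndpoint, which aren't shown in this post. One plausible shape for them is sketched here. The rules (short class names, numeric or UUID path segments collapsed to {id}, query strings dropped, an UNKNOWN fallback) and the class name PatternKeySketch are illustrative assumptions, not TraceRoot's actual code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the normalization rules below are assumptions for illustration.
final class PatternKeySketch {

    // Assumed rule: a path segment that is all digits or a UUID is an identifier.
    private static final Pattern ID_SEGMENT = Pattern.compile(
        "/(\\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)");

    /** Drops the query string and replaces identifier-like segments with {id}. */
    static String normalizeEndpoint(String endpoint) {
        if (endpoint == null || endpoint.trim().isEmpty()) {
            return "UNKNOWN";
        }
        String trimmed = endpoint.trim();
        int queryStart = trimmed.indexOf('?');
        String path = queryStart >= 0 ? trimmed.substring(0, queryStart) : trimmed;
        return ID_SEGMENT.matcher(path).replaceAll("/{id}");
    }

    /** Reduces a fully qualified class name to its simple name. */
    static String normalizeExceptionType(String exceptionType) {
        if (exceptionType == null || exceptionType.trim().isEmpty()) {
            return "UNKNOWN";
        }
        String trimmed = exceptionType.trim();
        return trimmed.substring(trimmed.lastIndexOf('.') + 1);
    }

    /** Joins the four fingerprint fields in the same order as buildPatternKey. */
    static String buildPatternKey(String serviceName, String level,
                                  String exceptionType, String endpoint) {
        return serviceName + "|" + level + "|"
            + normalizeExceptionType(exceptionType) + "|"
            + normalizeEndpoint(endpoint);
    }
}
```

Under these assumed rules, /orders/123 and /orders/456?retry=true produce the same key, while a different service, level, exception type, or route still produces a different one.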
How many occurrences make an incident?
Fingerprinting groups logs that belong together. The next question is when a group of logs becomes worth showing to an engineer.
Before settling on three errors in five minutes, I considered the obvious alternatives. Each fails in a specific way.
Threshold of one: Every error becomes an incident. Most ad-hoc systems start here. The result is 200 alerts a day, 195 of which are noise, and on-call engineers who start ignoring the alert channel by week two. The signal-to-noise ratio collapses fast.
Threshold of one hundred in an hour: This catches sustained problems but misses fast-burn incidents that resolve themselves before the count climbs. For example, a payment provider goes down for ninety seconds, throws fifty timeouts, and recovers. This is important to know about, but it never gets surfaced.
No threshold, alert on rate change instead: This is smarter, but requires baseline data the system doesn't have on day one. It’s useful as a layer on top of threshold-based detection, not a replacement.
TraceRoot's choice is three matching errors within five minutes:
public static final int INCIDENT_THRESHOLD = 3;
// inside createLog(...), after active and resolved checks:
LocalDateTime windowStart = LocalDateTime.now().minusMinutes(5);
List<LogRecord> matchList =
    logRepository.findByServiceNameAndLevelAndExceptionTypeAndEndpointAndTimestampAfter(
        record.getServiceName(),
        record.getLevel(),
        record.getExceptionType(),
        record.getEndpoint(),
        windowStart
    );
if (matchList.size() >= INCIDENT_THRESHOLD) {
    incidentService.createIncident(
        fingerPrint,
        matchList.size(),
        request.getTimestamp()
    );
}
Three is enough to distinguish a transient blip from a pattern. A one-off NullPointerException from a malformed request isn't an incident; the same exception three times in five minutes is. Five minutes is short enough that detection fires before an engineer would notice on their own. Long enough that legitimate retries within a single user flow don't trip the threshold.
These numbers are not universal. Three in five minutes is the right default for the kinds of errors TraceRoot is built to catch. A different system, with different traffic patterns or different cost-of-missing-an-incident, would tune them differently. The point is not the specific values. The point is that there is a threshold, and it is explicit, and it lives in one place where it can be changed.
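To make the "explicit and in one place" point concrete, here is a minimal in-memory sketch of the same rule with both numbers passed in as configuration. It is not how TraceRoot does it (TraceRoot queries Postgres). The class ThresholdDetectorSketch is hypothetical, it measures the window from each log's own timestamp rather than the server clock, and it assumes logs for a fingerprint arrive roughly in time order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for the repository query.
final class ThresholdDetectorSketch {
    private final int threshold;
    private final Duration window;
    // Timestamps of recent occurrences, per fingerprint, oldest first.
    private final Map<String, Deque<Instant>> recent = new HashMap<>();

    ThresholdDetectorSketch(int threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    /**
     * Records one occurrence and returns true when the fingerprint has at
     * least `threshold` occurrences strictly after (occurredAt - window).
     */
    boolean record(String patternKey, Instant occurredAt) {
        Deque<Instant> times = recent.computeIfAbsent(patternKey, k -> new ArrayDeque<>());
        times.addLast(occurredAt);
        Instant windowStart = occurredAt.minus(window);
        // Evict everything that is not strictly after the window start.
        while (!times.isEmpty() && !times.peekFirst().isAfter(windowStart)) {
            times.removeFirst();
        }
        return times.size() >= threshold;
    }
}
```

With new ThresholdDetectorSketch(3, Duration.ofMinutes(5)), three matching errors one minute apart trip the detector on the third, while three errors spaced 200 seconds apart never have more than two inside the window.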
What this misses, on purpose, is a single critical error that should fire without waiting. A DataCorruptionException happening once is more important than three TimeoutExceptions in five minutes. Threshold-based detection doesn't know that. Severity-based override paths solve it. TraceRoot doesn't have one yet.
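For illustration only, a severity override could be as small as a set lookup in front of the count check. TraceRoot has no such path today, so everything in this sketch is hypothetical: the class name, the method, and the contents of the critical list.

```java
import java.util.Set;

// Hypothetical sketch: TraceRoot does not implement this fast path.
final class SeverityFastPathSketch {
    static final int INCIDENT_THRESHOLD = 3;

    // Assumed list of exception types that should never wait for a count.
    private static final Set<String> CRITICAL_TYPES = Set.of("DataCorruptionException");

    /** Returns true if an incident should be created for this occurrence. */
    static boolean shouldCreateIncident(String exceptionType, int matchesInWindow) {
        // Null check first: immutable sets reject contains(null).
        if (exceptionType != null && CRITICAL_TYPES.contains(exceptionType)) {
            return true; // fast path: one occurrence is enough
        }
        return matchesInWindow >= INCIDENT_THRESHOLD;
    }
}
```

The cost of this design is that someone has to maintain the list, and an exception type nobody thought to add still waits for the threshold.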
When is an incident closed?
This is the decision many observability tools either skip or get wrong. A naive design says incidents close when an engineer marks them resolved. New occurrences of the same fingerprint create new incidents. It's clean, easy to implement and also the wrong model for this problem.
I learned this the hard way on a previous team. We had a database query that timed out for one user, every Wednesday afternoon, for about three months. Same query, same error, same fingerprint. Our incident tool created a fresh incident every Wednesday. Twelve different incidents over twelve weeks. Each one got triaged from scratch. Each one got resolved as "intermittent, can't reproduce." Each one got closed.
It wasn't twelve different bugs. It was one bug, showing up twelve times. The tool couldn't tell us that because it didn't model continuity over time.
TraceRoot models continuity through a reopen window. When a fingerprint matches a resolved incident within twenty-four hours, the existing incident reopens instead of creating a new one.
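LocalDateTime resolvedAt = incident.getResolvedAt();
if (resolvedAt == null) {
    return false;
}
// Time elapsed since the incident was resolved.
Duration sinceResolved = Duration.between(
    resolvedAt,
    LocalDateTime.now()
);
// Compare Durations directly: toHours() truncates, so 24h59m would pass a "> 24" check.
if (sinceResolved.compareTo(Duration.ofHours(24)) > 0) {
    return false;
}
// Within the window: reopen the existing incident instead of creating a new one.
incident.setIncidentStatus(IncidentStatus.ACTIVE);
incident.setEventCount(incident.getEventCount() + 1);
incident.setSummaryStale(true);
incident.setResolvedAt(null);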
What the engineer sees is one incident that accumulates events over the day, with the recovery and recurrence visible in the metadata. Not a fresh duplicate every time the same problem resurfaces.
Why twenty-four hours specifically? Many "the bug came back" cases happen within a working day. After that, code has been deployed and dependencies have shifted. A new occurrence is more likely to be a new incident at that point. Twenty-four hours captures the worst-case "fix didn't actually work" window without dragging stale context forward forever.
The trade-off is the long-running intermittent bug that recurs every thirty hours and therefore never reopens. To the system it looks like an endless series of fresh incidents, and a weekly recurrence like the Wednesday timeout above would still land outside the window. The fix is a longer window, but that creates its own problems by carrying old incidents into new code. Twenty-four hours is the line that catches most cases without making the incident table a graveyard.
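The window is easier to reason about, and to test, when the decision is a pure function of two timestamps. This is a sketch under assumptions, not TraceRoot's code: ReopenPolicySketch and shouldReopen are hypothetical names, and the current time is passed in rather than read from the clock. It compares Durations directly because Duration.toHours() truncates to whole hours.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical sketch: the reopen decision isolated from the incident entity.
final class ReopenPolicySketch {
    static final Duration REOPEN_WINDOW = Duration.ofHours(24);

    /** Returns true if a new matching log should reopen the resolved incident. */
    static boolean shouldReopen(LocalDateTime resolvedAt, LocalDateTime now) {
        if (resolvedAt == null) {
            return false; // never resolved, so there is nothing to reopen
        }
        Duration sinceResolved = Duration.between(resolvedAt, now);
        // Guard against clock skew, then apply an inclusive 24-hour limit.
        return !sinceResolved.isNegative()
            && sinceResolved.compareTo(REOPEN_WINDOW) <= 0;
    }
}
```

Under this policy a recurrence 23 hours 59 minutes after resolution reopens the incident. One at 24 hours 30 minutes, 30 hours, or a week later does not.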
Where the rules stop
These decisions get you a working incident model. They don't get you a complete one. Three real gaps:
Single critical errors: A DataCorruptionException happening once matters more than three timeouts in five minutes. Threshold-based detection delays the first alert by definition. The fix is a severity-based fast path that bypasses the count for known-critical exception types. TraceRoot doesn't have one yet.
Cross-service correlation: A payment-service timeout often causes an order-service NullPointerException two seconds later. The fingerprint logic treats them as separate incidents. They are related, and the system has no way to know it. Span-level correlation in a tracing system solves this. The incident model alone can't.
Rate changes against baseline: A service that used to throw zero errors per hour and now throws fifty isn't reliably caught by a threshold of three. If those errors are spread across enough fingerprints that none reaches three in five minutes, nothing fires. The slope matters, not just the floor. This is a different detection algorithm (historical baselines, statistical confidence) running alongside fingerprinting, not replacing it.
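To show how different the third gap is in kind, here is about the simplest baseline check there is: flag the current hour if its error count exceeds the historical mean by k standard deviations. TraceRoot does not implement this. The class name, the hourly bucketing, and the floor of 1.0 on the deviation (so a perfectly quiet baseline doesn't flag a single error) are all assumptions made for the sketch.

```java
// Hypothetical sketch: a mean-plus-k-sigma check over hourly error counts.
final class BaselineRateSketch {

    /**
     * Returns true if currentHourCount exceeds mean + k * stdDev of the
     * baseline, using a minimum deviation of 1.0 for flat baselines.
     */
    static boolean isAboveBaseline(long[] baselineHourlyCounts,
                                   long currentHourCount, double k) {
        if (baselineHourlyCounts.length == 0) {
            return false; // no history yet, so no opinion
        }
        double sum = 0;
        for (long count : baselineHourlyCounts) {
            sum += count;
        }
        double mean = sum / baselineHourlyCounts.length;

        double squaredDiffs = 0;
        for (long count : baselineHourlyCounts) {
            squaredDiffs += (count - mean) * (count - mean);
        }
        // Population standard deviation of the baseline window.
        double stdDev = Math.sqrt(squaredDiffs / baselineHourlyCounts.length);

        double limit = mean + k * Math.max(stdDev, 1.0);
        return currentHourCount > limit;
    }
}
```

With a day of zero-error hours as the baseline and k = 3, fifty errors in the current hour is flagged and two is not. The empty-baseline branch is the day-one problem in code form: without history, the check has nothing to say.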
Each one is a legitimate detection problem in its own right, with its own algorithms, data requirements, and trade-offs. Threshold-based fingerprinting is the foundation other approaches build on, not a replacement for them.
The reason to be explicit about scope is that most tooling isn't. Dashboards imply more capability than they deliver. Knowing where the rules stop is what makes the rules trustworthy.
Why detection is the hard part
For many teams, ingestion, storage, and search are mostly solved problems. You can wire up a competent pipeline in a weekend, and Postgres or OpenSearch handle the rest.
Detection isn't fully solved. Not because the algorithms are hard. They aren't. The threshold check in this article is six lines of code. The fingerprint is a method that joins four strings. The reopen logic is a date comparison.
Detection is hard because every rule embeds a worldview about what counts as one thing. Get the worldview wrong and the incident table becomes noisy and unreliable. Too many incidents, too few, or the wrong ones grouped together. Get it right, and an on-call engineer at 11 p.m. sees 3 incidents instead of 847 events. The work the system did to decide what counts as one thing is what made that list useful.
The point isn't that TraceRoot's worldview is the right one. It’s that somebody has to encode a worldview, and that decision determines everything downstream. The summary, the dashboard, the alert, and the postmortem all inherit that decision.
The full implementation is on GitHub.



