
Determining an Acceptable False Positive Rate for Your SOC

Acceptable FPR isn't a vibes problem, it's a math problem. Plug your environment into the calculator and find the actual number your program can tolerate.

If you follow my work, you know how much I harp on the base rate fallacy. It’s one of the most important concepts a detection engineer can familiarize themselves with, and one of the least understood. In my last post I walked through why false positive rate dominates over detection rate, and why even a respectable-sounding FPR can flood an analyst’s queue with garbage. What I didn’t answer was the obvious follow-up. What’s an actually acceptable false positive rate for my org?

That question gets dodged constantly. The industry tells you false positives matter, that alert fatigue is real, that FPR is the metric that breaks SOCs. Then it stops short of telling you what number you should be aiming for. So detection engineers tune to a vibe. A rule “feels noisy” or it “feels clean”, a queue is “manageable” or it isn’t, and the threshold for what counts as acceptable shifts with whoever is on triage that week.

The good news is this isn’t a vibes problem. It’s a math problem. The inputs come from your environment, the volume of events your sensors generate per day, the rough number of intrusions you expect to see, the events each intrusion produces, and the analyst capacity you have to triage what comes out the other side. Plug those in and you can calculate the FPR your program can actually tolerate, both per rule and as a whole.

The Base Rate Fallacy in Detection

Every SOC operates on a wildly lopsided base rate. The volume of benign activity in your environment is enormous compared to the volume of malicious activity, and that imbalance is the reason FPR dominates the conversation.

Use the example from the Mythos post. A small environment generates a million events per day. There are two actual intrusions per day, and each one produces ten events that should be detectable. That’s twenty intrusive events out of a million total, or roughly 999,980 benign events to twenty intrusive ones. The probability that any single event sitting in the pipeline is part of a real intrusion is 0.00002. Two events out of every hundred thousand.

A detection rule isn’t operating against a balanced population, it’s looking at a million events to find twenty.

Take a rule with a perfect detection rate, meaning it catches all twenty intrusive events, and a false positive rate of 0.00001. The benign population is so much larger that even that tiny rate produces ten false positives a day, since 1,000,000 × 0.00001 = 10. Twenty real alerts out of thirty total. The analyst can plausibly work that queue.

Now nudge the false positive rate up to 0.001, a number that still sounds respectable on paper. The intrusion side doesn’t move, twenty real alerts. The benign side explodes to 1,000,000 × 0.001 = 1,000 false positives. Twenty real alerts buried in 1,020 total. Each alert in the queue has a roughly 2% chance of being part of a real intrusion. The analyst doesn’t know which 2% without working all 1,020.
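If you want to sanity-check that arithmetic yourself, here's a minimal Python sketch of the example above. The event counts are the example's numbers, not anything from your environment, and the variable names are mine.

```python
# The running example: one million events/day, two intrusions of ten events each.
total_events = 1_000_000
intrusive_events = 2 * 10
benign_events = total_events - intrusive_events   # 999,980

detection_rate = 1.0   # assume the rule catches every intrusive event

for fpr in (0.00001, 0.001):
    true_positives = detection_rate * intrusive_events    # 20 real alerts
    false_positives = fpr * benign_events                 # ~10 vs ~1,000
    total_alerts = true_positives + false_positives
    bdr = true_positives / total_alerts                   # chance an alert is real
    print(f"FPR {fpr}: {false_positives:,.0f} false positives, "
          f"{total_alerts:,.0f} alerts/day, {bdr:.1%} of them real")
```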

A false positive rate that looks reasonable in isolation is catastrophic when it gets multiplied by a benign population roughly 50,000 times larger than the intrusive one. To know whether your numbers are anywhere near acceptable, you need two things. The actual numbers your environment produces, and a clear understanding of why detection rate and false positive rate aren’t trading off the way most engineers assume they are.

Where the Numbers Actually Come From

The math only works with real numbers. Three inputs describe the event side, the volume of events your environment generates, the rough number of intrusions you should expect, and the events each intrusion tends to produce. A few more describe the capacity side, the team’s actual ability to triage what fires. The first three drive both per-rule and program-wide math. The capacity inputs only matter program-wide, where they set the ceiling. Each one is sourced differently.

Event volume is the easiest. Your SIEM, EDR, and XDR platforms all report daily ingestion totals. Pull a thirty- to ninety-day average from whatever you treat as your detection plane, the surface where rules actually fire. Don’t double-count if you have overlapping pipelines, and don’t include sources you don’t write detections against. The number you want is the population a rule could fire on, not your total log spend.

Intrusion frequency is harder, and the place most teams fudge the numbers. The cleanest source is your own incident response history, closed cases with confirmed malicious activity over the last twelve to twenty-four months. If your IR data is thin, look outward. Mandiant’s M-Trends, the Verizon DBIR, and your sector’s ISAC reports all publish breach frequency data broken down by industry and company size. Peer comparisons from CISO networks help triangulate. If you’re brand new and have no internal data, build a threat-model estimate from your attack surface, exposed assets, and the kinds of actors that target your sector. Be honest about the uncertainty, and consider running the calculator against both an optimistic and a pessimistic estimate to see how sensitive your acceptable FPR is to that assumption.

Events per intrusion is the bridge between intrusion count and intrusive event count, and it’s the input most engineers haven’t thought about explicitly. Sources include closed IR cases where you have the full event timeline, red team and purple team engagements, breach-and-attack simulation runs, and vendor incident reports that publish event-level detail. Even rough numbers help. An intrusion that fires on five distinct events is a different problem from one that fires on fifty, and your acceptable FPR shifts accordingly.

Analyst capacity is what turns per-rule math into program-wide math. You need four numbers. The number of analysts actually on triage, the hours each works per day, the average time it takes to clear a false positive, and the average time it takes to investigate a true positive. Pull the time numbers from your case management system, ticket close-out times averaged over the last quarter or two work fine. If you don’t track that, time-box a week of triage and have analysts log it. Headcount and hours come from your staffing model, but account for vacation, training, and on-call rotations rather than using nominal full-time-equivalent counts. The output is a daily alert ceiling. Above that ceiling, the queue grows faster than the team can drain it, and the program is broken regardless of how good any individual rule looks.
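Here's a minimal sketch of that capacity arithmetic. Every number below is a made-up placeholder, and the expected false-positive share used for the weighted average is an assumption you'd replace with your own alert mix.

```python
# Staffing inputs -> daily alert ceiling. All numbers are placeholders.
analysts_on_triage = 4            # people actually working the queue
triage_hours_per_analyst = 6      # per day, after meetings, training, on-call
minutes_per_false_positive = 10   # avg close-out time from case management
minutes_per_true_positive = 90    # avg investigation time
expected_fp_share = 0.95          # assumed fraction of alerts that are false positives

# Weighted average minutes per alert, given the expected alert mix.
avg_minutes_per_alert = (expected_fp_share * minutes_per_false_positive
                         + (1 - expected_fp_share) * minutes_per_true_positive)

triage_minutes_per_day = analysts_on_triage * triage_hours_per_analyst * 60
daily_alert_ceiling = triage_minutes_per_day / avg_minutes_per_alert
print(f"Daily alert ceiling: {daily_alert_ceiling:.0f} alerts")
```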

Plug them into the calculator further down and you’ll have a starting point that beats whatever you’ve been guessing at.

Detection Rate and False Positive Rate Aren’t Inverses

This is the misconception that ruins back-of-the-envelope tuning. Engineers see a rule with a 99% detection rate and assume the false positive rate is the other 1%. It isn’t. The two numbers are computed against completely different populations, and they don’t have to trade off with each other in any clean way.

Detection rate is true positives over actual intrusive events. Using the example, twenty intrusive events per day means the most a rule can ever catch is twenty hits, and its detection rate is the fraction of those it catches. If it gets all of them, detection rate is 1.0. If it catches half, 0.5. The ceiling is twenty.

False positive rate (FPR) is false positives over actual benign events. The denominator there is roughly 999,980. A false positive rate of 0.001 generates a thousand false hits because the rule is being measured against a population almost a million events large. Same arithmetic, completely different scale.

The two rates can move together when you tune one rule’s threshold, the standard ROC curve picture. Loosen the threshold and both go up, tighten it and both go down. But the magnitudes don’t trade evenly. Pushing detection rate from 0.9 to 1.0 buys you two more real catches. Pushing false positive rate from 0.0001 to 0.001 generates 900 more false hits. The detection rate side is bounded by twenty. The false positive rate side scales with the benign population.
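To put the asymmetry in concrete terms, a few lines against the same example populations; the deltas are the ones quoted above, nothing new.

```python
# Same example populations: 20 intrusive events, ~999,980 benign events.
intrusive_events = 20
benign_events = 999_980

# Pushing detection rate from 0.9 to 1.0 buys two more real catches.
extra_catches = (1.0 - 0.9) * intrusive_events            # 2
# Letting FPR drift from 0.0001 to 0.001 adds roughly 900 false positives.
extra_false_positives = (0.001 - 0.0001) * benign_events  # ~900

print(round(extra_catches), round(extra_false_positives))
```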

That asymmetry is the whole game. Every conversation about whether a rule is acceptable, whether a program is sustainable, whether an analyst can keep up, comes back to false positive rate getting multiplied by a number that detection rate never sees. The next step is putting program-wide numbers on it, what FPR your environment can actually tolerate.

Determining FPR for the Program as a Whole

For a program in aggregate, acceptable FPR is bounded by two ceilings, and the binding one depends on your environment.

The Trust Ceiling

Pick a target Bayesian detection rate (BDR), the probability that any given alert is a real intrusion, and solve Axelsson’s Bayes formula for the maximum FPR that delivers it. Using his example numbers, one million events per day, two intrusions per day, ten events each, a Bayesian detection rate of 66% requires an FPR no higher than 0.00001, and that’s at a perfect detection rate. At a more realistic 0.7 detection rate, the same FPR yields a BDR closer to 58%. Drop the BDR target lower and you can tolerate a higher FPR. Raise it and you can’t.
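If it helps to see the rearrangement, here's a sketch of solving the Bayes formula for the maximum FPR at a target BDR. The function name and argument layout are mine, not Axelsson's; the example call uses his numbers.

```python
def max_fpr_for_trust(total_events, intrusions_per_day, events_per_intrusion,
                      target_bdr, detection_rate=1.0):
    """Highest per-event FPR that still delivers the target Bayesian
    detection rate, from BDR = DR*P(I) / (DR*P(I) + FPR*P(not I))."""
    intrusive = intrusions_per_day * events_per_intrusion
    p_intrusive = intrusive / total_events      # the base rate, P(I)
    p_benign = 1 - p_intrusive                  # P(not I)
    # Rearranged for FPR:
    return detection_rate * p_intrusive * (1 - target_bdr) / (target_bdr * p_benign)

# Axelsson's example: 1M events, 2 intrusions x 10 events, 66% BDR target.
print(max_fpr_for_trust(1_000_000, 2, 10, target_bdr=0.66))   # ~1e-5
```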

Axelsson’s bottom line, for this kind of environment, is what he calls the “very high standard” of less than 1/100,000 per event. He gets there by combining the math with research from process-control settings (nuclear plants, paper mills, ship bridges) showing that human operators start ignoring an alarm system well before half of its alarms turn out to be false. Once trust drops below that threshold, the alert system becomes background noise regardless of whether the underlying detection works.

The Capacity Ceiling

Multiply your triage analysts by the hours each works per day, convert that to minutes, and divide by the weighted average minutes per alert. That’s your daily alert ceiling. Subtract expected true positives, divide by total benign events, and you get the FPR your team can absorb without the queue growing faster than they can drain it.
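A sketch of that conversion, in the same spirit as the trust-ceiling helper above; the function name is mine, and the 100-alert ceiling in the example call is illustrative, not prescriptive.

```python
def max_fpr_for_capacity(daily_alert_ceiling, intrusions_per_day,
                         events_per_intrusion, total_events, detection_rate=1.0):
    """FPR the team can absorb: alerts they can work per day, minus the
    true positives they expect, spread over the benign event population."""
    intrusive = intrusions_per_day * events_per_intrusion
    expected_true_positives = detection_rate * intrusive
    benign_events = total_events - intrusive
    fp_budget = max(daily_alert_ceiling - expected_true_positives, 0)
    return fp_budget / benign_events

# A 100-alert/day ceiling against the example environment:
print(max_fpr_for_capacity(100, 2, 10, 1_000_000))   # ~8e-5
```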

Axelsson uses 100 false positives per day as a benchmark for what a single human can reasonably handle. The capacity formula refines that benchmark with your team’s actual staffing and triage times. It’s an operational extension of Axelsson’s ceiling, not in the paper as a formula, but in the spirit of his Section 5.2 argument that the analyst is the ultimate bottleneck.

Which Ceiling You’ll Hit First

Acceptable program-wide FPR is whichever ceiling you hit first. The lower number wins.

If trust hits first, alert quality is the bottleneck. Your team has triage capacity to spare, but the alerts they’re working aren’t trustworthy enough to act on, and they’ll learn to ignore them.

If capacity hits first, the queue is the bottleneck. Your alerts are trustworthy, but there are too many of them, and the team can’t drain the queue fast enough to keep up.

Known Limitations and Defaults

Axelsson’s framework treats the IDS as a single binary classifier evaluating a single population. Modern detection planes don’t look like that, you have many rules each running against its own slice of telemetry, and the cleanest math is per-rule. Axelsson himself flagged this in his future-work section as the “unit of analysis problem”, noting that “we have somewhat skirted the issue, by declaring the unit length to be 10 audit records” and that a more thorough study “would define different units of measurement for both different intrusion detection mechanisms, and different types of intrusions” (§7, pp. 201-202). Per-rule isn’t a contradiction of the paper, it’s the open work the paper called out. I’ll cover how per-rule denominators work in a follow-up post, when the math holds (SIEM rules with clean, countable populations) and when it doesn’t (black-box NGAV engines that don’t expose what they evaluated). The program-wide framing is still accurate enough to give you a directional answer for the program as a whole.

The other gap is events per intrusion. To compute detection rate honestly, you need the count of intrusive events your detectors evaluated, including the ones your rules missed, and most teams don’t routinely collect this as a metric. It surfaces after the fact through closed IR cases, during the after-action review, when the detection team looks for missed opportunities. If you don’t have data of your own, Axelsson’s defaults work as a starting point, two intrusions per day, ten events per intrusion (§5.1, p. 191), and a target BDR of 66% read off his Bayes plot (§5.3, p. 193). Most orgs don’t see an intrusion every day, decimals are fine, use 0.5 for one every other day. Run the calculator with those numbers to anchor yourself, then swap in your own as you collect them.

Acceptable FPR Calculator

Plug in your numbers. The lower of the two ceilings is the FPR your program can actually tolerate.
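If you’d rather run it as a script, here’s a rough stand-in that takes the lower of the two ceilings, reusing the two helpers sketched earlier; the inputs are Axelsson’s defaults plus the illustrative 100-alert ceiling, and you’d swap in your own numbers.

```python
# The lower of the two ceilings is the acceptable program-wide FPR.
trust_ceiling = max_fpr_for_trust(1_000_000, 2, 10, target_bdr=0.66)
capacity_ceiling = max_fpr_for_capacity(100, 2, 10, 1_000_000)

acceptable_fpr = min(trust_ceiling, capacity_ceiling)
binding = "trust" if trust_ceiling <= capacity_ceiling else "capacity"
print(f"Trust ceiling:    {trust_ceiling:.2e}")
print(f"Capacity ceiling: {capacity_ceiling:.2e}")
print(f"Acceptable FPR:   {acceptable_fpr:.2e} ({binding} ceiling hits first)")
```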

Lower detection rate means fewer real intrusions in the alert mix (less trust) and fewer real cases to work (more capacity). The two ceilings move in opposite directions.

Closing

Determining an acceptable false positive rate doesn’t have to be a feeling. It can be a number you can defend with math. Axelsson laid out the trust ceiling using Bayes, the operator-trust threshold comes from process-control research that predates it, and the capacity ceiling is just arithmetic against your own staffing. Plug in your numbers, take the lower of the two ceilings, and that’s your target. The next time you justify killing a rule or capping a queue, you can point at a derivation instead of an opinion.
