Fuzzy Hashing Research: A Paper Highlight with Practitioner's Notes
I recently found a great research paper on fuzzy hashing detection that's worth highlighting: Bytewise approximate matching: Evaluating common scenarios for executable files[1]. I work extensively with fuzzy hashes, particularly TLSH, and I'm always on the lookout for new research on the topic.
TLSH is the focus of my LimaCharlie extension[2], CelesTLSH (pronounced Celestial-S-H)[3], which is used by just under 100 unique organizations across all major operating systems.
I don’t have visuals because I treat client data like toxic waste and collect as little as possible. Instead, I’m sharing anecdotal lessons from the real-world tactics I've used to help clients cut down on TLSH false positives. I'll cover what worked, what didn't, and why fuzzy hashing is still a critical tool for defenders.
Paper Highlights
I really think you should read the paper for yourself, so I'm only going to share some of the relevant findings so as not to steal the authors' thunder.
Check it out here: https://www.sciencedirect.com/science/article/pii/S2666281725000666[1]
NOTE: Normally, when it comes to TLSH, the lower the score, the more similar the two files. A high score means they're very different.
For this study, the researchers flipped that logic to make it easier to read, like a percentage. They converted the score so that 100 means 'very similar' and 0 means 'not similar at all'. This just helps compare TLSH's results directly against other tools that also use a 0-100 similarity scale. The normalization method is defined in the paper for those curious.
TL;DR: For this research paper, higher means more similar and lower means less similar.
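The paper defines its own normalization method, so the snippet below is only an illustrative guess at the general idea of flipping a distance into a 0-100 similarity value; it is not the authors' formula, and the cap of 300 is an arbitrary assumption.

```python
# Illustrative only: one simple way to map a TLSH distance (0 = identical)
# onto a 0-100 similarity scale. The cap of 300 is an arbitrary assumption;
# the paper defines its own normalization method.
def similarity_from_distance(distance, cap=300):
    clamped = min(max(distance, 0), cap)
    return round(100 * (1 - clamped / cap))

print(similarity_from_distance(0))    # 100 -> very similar
print(similarity_from_distance(300))  #   0 -> not similar at all
```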
The researchers put four major fuzzy hashing algorithms (TLSH, ssdeep, sdhash, and MRSHv2) to the test across a few common scenarios, like tracking software updates and separating malware from benign OS files. Their overall conclusion was pretty interesting: the performance was generally unsatisfactory across the board.
They found that for tasks like telling malware apart from legitimate files, the algorithms often get confused by shared code from libraries or runtimes, leading to misclassifications. The researchers did find, however, that:
an intra-family comparison revealed that TLSH may be moderately helpful in identifying similar versions within a labeled set, reducing the number of versions for detailed analysis.
Of the fuzzy hashing algorithms, it appears that TLSH was the most effective when combined with labeled data.
However, their ultimate conclusion was:
bytewise approximate matching should not be used as a standalone, all-encompassing solution for most scenarios we considered
I agree with this conclusion, but with some nuance which we'll cover later.
One of the most significant findings wasn't about the algorithms themselves, but their code. The researchers found a major bug in the sdhash reference implementation that caused massively inflated similarity scores, affecting over 50% of the files they tested. They also uncovered a more minor issue in MRSHv2's comparison logic, which could misidentify the smaller of two hashes and produce faulty similarity scores.
The Role of Hashes in Modern Detection
The paper’s conclusion is that fuzzy hashes shouldn't be a standalone solution. This lines up with a bigger shift in how we should think about indicators of compromise (IOCs). The days of relying on simple, atomic indicators for alerts are mostly behind us outside of DNS.
Traditional file hashes like SHA256 are nearly useless for proactive detection. An attacker can flip a single bit to generate a brand-new hash, making signature-based alerts obsolete. David Bianco’s analysis of malware submissions drove this point home: hashes seen by more than 10 organizations made up just 0.11% of all malicious files[4] evaluated. The chance of a hash from a threat feed actually appearing in your environment is almost zero. Their real strength, as Bianco notes, is in retro-hunting, using a known-bad hash to find other instances of an infection after you've confirmed it.
Fuzzy hashes are different, but as the research shows, they introduce false positives that traditional hashes do not, and because of this they need to be handled with care. For the most part, with the exception of smaller and mid-size organizations, I agree with the paper's authors: they can't be the only signal you rely on. So, what's their role?
The future for indicators like fuzzy hashes is in enrichment, not alerting. Instead of a fuzzy hash match triggering a standalone alert for a SOC analyst to chase, it should act as another piece of context for a more sophisticated detection model.
Imagine a process execution event from your EDR that's automatically enriched with data like: "This binary is 85% similar to Sliver." This context is a powerful new signal for risk-based alerting models and gives hunters a new pivot point. It also gives us new ways to think about behavioral detection.
Let's use a simple scenario. Suppose you work for a nightmare organization which commonly uses Microsoft Word to execute PowerShell, to then download and run encoded scripts for some reason. You're trying to find malicious uses, like a Word document spawning PowerShell to launch an attack tool.
- An alert for winword.exe spawning powershell.exe might be too noisy on its own, even when combined with the behaviors of downloading and executing encoded strings from the internet.
- An alert may not be triggered for powershell.exe spawning a malicious process that isn't another LOLBIN, because, as mentioned before, this may be normal in your environment, and it utilizes the Windows API, which is nearly invisible to most blue teams and limited to closed machine learning models.

If you were to enrich your process executions with TLSH matches, you can build much smarter detections. For example: winword.exe spawns powershell.exe, which then runs a file that is >80% similar to a known attack tool. Suddenly, a common, noisy behavior becomes a much higher-fidelity signal worth investigating. You're not using the fuzzy hash as a verdict; you're using it as a crucial piece of evidence to add weight to a suspicious chain of events. This is where their real power lies in the future.
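To make that concrete, here is a minimal, purely illustrative sketch of risk-based scoring where a TLSH enrichment is just one weighted signal among several. The event fields, weights, and the 0-100 similarity value are assumptions for the example, not any particular EDR's schema or CelesTLSH's actual logic.

```python
# Illustrative only: a toy risk score where TLSH similarity is one signal
# among several. Field names, weights, and thresholds are assumptions.
def score_process_event(event, tlsh_match):
    """Return a rough risk score for a process execution event."""
    score = 0
    if event.get("parent_image", "").lower().endswith("winword.exe"):
        score += 20  # Office application spawning a child process
    if event.get("image", "").lower().endswith("powershell.exe"):
        score += 20  # scripting host in the chain
    if "-enc" in event.get("command_line", "").lower():
        score += 20  # encoded command line, common here but still notable
    if tlsh_match and tlsh_match.get("similarity", 0) >= 80:
        score += 40  # executed binary resembles a known attack tool
    return score

event = {
    "parent_image": "C:\\Program Files\\Microsoft Office\\WINWORD.EXE",
    "image": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe",
    "command_line": "powershell.exe -enc SQBFAFgA...",
}
print(score_process_event(event, {"family": "Sliver", "similarity": 85}))  # 100
```

No single signal in that chain is worth an analyst's time on its own; it is the stacking of weak signals, with the fuzzy hash match as the heaviest weight, that makes the event worth investigating.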
Why Fuzzy Hash Enrichment is Still Out of Reach
Using fuzzy hashes in a detection or threat intelligence program is still out of reach for most organizations. It's typically limited to teams with the in-house expertise and resources to build and manage the service themselves. There are a few reasons for this:
First, scaling. Traditional hash lookups are simple one-to-many checks. You take one hash and check it against a massive list. In computer science terms, it's simply checking if a string is contained within an array. TLSH is different. To find a match, you have to calculate the "distance" between a new file and every single hash in your threat database. What was once a single lookup becomes thousands of comparisons. As your collection of threat hashes grows, so does the performance cost.
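As a rough sketch of the difference, the snippet below scans a hypothetical threat database linearly using the py-tlsh package; threat_db and the distance threshold are assumptions, not a reference implementation.

```python
# A minimal sketch of why TLSH matching scales linearly with database size.
# Assumes the py-tlsh package (pip install py-tlsh); threat_db is a
# hypothetical list of (family_name, tlsh_digest) pairs.
import tlsh

def best_match(candidate_digest, threat_db, max_distance=50):
    """Return the closest (family, distance) within max_distance, or None.

    Unlike a SHA256 set lookup (effectively O(1)), this must compute a
    distance against every entry, so cost grows with the threat list.
    """
    best = None
    for family, known_digest in threat_db:
        distance = tlsh.diff(candidate_digest, known_digest)  # lower = more similar
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (family, distance)
    return best
```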
This performance hit can lead to higher CPU usage or delays in log generation. As a result, real-time enrichment at the moment of execution is often not practical. It's why most EDR tools that calculate a TLSH hash don't perform the distance check themselves; they provide the hash and expect your team to handle the comparison out-of-band.
Second, slow adoption. The security community has been slow to adopt fuzzy hashing as a primary detection method. While this is changing as detection programs mature, the performance issues and the potential for false positives, as highlighted in the research paper, have made teams hesitant. Anecdotally, my consulting experience has shown that fuzzy hashing is typically adopted by teams that already have a very mature detection program and are looking for new or novel ways to detect activity.
Third, a lack of high-quality data. The scaling and false positive risks mean you need a well-maintained list of threat hashes to get value. Slow adoption has created a data vacuum. Most threat intel platforms only calculate a TLSH hash if you upload a file; they don't offer well-categorized feeds. Outside of a few paid enterprise services, almost none provide a way to search by hash similarity. This forces organizations to build their own tooling and collect their own hashes, which they rarely share publicly, creating a cycle that reinforces the slow adoption.
Note: I run a feed that tracks as many publicly hosted attack tools and C2 frameworks as possible, and I add new ones consistently. You can find it here: https://github.com/magonia-Research/CelesTLSH-Hashes/[5]
My Detection Environment
Before discussing the nuances of false positives, it helps to understand my detection environment and how CelesTLSH works. This context explains why my real-world results might differ from the paper's findings.
CelesTLSH is a LimaCharlie extension deployed in just under 100 organizations across all major operating systems, covering thousands of endpoints. LimaCharlie's Binary Library (BinLib)[6] feature captures a single, unique copy of any file that generates a CODE_IDENTITY[7] event (like a binary execution or DLL load).
This is a key detail: if the same code executes on a thousand systems, BinLib only stores one copy. This has a major impact on false positives. If a benign tool deployed everywhere creates a match, it generates a single alert, not a flood of identical ones from every machine. This nuance is important for understanding why my real-world findings may differ from the researchers'.
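As a rough illustration of that deduplication effect, the sketch below tracks seen SHA256s so each unique binary is evaluated (and can alert) at most once; the event shape and the check_tlsh helper are hypothetical, not the actual BinLib implementation.

```python
# Rough illustration of BinLib-style deduplication: the same binary seen on
# a thousand hosts results in at most one TLSH evaluation and one alert.
# The event dict and check_tlsh() callable are hypothetical.
seen_sha256 = set()

def on_code_identity(event, check_tlsh):
    sha256 = event["sha256"]
    if sha256 in seen_sha256:
        return None              # this exact binary was already evaluated
    seen_sha256.add(sha256)
    return check_tlsh(event)     # at most one comparison per unique file
```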
LimaCharlie's BinLib provides rich metadata for each unique binary, including its TLSH hash:
{
"action": "first_seen",
"file_path": "C:\\Windows\\System32\\mshtml.dll",
"mtd": {
"imp_hash": "48eb7011e604f6dd818967576f19dd15",
"res_company_name": "Microsoft Corporation",
"res_file_description": "Microsoft (R) HTML Viewer",
"res_product_name": "Internet Explorer",
"res_product_version": "11.00.26100.4343",
"sha256": "62f101b4f93d9b5d01b81a701e678dfc254a314c09ced2d47e2f1a4974dc8189",
"sig_authentihash": null,
"sig_issuer": null,
"sig_serial": null,
"sig_subject": null,
"size": 24117248,
"tlsh_hash": "aa377b2a26f451c9d5b6e038865b8f4aebb27c25233147cb016179791f377e16a3e3b0",
"type": "pe"
},
"needs_fetch": false,
"op_id": "753edaca-a994-4dbc-b6a4-bab90041c192",
"sha256": "62f101b4f93d9b5d01b81a701e678dfc254a314c09ced2d47e2f1a4974dc8189",
"sid": "9467afec-859d-41f0-b7f3-44252ddf4ead"
}
When CelesTLSH sees these first_seen events, it takes the tlsh_hash and compares it against my curated database. If a match is found, the extension creates an event with the name of the matched threat, its similarity score, and other contextual metadata, with the option to alert on it. By default, it generates an alert for any match with a distance score of 50 or less, though this is adjustable.
Most organizations using this are under 1,000 endpoints. At the time of this writing, the CelesTLSH database contains 102,422 unique TLSH hashes from 311 malware families and attack tools, and every new binary is checked against this set.
This isn't just theory for me. Before building CelesTLSH, I was a lead detection engineer at Target, a Fortune 100 company, and helped implement TLSH detection at scale across hundreds of thousands of endpoints. My perspective is shaped by seeing this work in both a massive enterprise and a large number of smaller and mid-size environments that are more in line with an average MSSP. You can read about the work at Target in a blog post they published after I left.[8]
Nuance in False Positives
Anecdotally, my findings differ slightly from the paper's. The false positive (FP) rate for TLSH is much higher than the distance comparison charts from its creators would have you believe. However, I don't think the FP or false negative (FN) rates are high enough to write off fuzzy hashing, and it would be a mistake to dismiss these algorithms from your detection program.
The key is nuance.
First, it matters what you scan. The researchers compared malicious binaries to all known-good binaries on various operating systems. This is like a full traditional antivirus file system scan, comparing every file on disk against a threat database, which naturally increases the odds of a false positive.
In contrast, many modern EDRs only evaluate binaries that have actually executed. This is how LimaCharlie's BinLib works, and it only captures one unique copy of each file. I've been consistently surprised to see fewer than ~50,000 unique binaries a day across all monitored organizations—a number I expected to be much higher. A single Windows OS can have nearly that many binaries on disk. By limiting TLSH enrichment to only files that execute, you dramatically reduce the comparison surface and the potential for false positives.
Second, context is everything. A TLSH match rarely exists in a vacuum; it almost always comes with other file metadata that you can use to increase detection fidelity. For example (a combined sketch follows this list):
- Correlate with other hashes. Requiring a TLSH match to occur with a matching Imphash[9] or Telfhash[10] significantly reduces false positives. This is the method most likely to cause false negatives, as many tools have built-in Imphash evasion, but it can be effective for specific tools with low Imphash diversity.
- Compare like with like. Only compare hashes from similar file types. For example, if your malicious hash comes from an ELF binary, only check it against other ELF binaries.
- Be targeted. The smaller your "known bad" list is, the lower your chance of a false positive. Focus on your detection gaps. If you struggle to detect a tool like BruteRatel, TLSH can provide another layer of coverage.
- Filter by Code Signing Signature. A legitimate Microsoft signature is a strong signal of a benign binary. While attackers can steal certificates, a stolen Microsoft cert burned in your environment means you're dealing with a sophisticated threat far beyond a fuzzy hash alert. For most scenarios, you can filter out properly signed Microsoft binaries to reduce noise.
- Leverage location. Certain directories (C:\Users\Public, %APPDATA%, /tmp) are abused more than others. Limiting TLSH checks to files executing from these high-risk locations can greatly improve fidelity.[11][12][13]
- Check OriginalFileName. Legitimate Microsoft binaries often contain an OriginalFileName metadata field.[14] Attackers rarely set this value, and its presence can be a strong signal of legitimacy.
- Maintain a SHA256 allow list. In a model like LimaCharlie's BinLib where only unique binaries are stored, you only need to check a file against the allow list once. This prevents the list from growing uncontrollably and makes it cheaper and more performant than using massive external databases.
- Note: I tried using the NIST National Software Reference Library (NSRL)[15] as an allow list. It proved less effective than I hoped, as hashes for things like Windows updates were not updated quickly enough. The database is also massive, making it expensive to store and query. It was cheaper, easier, and faster to maintain my own set.
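Taken together, these filters can sit in front of the distance check. The sketch below shows one possible combination under the BinLib-style event shown earlier; HIGH_RISK_DIRS, sha256_allow_list, threat_db, and the Microsoft-signature check are illustrative assumptions, not CelesTLSH's actual logic.

```python
# One possible way to stack the filters above before a TLSH distance check.
# Assumes py-tlsh; the event mirrors the BinLib metadata shown earlier, and
# sha256_allow_list / threat_db / HIGH_RISK_DIRS are hypothetical inputs.
import tlsh

HIGH_RISK_DIRS = ("\\users\\public\\", "\\appdata\\", "/tmp/")

def evaluate(event, sha256_allow_list, threat_db, max_distance=50):
    mtd = event["mtd"]
    # 1. Allow list: each unique binary only needs to pass this check once.
    if event["sha256"] in sha256_allow_list:
        return None
    # 2. Signature filter: skip properly signed Microsoft binaries.
    if mtd.get("sig_subject") and "Microsoft" in mtd["sig_subject"]:
        return None
    # 3. Location filter: only evaluate executions from high-risk paths.
    path = event.get("file_path", "").lower()
    if not any(d in path for d in HIGH_RISK_DIRS):
        return None
    # 4. Compare like with like: only check hashes of the same file type.
    candidates = [(fam, h) for fam, h, ftype in threat_db if ftype == mtd["type"]]
    # 5. Distance check against a deliberately small known-bad set.
    matches = [(fam, tlsh.diff(mtd["tlsh_hash"], h)) for fam, h in candidates]
    matches = [m for m in matches if m[1] <= max_distance]
    return min(matches, key=lambda m: m[1]) if matches else None
```

The ordering matters: the cheap allow-list, signature, and path checks eliminate most binaries before any distance calculation has to run, which is what keeps the approach affordable.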
The Future is Enrichment
There are many ways to increase TLSH detection fidelity. When used in tandem, these techniques make it a viable and powerful tool, which is why I don't entirely agree with the paper's broad conclusion about its unsatisfactory performance.
However, as I argued in a previous post and earlier in this one, the best use of fuzzy hashing isn't for standalone alerts. This brings us to the same conclusion the researchers reached:
Overall, the algorithms performed poorly in malware family classification. We suggest using them only as an assisting factor in more sophisticated classification models.
Which is precisely what makes this a wonderful research paper. The authors took a real problem, highlighted its weaknesses, and accurately pointed practitioners toward a method that is more likely to lead to improved detection.
The future of fuzzy hashes is enrichment. The more context you can apply to an event, the higher the fidelity of your detection model. In that role, I believe fuzzy hashes play a unique and necessary part in a modern detection program.
I hope to explore other uses for TLSH soon, such as in threat intelligence programs to find new malware variants that are evading current signatures. This, in turn, can lead to entirely new, non-TLSH-based detections. I look forward to hearing your feedback!
Sources
Bytewise approximate matching: Evaluating common scenarios for executable files
Stop Using Hashes for Detection (and When You Should Use Them)
You can read about the work at Target in a blog post they published after I left.
Cyber Triage - Intro to ImpHash for DFIR: “Fuzzy” Malware Matching
TrendMicro - Telfhash: An Algorithm That Finds Similar Malicious ELF Files Used in Linux IoT Malware
Google - The Spies Who Loved You: Infected USB Drives to Steal Secrets