Fuzzy Hashing

By Luther Heigl –
This research was initially conducted by: Yuping Li, Sathya Chandran Sundaramurthy, Alexandru G. Bardas, Xinming Ou, Doina Caragea, Xin Hu, and Jiyong Jang. Their complete findings can be found at https://www.usenix.org/system/files/conference/cset15/cset15-li.pdf.Hashing is a basic concept within the Information Security field, you can turn to any security professional and ask them to tell you about hashing, and at the least they’ll be able to spit back out one or two hashing functions, and probably a brief overview of how it works. Some will be able to go deeper into the description, and talk about something like the avalanche effect (changing one bit in the data will cause the entire resulting string to be different) or discuss some of the uses of hashing functions. There’s a subset of hashing, however; that many people aren’t aware of, because of how new it is. It’s so new that it consists almost entirely of proof of concept programs, with the first major research only being finished in 2015. This subset is called Fuzzy Hashing, because of how wildly new this concept is, I’m only going to be focusing on education for this post, as it’s an area of Information Security, that I believe will be instrumental in how we handle threats ranging from malware to phishing email detection going forward, and it would benefit the whole field for people to have a good working knowledge of fuzzy hashing.

This paper splits fuzzy hashing functions into two categories, block based hashing (BBH) and context triggered piecewise hashing (CTPH). The primary difference between the two types of functions is based off the process that it uses to generate the fingerprint (the final full hash) for BBH functions they will create a section of the fingerprint only after processing a specific amount of input. For example, if processing the Declaration of Independence, the BBH function would go through the text of the document, and after each 100 characters it would create a hash of those characters, then at the end it would concatenate the hashes together to form the fingerprint. For CTPH functions they process the file until reaching a specific context, usually a checksum matching a predefined value. Using the Declaration of Independence again, we could have the checksum match the value “He has”, then the function would go through the file and create a hash for each section beginning with the phrase “He has” and continuing to the next one, then it would concatenate all those hashes together to form the final fingerprint.

Just from the definitions of CTPH and BBH functions we can begin to paint a picture of why these fuzzy hashing functions will be so pivotal in detecting malware moving forward. Old school hashing just takes the entire file and creates a hash of it in one continuous piece, because of this, it’s easy to tell if the file has been changed at all, but it cannot tell us how much the file has changed, and because of this we can’t use the functions to determine values of similarity between files, so even if we’re given two samples of malware and both of them use slightly modified Zeus code we cannot tell, using hashing, that those files are both malicious. By using the techniques that BBH and CTPH functions use we can create slightly different hashes of the same length and use them to determine the level of similarity between the two fingerprints, and ultimately the two files.

The research paper goes on to list multiple different functions, of both BBH and CTPH classifications, and then identifies some core issues with the algorithms present in the functions. After doing so the authors created their own function, that they feel addresses those issues adequately and proceed to do testing on various malware samples to verify their hypothesis.

For these issues, they’ve identified, they talk specifically about ssdeep, and how its fixed size fingerprint does not allow for meaningful comparison between a smaller file and a significantly larger one. Additionally, the two primary areas they identified in existing functions as problematic are: the algorithm used to compute the distance in two BBH functions, sdhash and mvHash-B, and the results interpretation of certain functions, namely sdhash and ssdeep.

The issue they found within the distance computations is that the distance score ends up being non-deterministic when the order of inputs is changed, and in the case of mvHash-B, the order of inputs swaps if A is longer than B. The biggest issue with this is that by having a non-deterministic function it’s not possible to correctly measure the accuracy of any algorithm using a function like this one. As for the results interpretation, both of those functions claim to be able to be used in two manners. One for similarity analysis and one for fragment identification, the issue here is that they don’t provide separate modes for those, and the results will be ambiguous if it’s not specified. For example, we could have two fingerprints: “abc” and “abcdef”. When doing fragment identification on those two fingerprints it will return as 100%, however; if you’re trying to do a similarity analysis it should return as only 50%.

For addressing the results interpretation area, they suggest simply removing the double functionality from hashing functions, or creating a mode to determine which analysis it is using, and for the distance computation they have created a new algorithm that is included in the paper for future developers to use that they believe solves the issue of inconsistent results.

The remainder of the paper they use to describe the process behind creating a better fuzzy hashing algorithm, and then proving it’s better. To do this they identify four areas that, while not problematic in previous functions, they can modify to do better in the future. Some of those areas are: semantic vs low level features and bloom filters vs feature hashing.

Semantic vs low level features refer to how the algorithms extract features from the code bases, or binary sequences, that it’s given. Comparing low level binary sequences doesn’t give much meaningful information when comparing malware, especially when compared to the amount of information gained from looking at assembly instruction sequences.

Currently many fuzzy hashing functions use bloom filters, which is a bit array of m bits and maps each feature to multiple bits in the array. However; this method is inefficient when compared to feature hashing which maps one feature to one bit, the reason for this is that bloom filters create a lot of excess noise and can lead to bloated fingerprints which hinder comparisons.

The other insights they provide are important as well, but they do a much better job describing the differences than I would.

For this paper the authors created the nextGen-hash algorithm, and then ran performance tests against several other algorithms. They mapped these results out and separated them based on two areas: precision, and recall. Precision was defined as the ability to match malware samples by family and recall is the ability to match malware samples by cluster. I’ve included these results at the end, one thing to note is that the results are divided based on how they approached the separation of the file, specifically whether they gave the hashing function the whole binary or if they fed it pieces of code section one at a time.


Fuzzy Hashing Clustering Performance


Functions taking Whole Binary as Input


Functions taking Code Section as Input




Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s