Charles Leaver – Why Using Edit Distance Is Essential Part One

Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften


Why do cyber criminals use the same techniques over and over? The simple answer is that they continue to work. For instance, Cisco’s 2017 Cybersecurity Report tells us that after years of decline, spam email with malicious attachments is again on the rise. In that conventional attack vector, malware authors usually mask their activities by using a filename similar to a common system process.

There is not necessarily a connection between a file’s path name and its contents: anybody who has tried to hide sensitive information by giving it a dull name like “taxes”, or changed the extension on a file attachment to get around email rules, knows this idea. Malware authors understand this too, and will frequently name their malware to resemble common system processes. For instance, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. It’s easy even for experts to overlook this minor difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve using string functions and SQL sets.
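As a rough illustration of that idea outside the database, here is a minimal Python sketch. It assumes we already have a list of observed executable paths; the `EXPECTED_DIRS` allow-list and the sample paths are hypothetical and deliberately incomplete, not Ziften’s actual rules or data.

```python
import ntpath  # applies Windows path rules on any operating system

# Hypothetical allow-list: executable name -> directories where we expect it.
EXPECTED_DIRS = {
    "svchost.exe": {r"c:\windows\system32", r"c:\windows\syswow64"},
    "explorer.exe": {r"c:\windows"},
}

def unusual_locations(paths):
    """Return paths whose file name is known but whose directory is unexpected."""
    flagged = []
    for path in paths:
        name = ntpath.basename(path).lower()    # string function: file name
        folder = ntpath.dirname(path).lower()   # string function: directory
        # Set membership plays the role of the SQL set comparison.
        if name in EXPECTED_DIRS and folder not in EXPECTED_DIRS[name]:
            flagged.append(path)
    return flagged

# Made-up observations for illustration only.
observed = [
    r"C:\Windows\System32\svchost.exe",
    r"C:\Users\Public\svchost.exe",
    r"C:\Windows\explorer.exe",
    r"C:\Temp\notes.exe",
]
print(unusual_locations(observed))
```

Names that are not in the allow-list at all are ignored here; catching near-misses on the name itself is the harder problem discussed next.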

How about the other scenario, finding close matches to the executable name? Most people start their hunt for near string matches by sorting data and visually scanning for discrepancies. This usually works well for a small set of data, perhaps even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use edit distance, commonly defined (as Levenshtein distance) as the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.
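To make the idea concrete, here is a minimal Python sketch of the classic Levenshtein edit distance, using the standard dynamic-programming approach. It is only an illustration of the technique; the function name `edit_distance` is our own choice, and this is not the implementation we run in production.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

print(edit_distance("iexplorer.exe", "iexplore.exe"))
print(edit_distance("cvshost.exe", "svchost.exe"))
```

The function keeps only two rows of the table at a time, so memory stays proportional to the shorter string, which matters when comparing many names.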

What’s the best way to compute edit distance? For Ziften, our technology stack includes HP Vertica, which makes this task simple. The internet is full of data scientists and data engineers singing Vertica’s praises, so it is enough to point out that Vertica makes it easy to create custom functions that make the most of its power, from C++ power tools to analytical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It’s not an official offering, but the Vertica team is definitely aware of it, and is thinking every day about ways to make Vertica better for data scientists, so it is a great space to watch. Best of all, it includes a function to compute edit distance! There are also some other tools for natural language processing here, like word stemmers and tokenizers.

By using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting data set, as we can sort by distance to find the nearest matches over the whole data set, or we can sort by frequency of the top path to see what the nearest match is to our most commonly used processes. This data can also appear on contextual “report card” pages to show, e.g., the top five closest strings for a given path. Below is an example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.

Setting a threshold of 0.2 seems to produce good results in our experience, but the point is that this can be adjusted to fit specific use cases. Did we find any malware? We see that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless maybe you work for CVS drug store…) all look odd. Since we’re already in our database, it’s also trivial to pull the associated MD5 hashes, Ziften suspicion scores, and other attributes to do a deeper dive.
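The nearest-match and threshold steps can be sketched in a few lines of Python. A threshold of 0.2 only makes sense for a normalized score, so this sketch assumes the raw edit distance is divided by the length of the longer string; that normalization is our assumption for illustration. The `known` list and the helper names are hypothetical, and the Levenshtein function stands in for the database function.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Assumed normalization: distance divided by the longer length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

def near_matches(candidates, known, threshold=0.2):
    """Pair each candidate with its closest known name and keep the pairs
    that are close but not identical, sorted from nearest to farthest."""
    hits = []
    for name in candidates:
        best = min(known, key=lambda k: normalized_distance(name, k))
        score = normalized_distance(name, best)
        if 0 < score <= threshold:
            hits.append((name, best, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical list of common process names, plus the names discussed above.
known = ["teamviewer.exe", "iexplore.exe", "svchost.exe", "chrome.exe"]
seen = ["teamviewer_.exe", "iexplorer.exe", "cvshost.exe", "chrome.exe", "notepad.exe"]
for hit in near_matches(seen, known):
    print(hit)
```

Exact matches score 0 and are skipped on purpose, since a file with the correct name is not a lookalike. Raising the threshold catches more distant variants at the cost of more false positives.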

In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We helped the customer investigate further the user and system where we observed the portable applications, since use of portable apps on a USB drive could be evidence of illicit activity. The more disturbing find was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspicious file. Looking up the MD5 hash for this file on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan infection that could be a component of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to resolve the problem and make sure it stays resolved using Ziften’s ability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we continue to improve our ability to hunt for known patterns and old tricks. Just because new threats emerge doesn’t mean the old ones go away!

If you enjoyed this post, watch this space for part 2 of this series, where we will apply this approach to hostnames to detect malware droppers and other malicious websites.