Charles Leaver – Why Using Edit Distance Is Essential Part Two

Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften

 

In the first post about edit distance, we looked at hunting for malicious executables with edit distance (i.e., how many character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain name features to identify suspicious activity.

Here is the Background

What are bad actors doing with malicious domains? It may be as simple as using a close misspelling of a popular domain name to trick careless users into viewing ads or picking up adware. Legitimate websites are slowly catching on to this technique, sometimes called typosquatting.

Other malicious domains are the product of domain generation algorithms, which can be used to do all sorts of nefarious things, like evade countermeasures that block known compromised sites or overwhelm domain servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.

Edit distance can help with both use cases; let’s see how. First, we’ll exclude common domains, since these are generally safe. A list of common domains also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domains and avoid subdomains (e.g., ziften.com, not www.ziften.com).
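The cleaning step described above could be sketched roughly as follows. This is a minimal illustration, not the code used at Ziften: the `COMMON_DOMAINS` set, the `registered_domain` helper, and the `candidates` helper are all hypothetical names, and the "last two labels" rule is a simplifying assumption that breaks on suffixes such as .co.uk.

```python
# Hypothetical sample of a "common domains" list, e.g. taken from a popularity ranking.
COMMON_DOMAINS = {"ziften.com", "wikipedia.org", "google.com"}


def registered_domain(hostname: str) -> str:
    """Naively reduce a hostname to its last two labels (domain + TLD).

    Assumption: multi-label public suffixes (.co.uk and similar) are ignored;
    a real pipeline would consult the Public Suffix List instead.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def candidates(observed, common=COMMON_DOMAINS):
    """Return the sorted, de-duplicated domains that are NOT on the common list."""
    cleaned = {registered_domain(h) for h in observed if h.strip()}
    return sorted(cleaned - set(common))


if __name__ == "__main__":
    seen = ["www.ziften.com", "Wikipedal.org", "mail.google.com", "wikipedal.org"]
    print(candidates(seen))
```

Only the domains that survive this filter need to be compared against the common list, which keeps the later nearest-neighbor search small.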

After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, though now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domains that are one step away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in the first post is useful here), we can also find anomalous domains in the edit distance space.
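As a rough Python sketch of the nearest-neighbor search (not the original Vertica SQL), the function below compares a candidate only against common domains in the same top-level domain and keeps the closest one. Two assumptions are made here: the comparison uses the name without the TLD, and "normalized" means the distance divided by the longer of the two names. The helper names `split_domain` and `nearest_neighbor` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def split_domain(domain: str):
    """Split 'name.tld' into (name, tld) at the last dot."""
    name, _, tld = domain.rpartition(".")
    return name, tld


def nearest_neighbor(candidate: str, common_ranked):
    """Find the closest common domain in the same TLD.

    common_ranked is assumed to be ordered by popularity (rank 1 first).
    Ties go to the more popular domain. Returns None if no common domain
    shares the candidate's TLD.
    """
    name, tld = split_domain(candidate)
    best = None
    for rank, other in enumerate(common_ranked, start=1):
        other_name, other_tld = split_domain(other)
        if other_tld != tld or other == candidate:
            continue
        d = edit_distance(name, other_name)
        if best is None or d < best["distance"]:
            best = {
                "neighbor": other,
                "rank": rank,
                "distance": d,
                # Assumed normalization: divide by the longer name's length.
                "normalized": d / max(len(name), len(other_name), 1),
            }
    return best


if __name__ == "__main__":
    ranked = ["google.com", "wikipedia.org", "yahoo.com"]
    print(nearest_neighbor("gooogle.com", ranked))
```

A distance of 1 points to a likely typo of a popular domain, while a normalized distance close to 1 means the candidate resembles nothing on the common list.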

What were the Outcomes?

Let’s look at how these results appear in real life. Use caution when browsing to these domain names, since they may contain malicious content!

Here are a few possible typos. Typo-squatters target well-known domains because there is a better chance someone will visit. Many of these are suspect according to our threat feed partners, but there are some false positives too, with cute names like “wikipedal”.

Here are some odd-looking domain names far from their neighbors.

So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine-learning model: rank of the nearest neighbor, distance from that neighbor, and a flag for edit distance 1 from the neighbor, indicating a risk of typo tricks. Other features that might work well with these include lexical features like word and n-gram distributions, entropy, and string length, as well as network features like the total count of failed DNS requests.
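To make the feature list concrete, here is a small sketch that assembles the three edit distance features alongside two of the lexical ones (string length and character entropy). The `feature_row` function and its field names are hypothetical, and the neighbor rank and distances are assumed to have been computed already by a nearest-neighbor search.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy of s, in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def feature_row(name: str, neighbor_rank: int, neighbor_distance: int,
                normalized_distance: float) -> dict:
    """Bundle edit distance and lexical features for one candidate name.

    The inputs other than `name` are assumed to come from an earlier
    nearest-neighbor lookup against a ranked list of common domains.
    """
    return {
        "neighbor_rank": neighbor_rank,              # popularity rank of the nearest neighbor
        "neighbor_distance": normalized_distance,    # how far the candidate sits from it
        "is_one_edit_away": int(neighbor_distance == 1),  # typo-squatting flag
        "length": len(name),
        "entropy": shannon_entropy(name),
    }


if __name__ == "__main__":
    print(feature_row("gooogle", neighbor_rank=1, neighbor_distance=1,
                      normalized_distance=1 / 7))
```

Randomly generated names tend to score higher on entropy than dictionary-like names, which is one reason it may pair well with the distance features.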

Simplified Code that you can Play Around with

Here is a simplified version of the code to play with! It was written on HP Vertica, but this SQL will probably run on most advanced databases. Note that the Vertica editDistance function may differ in other implementations (e.g., levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).