Note Author: zhangyy333
This note surveys classic work on malicious domain name detection. "Malicious domain name" is used here in a broad sense: it does not denote one specific type of domain but is a general term for the many kinds of domain names involved in malicious activity. Because malicious activities are diverse, the behavior of malicious domains varies as well, which keeps detection challenging.
Highly Predictive Blacklisting (2008)
Zhang J, Porras P A, Ullrich J. Highly Predictive Blacklisting[C]//USENIX Security Symposium. 2008: 107-122.
This 2008 paper on blacklist prediction is the earliest work I found on the prediction and detection of malicious domain names, and for that reason it differs considerably from the papers that follow. It does not use traffic features; it uses the relationship between attackers and their targets. Nor does it use the passive DNS (PDNS) data [] proposed in 2005, which is widely used in the later papers; the method itself makes clear why the authors did not need PDNS. The core idea, relevance ranking, is that attackers are closely related to the users of a particular blacklist. The known relationships between users and attackers are therefore used to infer implicit relationships, and these are what allow the blacklist to be predicted. The relevance propagation described in the paper resembles a Markov process, but the authors compute it with Google's PageRank.
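To make the propagation idea concrete, here is a minimal sketch of generic PageRank computed by power iteration over a weighted graph. It is not the paper's exact formulation: the graph layout (a dict of weighted edges between blacklist users), the function name `propagate_relevance`, and the damping value 0.85 are all assumptions made for illustration.

```python
def propagate_relevance(graph, damping=0.85, iterations=50):
    """Generic PageRank by power iteration (illustrative, not the paper's code).

    graph maps each node to a dict {neighbor: edge_weight}.
    Returns a dict mapping every node to its rank; the ranks sum to 1.
    """
    nodes = set(graph)
    for neighbors in graph.values():
        nodes.update(neighbors)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Every node receives the same "teleport" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out_edges = graph.get(u, {})
            total = sum(out_edges.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
            else:
                # Pass rank along outgoing edges in proportion to their weight.
                for v, w in out_edges.items():
                    new_rank[v] += damping * rank[u] * w / total
        rank = new_rank
    return rank
```

A node that many other nodes point to ends up with a higher score, which is the intuition behind ranking attack sources by their relevance to a given blacklist user.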
Building a Dynamic Reputation System for DNS (2010)
Antonakakis M, Perdisci R, Dagon D, et al. Building a dynamic reputation system for dns[C]//USENIX security symposium. 2010: 273-290.
Notos was proposed in 2010. Its core idea is that if a domain name or its IP addresses are associated with known malicious domains, the domain is itself likely to be malicious.
Data sources: passive DNS data, DNS traffic collected at ISP recursive resolvers, and traffic provided by SIE. Prior knowledge about domains comes from malware executed in a controlled environment, spam, and the top domains listed by alexa.com.
SIE (Security Information Exchange): a project of the Internet Systems Consortium (ISC). For some reason its official website, https://sie.isc.org/, has been shut down.
The system architecture is shown in the figure below.
The paper first introduces its terminology.
For a target domain name d, TLD(d), 2LD(d) and 3LD(d) are its top-level, second-level and third-level domains.
Zone(d) is the set of domain names that end in d, including d itself.
$D = \{d_1, d_2, \ldots, d_m\}$ is a set of domain names, and A(D) is the set of IP addresses that the domains in D resolve to.
For an IP address a, BGP(a) is the set of all IPs in the BGP prefix of a, and AS(a) is the set of IPs in the autonomous system (AS) where a resides.
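The domain-level notation can be sketched with a few helper functions. This is only an illustration under simplifying assumptions: the names `normalize`, `nld` and `zone` are hypothetical, and labels are split naively on dots, so multi-label public suffixes such as co.uk are not handled.

```python
def normalize(domain):
    """Lowercase a domain name and drop a trailing root dot."""
    return domain.lower().rstrip(".")


def nld(domain, n):
    """Return the n-th level domain (n=1 is the TLD), or None if the name is too short."""
    labels = normalize(domain).split(".")
    if len(labels) < n:
        return None
    return ".".join(labels[-n:])


def zone(domain, known_domains):
    """Zone(d): every known domain that equals d or ends in '.d'."""
    d = normalize(domain)
    return {
        name
        for name in map(normalize, known_domains)
        if name == d or name.endswith("." + d)
    }
```

For example, `nld("www.example.com", 2)` gives the second-level domain and `zone("example.com", observed)` selects the observed names underneath it; the leading dot in the suffix check keeps unrelated names such as notexample.com out of the zone.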
Two kinds of related data are defined in the paper:
• Related historical IPs (RHIPs): mainly A(d), A(Zone(2LD(d))) and A(Zone(3LD(d))).
• Related historical domain names (RHDNs): the domain names that have resolved to an IP in AS(A(d)).
RHIPs extend the target domain to the domains related to it; using their IPs as a data set describes the target domain more completely.
RHDNs take the AS-level view and capture how strongly malicious domains cluster at the AS level.
Based on these two kinds of data, the paper defines three types of features: network-based, zone-based and evidence-based.
1. Network-based features describe how the network resources behind domain d and its IPs are distributed across operators. Malicious domains and IPs tend to be short-lived, highly agile, and to change frequently. Three groups of features are used:
• BGP features: the number of distinct BGP prefixes in BGP(A(d)), the number of countries those prefixes are in, and the number of organizations that own them; the number of distinct IP addresses in A(3LD(d)) and A(2LD(d)); the number of distinct BGP prefixes in BGP(A(3LD(d))) and BGP(A(2LD(d))), and the number of countries they are in.
• AS features: the number of distinct ASes related to AS(A(d)), AS(A(3LD(d))) and AS(A(2LD(d))).
• Registration features: the number of distinct registrars of the IPs in A(d) and the diversity of their registration dates; the same two quantities for the IPs in A(3LD(d)) and A(2LD(d)).
As the paper's introduction makes clear, "network-based" refers to the network infrastructure that hosts the domain or IP, that is, a macro view of how it is distributed across networks, not to traffic-level network conditions. Registration features are also used here, but only a few kinds; a later paper uses many more registration-related features.
2. Zone-based features rest on the idea that the domain names associated with a legitimate service are strongly similar to one another, such as google.com and googlewave.com, whereas malicious domain names are more likely to be randomly generated. Two groups of features measure the diversity of the domain names in RHDNs:
• String features: the number of distinct domain names in RHDNs, the mean and standard deviation of their length, and the mean, median and standard deviation of the frequency distributions of single characters, 2-grams and 3-grams.
• TLD features: the number of distinct TLDs in RHDNs, the ratio of .com domains to non-.com domains, and the mean, median and standard deviation of the occurrence frequency of the TLDs.
Statistical features of the characters in domain names are used here; more features of this kind appear in the later papers.
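The n-gram statistics among the string features can be sketched as follows. This is a rough illustration, not the paper's implementation: the function name `ngram_frequency_stats` is hypothetical, and counting n-grams over the raw name (dots included) and using the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, median, pstdev


def ngram_frequency_stats(domains, n):
    """Return (mean, median, std) of n-gram occurrence counts over a set of names."""
    counts = Counter()
    for name in domains:
        # Slide a window of width n over the name and count each n-gram.
        for i in range(len(name) - n + 1):
            counts[name[i:i + n]] += 1
    frequencies = list(counts.values())
    if not frequencies:
        return (0.0, 0.0, 0.0)
    return (mean(frequencies), median(frequencies), pstdev(frequencies))
```

Calling it with n = 1, 2 and 3 on the names in RHDNs yields nine values. The intuition is that names belonging to one legitimate service reuse the same substrings, while randomly generated names spread their counts more evenly.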
3. Evidence-based features rest on the idea that a domain associated with known malicious domains is more likely to be malicious itself. The paper also points out that such an association does not directly mark a domain as malicious: it is only one part of the feature set, and the verdict is produced after all features have been computed. Two groups of features are used:
• Honeypot features: the number of malware samples associated with IP addresses in BGP(A(d)), and the number associated with IP addresses in AS(A(d)).
• Blacklist features: the number of blacklisted IPs in A(d), in BGP(A(d)) and in AS(A(d)).
Taken as a whole, the features in this paper cover network properties, properties of the domain name itself, and associations with existing knowledge, which is a very wide scope. The features in later papers almost all fall within these aspects, only at different levels and from different angles.
EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis (2011)
Bilge L, Kirda E, Kruegel C, et al. EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis[C]//NDSS. 2011: 1-17.
This paper was published after Notos. The difference is that EXPOSURE pays more attention to the characteristics of the DNS traffic itself, rather than the network-level features used in Notos.
A notable property of EXPOSURE: it is a passive DNS analysis method, so it is stealthy and will not be noticed by attackers.
The basic idea of EXPOSURE is that when a malicious domain is put into use, its traffic changes abruptly in a way that differs from normal traffic. Detection is based on these changes in the time series.
Data source: DNS traffic provided by SIE.
The system architecture is shown in the figure below
The features are divided into four groups:
• Time-based features
1. Short life: the paper defines a short-lived domain as one that is queried only between times t0 and t1.
2. Daily similarity: similarity of the daily request pattern, such as requests increasing or decreasing at the same intervals every day.
3. Repeating patterns: patterns that recur regularly.
4. Access ratio: how the domain is accessed.
• DNS answer-based features (malicious domains and IPs are in a many-to-many relationship)
1. Number of distinct IP addresses of the domain within the window.
2. Number of distinct countries those IPs are in.
3. Results of reverse DNS queries for the IPs.
4. Number of domains that share IPs with the target domain.
• TTL-based features
1. Mean TTL.
2. Standard deviation of the TTL.
3. Number of distinct TTL values.
4. Number of TTL changes.
5. Proportion of TTL values in specific ranges: the authors found that some malicious domains set their TTLs in clearly regular ways.
• Domain name-based features
1. Proportion of digits in the domain name.
2. Length of the longest meaningful string.
The features in this paper focus on regularities in how the target domain is resolved, and on patterns that arise from its purpose. From the network-level analysis in Notos to the analysis of the domain name and its behavior here, the research becomes more targeted and fine-grained.
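A few of the simpler features above (TTL statistics and the proportion of digits) can be sketched in a handful of lines. The function names, the input format (a time-ordered list of observed TTL values) and the choice of the population standard deviation are assumptions made for this illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def ttl_features(ttls):
    """Summarize a time-ordered, non-empty list of TTL values seen for one domain."""
    # A "change" is any position where the TTL differs from the previous one.
    changes = sum(1 for prev, cur in zip(ttls, ttls[1:]) if prev != cur)
    return {
        "mean": mean(ttls),
        "stdev": pstdev(ttls),
        "distinct": len(set(ttls)),
        "changes": changes,
    }


def digit_ratio(domain):
    """Proportion of digits among the characters of a domain name (dots excluded)."""
    name = domain.replace(".", "")
    if not name:
        return 0.0
    return sum(ch.isdigit() for ch in name) / len(name)
```

A domain whose TTL flips between a few low values will show many changes with few distinct values, which is the kind of regularity the TTL features are meant to expose.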
Detecting Malware Domains at the Upper DNS Hierarchy (2011)
Antonakakis M, Perdisci R, Lee W, et al. Detecting Malware Domains at the Upper DNS Hierarchy[C]//USENIX security symposium. 2011, 11: 1-16.
This paper, which introduces Kopis, was also published in 2011, but its focus and method are very different from EXPOSURE.
Compared with Notos and EXPOSURE, it introduces new traffic features and uses DNS traffic from a higher level of the DNS hierarchy.
Data source: DNS traffic of two major domain name registrars and DNS traffic of the country-code top-level domain .ca.
The system architecture is shown in the figure below. Its differences from Notos and EXPOSURE stem mainly from the level of the DNS hierarchy at which the data is collected.
The features in this paper are mainly statistical and are defined quite formally. The notation is introduced first.
The traffic is split into batches (epochs) $\{E_i\}_{i=1,\ldots,m}$. The paper uses one day as an epoch.
The goal is a function $f(d, E_i) = \mathbf{v}_d^i$ that maps domain name d in the i-th epoch to a feature vector $\mathbf{v}_d^i$.
For a domain name d, each DNS query $q_j$ and its response $r_j$ are summarized as a tuple $Q_j(d) = (T_j, R_j, d, IPs_j)$, where $T_j$ is the epoch, $R_j$ is the IP address of the requester that issued $q_j$, d is the queried domain name, and $IPs_j$ is the set of IPs resolved in the response $r_j$.
There are three groups of features:
1. Requester diversity describes how the hosts that query a domain are distributed. The authors point out that the pool of requester IPs for a popular domain is diverse and consistent from day to day, whereas for a malicious domain it is diverse but inconsistent. The IPs in $\{R_j\}_{j=1,\ldots,m}$ are mapped to their BGP prefix, AS and country code (CC), and the mean, standard deviation and variance of the occurrence frequencies of each are computed.
2. Requester profile. Large networks contain more infected machines, so requesters that query malicious domains tend to have higher weights than requesters of benign domains. Let $c_{t,k}$ be the number of distinct domain names requested by requester $R_k$ in epoch $E_t$. The weight of the requester is $w_{t,k} = \frac{c_{t,k}}{\max_{l=1}^{|R|} c_{t,l}}$, where $|R|$ is the number of requesters. For the target domain in $E_t$, the mean, the biased and unbiased variance, and the standard deviation are computed. To take history into account, the counts are multiplied by past weights: $WC_t(d) = \{c_{t,k} \cdot w_{t-n,k}\}_k$, where $w_{t-n,k}$ is the weight taken from the historical data n epochs before $E_t$.
3. Resolved-IPs reputation considers the reputation of the IPs that the domain resolves to, together with the reputation of the BGP prefixes and ASes that contain those IPs. Reputation is measured by counting overlaps with known malware samples, the Spamhaus Block List and a whitelist.
Kopis focuses on the relationship between a malicious domain and its requesters. The requesters of a malicious domain are infected machines, so they may be widely distributed but are inconsistent from day to day. A popular domain has a very large user base, so its requesters stay consistent.
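The requester weighting can be illustrated with a short sketch: each requester's count of distinct queried domains is divided by the largest such count in the epoch, and the current counts are then multiplied by weights from an earlier epoch. The function names and the dict-based inputs are hypothetical, and giving a requester with no history a weight of 0 is an assumption of this sketch.

```python
def requester_weights(distinct_counts):
    """Map each requester to its count divided by the maximum count in the epoch."""
    top = max(distinct_counts.values(), default=0)
    if top == 0:
        return {requester: 0.0 for requester in distinct_counts}
    return {requester: count / top for requester, count in distinct_counts.items()}


def weighted_counts(current_counts, past_weights, requesters):
    """Multiply current counts by historical weights for the requesters of one domain."""
    return [
        current_counts.get(r, 0) * past_weights.get(r, 0.0)  # unseen requester: weight 0
        for r in requesters
    ]
```

The statistics named in the text (mean, variance, standard deviation) would then be taken over the list returned by `weighted_counts` for the requesters of the target domain.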
PREDATOR: Proactive Recognition and Elimination of Domain Abuse at Time-Of-Registration (2016)
Hao S, Kantchelian A, Miller B, et al. PREDATOR: proactive recognition and elimination of domain abuse at time-of-registration[C]//Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016: 1568-1579.
PREDATOR was proposed in 2016. It relies on registration features, which gives it near-real-time detection capability and greatly reduces the delay in detecting malicious domains.
The main idea is that attackers need large numbers of domain names to keep their operations profitable and flexible, which leads to registration behavior that differs from that of normal users.
The system architecture is shown in the figure below
Data source: changes to the largest top-level domain, .com, from March to July 2012, taken from the DNZA files provided by Verisign, which record new domain registrations and updates to authoritative name servers and IP addresses.
The features fall into three groups:
1. Domain profile features are tied directly to the domain being registered. Some were used in earlier work and some are proposed by the authors. The data is taken from the zone file within 5 minutes of the domain's registration.
• Registrar: attackers may prefer particular registrars, for example because they are cheap or ignore complaints.
• Authoritative name servers: attackers also show patterns here, such as running their own name servers.
• IP addresses and ASes of the name servers.
• Registration time: whether attackers register at regular times, such as a particular hour of the day or day of the week.
• Length of the registration period.
• Trigrams in the domain name.
• Proportion of the name covered by the longest English word.
• Whether the name contains digits.
• Whether the name contains "-".
• Length of the domain name.
• Edit distance to known malicious domain names.
2. Registration history features exploit the strong correlation between a registration and the domain's past. The paper uses features of the most recent previous registration.
• Life cycle. Three stages are defined: brand-new (the domain is registered for the first time) and two kinds of re-registration after expiry, namely drop-catch (re-registered immediately after expiry) and retread (re-registered some time after expiry).
• Dormancy period before re-registration.
• Previous registrar.
• Whether the domain is re-registered through the same registrar.
3. Batch correlation features also describe registration and look for abnormal registration behavior. The paper groups the domains registered at a given registrar within the same five-minute window.
• Probability of each batch size (modeled with a compound Poisson distribution).
• Proportion of each life cycle stage (brand-new, drop-catch, retread) within the batch.
• Cohesiveness of the names in each batch, measured by the edit distances between the domain names in the batch.
The most distinctive aspect of this paper is that it uses abnormal registration behavior to reach a verdict before the malicious domain is put to use.
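Several of the lexical domain profile features are easy to sketch, including the edit distance to known malicious names. The code below is an illustration only: `edit_distance` is the standard Levenshtein distance, while `lexical_features` and its output keys are hypothetical names, and the list of known malicious names is assumed to be supplied by the caller.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                  # delete ch_a
                current[j - 1] + 1,               # insert ch_b
                previous[j - 1] + (ch_a != ch_b)  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def lexical_features(name, known_bad):
    """Name-based features for one newly registered domain label."""
    return {
        "length": len(name),
        "has_digit": any(ch.isdigit() for ch in name),
        "has_hyphen": "-" in name,
        # None when no known malicious names are available for comparison.
        "min_edit_distance": min(
            (edit_distance(name, bad) for bad in known_bad), default=None
        ),
    }
```

A small minimum edit distance suggests a name that imitates or was generated alongside an already known malicious name. The same `edit_distance` function could also be applied pairwise within a batch to measure how cohesive its names are.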
HinDom: A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification (2019)
Sun X, Tong M, Yang J, et al. HinDom: A Robust Malicious Domain Detection System based on Heterogeneous Information Network with Transductive Classification[C]//22nd International Symposium on Research in Attacks, Intrusions and Defenses ({RAID} 2019). 2019: 399-412.
This paper was published at RAID 2019 by authors from Tsinghua University. It uses a heterogeneous graph to detect malicious domain names.
It is the first work to apply a heterogeneous information network (HIN) to DNS analysis and malicious domain detection. Its algorithm is a meta-path-based transductive classification method, which achieves good results even when the labeled data set is small.
Data sources:
• CERNET2 (China Education and Research Network): traffic captured at the Tsinghua node;
• TUNET (Tsinghua campus network): resolution logs of the central DNS servers;
• 360 PDNS data set.
The basic assumptions of the paper are that domain names strongly associated with known malicious domains are likely to be malicious, and that attackers can forge domain names but cannot easily change the associations between them.
Based on these assumptions, the paper defines three types of entities and six types of relations in the data.
The three entity types are client, domain name and IP address.
The six relation types are:
• client queries domain (client-query-domain)
• clients in the same network segment (client-segment-client)
• domain resolves to IP address (domain-resolve-IP)
• similarity between domain names (domain-similar-domain); to compute it, domain name strings are turned into character vectors with n-grams, clustered with the k-means algorithm, and the clustering result is converted into a matrix S
• CNAME relation between domains (domain-CNAME-domain)
• a single domain name corresponds to multiple IPs (IP-domain-IP)
Five filtering rules are also defined to remove likely noise from the data:
• special domains
• popular domains: domain names queried by more than 25% of clients
• large clients
• inactive clients
• rare IPs
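The popular-domain rule is the only one stated with a concrete threshold, so it can be sketched directly. The input format (an iterable of client and domain pairs) and the function name `popular_domains` are assumptions for this illustration; only the 25% threshold comes from the text.

```python
def popular_domains(queries, threshold=0.25):
    """Return the domains queried by more than `threshold` of all observed clients.

    queries is an iterable of (client, domain) pairs.
    """
    clients = set()
    clients_by_domain = {}
    for client, domain in queries:
        clients.add(client)
        clients_by_domain.setdefault(domain, set()).add(client)
    if not clients:
        return set()
    return {
        domain
        for domain, seen in clients_by_domain.items()
        if len(seen) / len(clients) > threshold
    }
```

Domains returned by this function would be pruned from the graph before classification, since a domain that a large share of clients query connects too many nodes to be informative.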
The figure shows the architecture of HinDom, which consists of five modules: data collector, HIN building module, graph pruning module, meta-path combination module and transductive classifier.
The data collector gathers:
(1) DNS server logs, with fields such as source IP, queried domain name and time. Logs from caching DNS servers are widely used to learn "who queried which domain name".
(2) DNS traffic, with record types such as NS, MX, TXT and PTR. Because the data is sensitive, it is not publicly available.
(3) Passive DNS data sets. PDNS data contains no client information, only the timestamps at which a domain name was first and last seen and the number of domain-IP resolution pairs observed in that period.
The HIN building module represents the six defined relations abstractly, as shown in the following figure. The semantics of the resulting meta paths P1 to P6 are:
(1) P1: benign and malicious domain names differ in their character distributions, and malicious domain names of the same family have similar character patterns.
(2) P2: the CNAME domain of a benign domain is not malicious, and vice versa.
(3) P3: clients infected by the same attacker query partially overlapping sets of domains, and normal clients do not query those domains.
(4) P4: domain names that resolve to the same IP address belong to the same class.
(5) P5: adjacent clients are vulnerable to the same attacks.
(6) P6: attackers are likely to reuse domain name and IP address resources.
Once the meta paths are constructed, the similarity between nodes is measured with the PathSim algorithm. Because the meta paths differ in how much they contribute to identifying malicious domains, the authors weight the similarity matrix of each meta path and combine them into a fused similarity matrix. The weights are computed with the Laplacian score.
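PathSim along a single symmetric meta path can be sketched in pure Python. The example assumes a 0/1 domain-by-IP adjacency matrix and the meta path domain-IP-domain; the function names are hypothetical and the weighted fusion of several meta paths is not shown. The score is twice the number of path instances between x and y, divided by the number of path instances from x to itself plus those from y to itself.

```python
def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    b_columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def pathsim(adjacency):
    """PathSim between row entities along the symmetric path row-column-row."""
    # commuting[i][j] counts the path instances from entity i to entity j.
    commuting = matmul(adjacency, transpose(adjacency))
    n = len(commuting)
    similarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denominator = commuting[i][i] + commuting[j][j]
            if denominator:
                similarity[i][j] = 2 * commuting[i][j] / denominator
    return similarity
```

With rows as domains and columns as IPs, two domains that resolve to exactly the same set of IPs get similarity 1, and two domains with no IP in common get 0. Normalizing by the self-path counts keeps a domain with very many IPs from looking similar to everything.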
In the experiments, HinDom still classifies well (accuracy: 0.9626, F1: 0.9116) when only 10% of the samples are labeled.
However, misclassifications occur mainly in the benign class and in malicious classes with few samples, which is caused by data skew.
Compared with other methods, HinDom's accuracy and F1 score degrade more slowly as noise in the data increases, that is, it is more robust to noise.
This is the first paper to apply graphs to malicious domain name detection. Relationships between malicious domain names are hard to forge and hard to eliminate. Graph methods such as graph neural networks are also developing rapidly, and I believe there will be more applications of this kind in the future.