Hacking Book | Free Online Hacking Learning


Insider Threat Detection Based on Multi-Domain Information Fusion

Posted by chiappelli at 2020-02-27

*Original author: Mu Qianzhi

Foreword

A belated New Year greeting to all FreeBuf friends: happy Year of the Monkey, and best wishes for a prosperous year ahead!

As the first post of the Year of the Monkey, we continue last year's theme of "insider threat detection" and introduce information fusion techniques in this field.


1. Why is information fusion needed?

2. Experimental data

3. Blend-in attack detection

4. Unusual-change attack detection

5. Experimental results

6. Summary

7. References

1. Why is information fusion needed?

The insider threat detection we discuss today can be traced back to early intrusion detection research; the difference is that the focus is internal rather than external.

The data sets used in threat detection usually come from multiple sources, i.e., they contain many data types: for example, HTTP records describing a user's network behavior, and logon records from the system. We treat these as two types of data, or data from different domains. Today we discuss how to use such multi-domain data when building a classifier.

Suppose we have collected user behavior data from multiple domains via sensors, for example:

Logon Data + Device Data + File Data + HTTP Data ... ...

So how do we build the initial data set from this data? A naive idea is to concatenate these different types of data directly, as above, but this approach has several problems.

First, the value ranges of features in different domains often differ. That is natural in itself, but if the raw values are used directly as features when building a classifier, some domains will have little influence on the result. For example, in K-means clustering, features with large value ranges dominate the squared distance, while those with small ranges contribute almost nothing, degrading the final classifier.

Second, in response to the above problem, we could normalize the data. But doing so implicitly assumes that all domains are equally important for behavior prediction; this artificial assumption may deviate from reality, and thus also limits the classifier.
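To make the two problems above concrete, here is a minimal, self-contained sketch (with made-up numbers, not data from the experiments) of how a large-range feature dominates Euclidean distance in K-means-style clustering, and how min-max normalization puts the domains on an equal footing:

```python
# Illustrative sketch: feature 0 (HTTP requests/day) has a range of hundreds,
# feature 1 (logons/day) a range of units, so raw distances are HTTP-driven.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def min_max_scale(rows):
    """Scale each feature column into [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

users = [[5200, 2],   # user 0
         [5100, 9],   # user 1: very different logon habits from user 0
         [4800, 3]]   # user 2: logon habits close to user 0

raw_d01 = euclidean(users[0], users[1])
raw_d02 = euclidean(users[0], users[2])
print(raw_d01 < raw_d02)   # True: the HTTP range alone decides the distances

scaled = min_max_scale(users)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])
print(scaled_d01 > scaled_d02)  # True: the logon difference now matters
```

After per-column normalization both domains contribute comparably, which is exactly the equal-importance assumption discussed above.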

Finally, even leaving aside the two problems above, simply concatenating the data of all domains yields very high-dimensional features, leading to overfitting or high computational cost.

Therefore, it is necessary to study information fusion methods that respect the structure of the data.

2. Experimental data

The experimental data set for the detection method introduced today is a subset of the ADAMS project data, jointly provided by CERT and Carnegie Mellon University in the United States, as shown below:

The generation of this data set is described in [4]. The data set is organized by release version; some versions have multiple subsets, such as r4.1 and r4.2. Later versions are generally supersets of earlier ones. The readme file in each version documents the data characteristics, while answer.tar.bz2 contains the malicious-segment information for each data set, convenient for training or testing.

The experimental data set is divided into two parts: one consists of audit records of 1,000 users' computer use, mainly used for the blend-in attack detection experiment with the multi-domain fusion method; the other comes from the actual computer-use data of 4,600 users. In both parts, attack data simulated from attack scenarios in real incidents is inserted into the normal data. The records fall into five categories:

1. Login and logout events;

2. Use of removable devices, such as USB drives, recording the device name and type;

3. File access events, i.e., the creation, copying, moving, overwriting, renaming and deletion of files; each record includes the file name, path, file type and the content accessed;

4. HTTP access events, mainly recording the URL, domain name, activity code (upload or download), browser information (IE, Firefox, Chrome) and whether the page is encrypted;

5. E-mail send and read events, recording the e-mail address, CC/forward addresses, subject, send time, body content, attachment information and whether the e-mail is encrypted.

Some raw samples from the above data set are shown below.

Readme file:

Logon / off file:

Device file:

Http file:

Email file:

Based on the five categories in the data set, we compute further statistical features, as shown in Figure 1 (note that "category" here means the same thing as "data domain"):

Each category's features are counts of some kind. For example, "logons" in the logon domain counts a user's logins within a specific time window, while "logons on user's PC" counts logins on the user's own PC within that window, and so on. When finally used, the features are organized by day, i.e., the domain feature sets above are described per (user, day) pair.
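This per-(user, day) feature construction can be sketched as follows; the event tuples and feature names below are illustrative, not taken from the actual data set:

```python
# Sketch: turn raw event logs into per-(user, day) count features.
from collections import Counter

def daily_counts(events, user_pcs):
    """events: (user, date, domain, pc) tuples; user_pcs: user -> own PC id.
    Returns {(user, date): Counter of per-domain count features}."""
    feats = {}
    for user, date, domain, pc in events:
        f = feats.setdefault((user, date), Counter())
        f[domain] += 1                    # e.g. "logons" in the logon domain
        if domain == "logon" and pc == user_pcs.get(user):
            f["logons_on_own_pc"] += 1    # logons on the user's own PC
    return feats

events = [
    ("alice", "2010-07-01", "logon", "PC-1"),
    ("alice", "2010-07-01", "logon", "PC-9"),
    ("alice", "2010-07-01", "http",  "PC-1"),
]
feats = daily_counts(events, {"alice": "PC-1"})
print(feats[("alice", "2010-07-01")]["logon"])            # 2
print(feats[("alice", "2010-07-01")]["logons_on_own_pc"]) # 1
```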

In the attack detection below, these domain features are used to detect the "Unusual-change" attack; for the "Blend-in" attack we construct features separately.

3. Blend-in attack detection

We first introduce the detection of the "blend-in" attack. In this attack, the attacker obtains login access to the internal network (steals a legitimate user's account) and then attempts to masquerade as that user while carrying out internal activities.

To detect this kind of attack, we propose the concept of consistency, with which we try to describe the consistency of user behavior across domains. Before continuing, we give two concepts:


This concept may not be as rigorous as a mathematical definition, but the idea behind it is simple: a user's behavior should be reflected in the data of every domain. Users with similar jobs and roles behave similarly, and should therefore look similar in every domain. This similarity can be expressed via group membership: if user A and the engineers belong to the same group in domains S1, S2, and so on, then we expect user A to belong to the same group as the engineers in a new domain S3 as well. "Group" here means a cluster in the clustering sense.

For intuition, refer to the figure below: users A, B and C start in the same cluster in domain 1, but in domain 2, B and C clearly remain together while A does not.


The second concept is mainly used for anomaly scoring of user behavior, e.g., adding +1 to the score on a prediction error (other scoring mechanisms are of course possible). The penalty coefficient can be adjusted dynamically according to the consistency of all users on domain Si: if all users show poor consistency on Si, the penalty is low; otherwise a larger penalty is used, so that a wrong prediction yields a higher anomaly score.

To understand a concept clearly we must understand its purpose, so we first note that inter-domain consistency is mainly used in the scoring mechanism. Next, how is this definition implemented in practice?

Below we follow the implementation steps to show what "consistency" looks like in practice.

Blend-in detection steps:

1. Cluster the original multi-domain data: we use K-means to cluster the data within each domain, obtaining the virtual "user groups", i.e., clusters, in each domain;

2. Fit a GMM to each domain's data: we assume each cluster within a domain corresponds to one mixture component, then use MLE to estimate the GMM of each domain;

3. Using each domain's GMM, compute the MAP (maximum a posteriori) cluster for a given user's data: given a user's record in a domain, we determine the most likely cluster under that domain's GMM;

4. Build the cluster vector Cu = (Cu1, Cu2, ..., Cum) for user u: with m domains, Cui is the MAP cluster in domain i, i.e., the cluster the user most likely belongs to;

5. When predicting Cui from Cuj (j != i) fails, an anomaly is recorded and scored; the penalty coefficient is as described above (the prediction is compared against the cluster of user u's group or peers);

6. Three concrete scoring variants can be used. The first is discrete features, discrete evaluation: both the features and the final score are discrete, and Hamming distance is used for scoring (a correct prediction scores 0, a wrong one +1). The second is discrete features, continuous evaluation: the features are discrete but the score is continuous; the score is essentially a density estimate, computed as 1 minus the likelihood that the user's cluster is predicted correctly. The third is continuous features, continuous evaluation: instead of using the MAP cluster as the feature, the posterior probabilities are used directly, and the predicted result is also a probability value.
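The discrete/discrete variant can be sketched as follows. This is only an illustrative reading: here we predict a user's cluster in each domain as the majority cluster of their peer group (an assumption on our part; the method itself predicts across domains and adjusts the penalty dynamically), and add +1 per mismatch, Hamming-style. All names and cluster ids are made up:

```python
# Hedged sketch of Hamming-style anomaly scoring over cluster vectors.
from collections import Counter

def hamming_scores(cluster_vectors, peers):
    """cluster_vectors: user -> list of MAP cluster ids, one per domain.
    peers: user -> list of peer users (same role/group)."""
    n_domains = len(next(iter(cluster_vectors.values())))
    scores = {}
    for user, vec in cluster_vectors.items():
        s = 0
        for i in range(n_domains):
            peer_clusters = [cluster_vectors[p][i] for p in peers[user]]
            predicted = Counter(peer_clusters).most_common(1)[0][0]
            if vec[i] != predicted:   # prediction error -> +1 anomaly score
                s += 1
        scores[user] = s
    return scores

vectors = {
    "alice":   [0, 1, 2],   # domains: logon, device, http
    "bob":     [0, 1, 2],
    "carol":   [0, 1, 2],
    "mallory": [0, 3, 0],   # deviates from peers in the device and http domains
}
peers = {u: ["alice", "bob", "carol"] for u in vectors}
print(hamming_scores(vectors, peers)["mallory"])  # 2
print(hamming_scores(vectors, peers)["alice"])    # 0
```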

Integration of scores

Finally, we obtain the user's anomaly score in each domain and need to combine these scores, mainly via a weighted sum. Here we borrow the TF-IDF (term frequency-inverse document frequency) framework from document retrieval, which weights a word by its frequency in a document relative to its frequency in the whole corpus, expressing the word's relative importance to that document. The pseudocode for computing the final weighted score is given in Figure 2:

In Figure 2, lines 1-4 compute the weight of each domain's score, lines 5-7 compute each domain's weighted score, and finally the weighted scores across all domains are summed per user. The resulting F is the final set of anomaly scores for all users.
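Since the exact pseudocode lives in Figure 2, the following is only our reading of the idea in Python: a domain in which almost every user looks anomalous receives a low, IDF-like weight, and each user's final score is the weighted sum over domains. The weighting formula and data are illustrative assumptions:

```python
# Hedged sketch of TF-IDF-style fusion of per-domain anomaly scores.
import math

def fuse_scores(domain_scores):
    """domain_scores: user -> {domain: per-domain anomaly score}."""
    users = list(domain_scores)
    domains = {d for s in domain_scores.values() for d in s}
    weights = {}
    for d in domains:
        # IDF-like weight: a domain that flags nearly everyone is weighted down.
        flagged = sum(1 for u in users if domain_scores[u].get(d, 0) > 0)
        weights[d] = (math.log(len(users) / (1 + flagged))
                      if len(users) > 1 + flagged else 0.0)
    return {u: sum(weights[d] * s for d, s in domain_scores[u].items())
            for u in users}

scores = {
    "alice":   {"logon": 1, "http": 1},
    "bob":     {"logon": 1, "http": 0},
    "carol":   {"logon": 1, "http": 0},
    "mallory": {"logon": 1, "http": 3},
}
fused = fuse_scores(scores)
print(max(fused, key=fused.get))  # mallory: high score in a rarely-flagged domain
```

Here the logon domain flags everyone and so receives weight zero, while the rarely-flagged http domain drives the final ranking.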

4. Unusual-change attack detection

Blend-in attack detection alone may not find the intruder, so we also analyze unusual changes in user behavior. The starting point is that a user's behavior changes normally over time, i.e., there are reasonable deviations. If a single user's history were the only baseline, many reasonable changes would be flagged as anomalies. Therefore we choose the user's peer group (users with the same position, role and work tasks) as the comparison baseline. As before, we give a definition:


The underlying assumption is similar to that of Definition 2: users with the same tasks and roles should also show similar change patterns. The changes need not occur at the same time; we care about the consistency of change patterns over a long period. For example, if the changes of user A and the peer "engineer" group in the logon domain both lie between cluster 2 and cluster 4, and their changes in the email domain all lie among clusters 3, 4 and 5, then A's behavior is consistent; otherwise it is inconsistent.

Again, a figure for reference: the changes of users A, B and C in domain 1 all lie between states 1 and 2, but in domain 2, user A's changes lie between states 1 and 4, unlike users B and C:

The detection method is similar to that for blend-in attacks; we briefly describe it as follows:

1. Cluster the original data: note that this is no longer GMM-based; instead we directly use the domain features introduced in the Experimental Data section (Figure 1), treat the different feature values as the user's "state" in that domain, and build a transition probability matrix QD, whose element QD(Ck, Cm) is the probability of the user changing from state k to state m, estimated from state-occurrence frequencies;
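The frequency-based estimate of the transition matrix QD can be sketched in a few lines (a minimal illustration; the state ids and sequence are made up):

```python
# Sketch: estimate Q_D from a user's day-by-day state sequence in one domain.
from collections import defaultdict

def transition_matrix(states):
    """states: sequence of state ids, one per day. Returns Q[k][m] = P(k -> m)."""
    counts = defaultdict(lambda: defaultdict(int))
    for k, m in zip(states, states[1:]):
        counts[k][m] += 1
    q = {}
    for k, row in counts.items():
        total = sum(row.values())
        q[k] = {m: c / total for m, c in row.items()}
    return q

# A user who mostly stays in state 0, occasionally moving to state 1:
q = transition_matrix([0, 0, 1, 0, 0, 0, 1, 0])
print(q[0][1])  # 2 of the 5 transitions out of state 0 go to state 1 -> 0.4
```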

2. Behavior change modeling: two algorithms are used, a Markov model and a rarest-change model. We omit the details and directly give the model formulas:

Figure 3.1: the Markov model, where S is the user's anomaly score in domain D and PD is the prior probability of state C0:

Figure 3.2: the rarest-change model.

3. Information fusion: this step is simple. From the set of S scores computed above, select the most threatening value, i.e., the smallest S, as shown in Figure 3.3:
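Since the exact formulas are in Figures 3.1-3.3, the following is only a hedged reading of the text: the Markov score S of a user's state sequence is the prior of the initial state times the product of transition probabilities, and fusion keeps the smallest, i.e. most threatening, S across domains. The priors, transitions and sequences are made-up toy values:

```python
# Hedged sketch of Markov-model scoring plus min-score fusion.

def markov_score(states, prior, q):
    """Likelihood of a user's state sequence under prior P_D and transitions Q_D."""
    s = prior[states[0]]
    for k, m in zip(states, states[1:]):
        s *= q.get(k, {}).get(m, 0.0)   # unseen transitions score 0 (most anomalous)
    return s

def fuse(domain_scores):
    """Keep the smallest S over all domains -- the most threatening value."""
    return min(domain_scores.values())

prior = {0: 0.8, 1: 0.2}
q = {0: {0: 0.75, 1: 0.25}, 1: {0: 1.0}}
normal = markov_score([0, 0, 0], prior, q)     # 0.8 * 0.75 * 0.75 = 0.45
rare   = markov_score([1, 0, 1], prior, q)     # 0.2 * 1.0  * 0.25 = 0.05
print(fuse({"logon": normal, "device": rare}))  # the rarer sequence wins: 0.05
```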

5. Experimental results

To verify the effect of our information fusion, we first run experiments against the blend-in attack; the results are as follows:

Figure 4: device and file data are better suited to detecting anomalies, while HTTP and logon perform slightly worse:

Figure 5: using the device domain alone, the anomalies are not very significant:

Figure 6: with inter-domain fusion, the anomalies are very pronounced:

For the unusual-change attack, we experiment with the 4,600-user data set and find that the cost-effectiveness of analysis improves significantly. In the July-August data, sampling only 50% of the user data is enough to detect all malicious attackers; in the September data, only 13% needs to be analyzed, as shown in Figures 7 and 8:

We can see that the change likelihoods of the device and logon domains are highly consistent across users, so inconsistent changes stand out clearly, i.e., anomalies are easier to detect in these two domains; the change likelihoods of the email sent/received domains are less consistent, which is unfavorable for anomaly detection.

Likelihood curve of user behavior change in device domain:

Likelihood curve of user behavior change in logon domain:

Likelihood curve of user behavior change in the email sent domain:

Likelihood curve of user behavior change in the email received domain:

6. Summary

Traditional detection methods focus on a single domain's data or a simple concatenation of multi-domain data. The methods introduced today effectively combine the information of all domains through TF-IDF and GMM, improving detection efficiency and reducing the false-alarm rate. The key to the fusion method is to automatically compute the weighted sum of each domain's information via the TF-IDF framework, and to fuse the domains ingeniously using the definition of "consistency", so that the cross-domain correlations are exploited and detection is optimized.

7. References

1. M. Salem and S. Stolfo. Masquerade attack detection using a search-behavior modeling approach. Columbia University Computer Science Department, 2009.

2. M. Salem, S. Hershkop and S. Stolfo. A survey of insider attack detection research. Insider Attack and Cyber Security: Beyond the Hacker, Springer, 2008.

3. Multi-Domain Information Fusion for Insider Threat Detection. IEEE Security and Privacy Workshops, 2013.

*Author: Mu Qianzhi. This article is part of FreeBuf's original content reward program; reprinting without permission is prohibited.