# N Methods for Anomaly Detection from Alibaba Engineers

Xiaoji's intro: Online fraud (the "black industry") is rampant, and its cheating techniques keep evolving, degrading advertising effectiveness and inflating app promotion costs. Accurately identifying cheating is what Internet companies and advertisers urgently need. Today we discuss anomaly detection from the perspectives of time series, statistics, distance, linear methods, distribution, trees, graphs, behavior sequences, supervised machine learning, and deep learning models.

By Li Weibin, Hu Yi and Wang Hao

## background

Anomaly detection, also known as outlier detection, is the process of finding objects whose behavior differs from what is expected; the detected objects are called anomalies or outliers. Anomaly detection is widely used in production and daily life, for example credit-card fraud detection, industrial damage detection, and ad-click anti-cheating.

An outlier is a data object that differs markedly from the other data objects. As shown in Figure 1 below, the points in regions N1 and N2 are normal data, while the regions O1, O2 and O3, far away from N1 and N2, contain anomalous points.

Figure 1. Example of outliers

One of the difficulties of anomaly detection is the lack of ground truth. The common approach is to first mine anomalous samples with unsupervised methods, and then use a supervised model that combines multiple features to mine more cheating.

Recently, we have used a variety of algorithms to mine outliers. This article introduces the principles and application scenarios of these anomaly detection algorithms from different perspectives. Given the sensitivity of the business, feature details are omitted.

## 1.1 moving average (MA)

The moving average is a common tool for analyzing time series: it filters out high-frequency noise and detects anomalous points. Depending on how it is computed, common moving-average algorithms include the simple moving average, the weighted moving average, and the exponential moving average. Suppose the moving-average time window is $t$ and there is a time series $x_1, x_2, \ldots, x_n$:

### 1.1.1 simple moving average (SMA)

$$\mathrm{SMA} = \frac{x_n + x_{n-1} + \cdots + x_{n-t+1}}{t}$$

It is easy to see from the formula that the mean of the historical values serves as the prediction of the current value. In scenarios where the series fluctuates little over time, if the difference between the moving average and the real value at a given moment exceeds a certain threshold, the value at that moment is judged anomalous.

Applications:

a. Smoothing noisy data, i.e., replacing the current value with the moving average to filter out noise;

b. Forecasting future values.
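The thresholding idea above can be sketched in a few lines of Python. This is a minimal illustration: the window size, the 3-sigma-style threshold, and the toy data are arbitrary assumptions, not values from the article.

```python
import numpy as np

def sma_anomalies(series, window=5, threshold=3.0):
    """Flag points whose deviation from the trailing simple moving
    average exceeds `threshold` times the std of the history so far."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for t in range(window, len(series)):
        sma = series[t - window:t].mean()   # mean of the previous `window` values
        if abs(series[t] - sma) > threshold * series[:t].std():
            flags[t] = True
    return flags

# A flat series with one spike: only the spike (index 7) is flagged.
data = [10, 11, 10, 9, 10, 11, 10, 50, 10, 11]
flags = sma_anomalies(data)
```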

### 1.1.2 weighted moving average (WMA)

Because the simple moving average assigns the same weight to every data point in the window, it is insensitive to the most recent data and its predictions lag. A natural idea is therefore to give more weight to recent data and less weight to older data in the window when computing the moving average, so as to catch recent changes faster. This leads to the weighted moving average and the exponential moving average.

Compared with the simple moving average, the weighted moving average is more sensitive to recent changes and lags less. But because the weights decay only linearly, the weighted moving average still exhibits some lag.

### 1.1.3 exponential moving average (EMA)

The exponential moving average (EMA) is similar to the weighted moving average (WMA), but differs in that the weights decay exponentially rather than linearly. Moreover, with exponential decay the coefficient of a data point never decays to exactly 0, no matter how old the point is, but only approaches 0. The exponential moving average is therefore effectively an infinite series: every past data point contributes something to the current value, but points far from the present carry very low weight. In practice, the exponential moving average at time $t$ can be computed recursively as:

$$\mathrm{EMA}_t = \alpha \, x_t + (1 - \alpha) \, \mathrm{EMA}_{t-1}$$

where $\alpha$ is the decay factor of the weights, taking a value between 0 and 1. The larger $\alpha$ is, the faster past observations decay.
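The recursion above translates directly into code; a minimal sketch (the smoothing factor alpha = 0.3 is an arbitrary illustration):

```python
def ema(series, alpha=0.3):
    """Exponential moving average: EMA_t = alpha*x_t + (1-alpha)*EMA_{t-1}.
    A larger alpha discounts past observations faster."""
    out = [float(series[0])]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out
```

As with the SMA, a large gap between `ema(series)[-1]` and the latest raw value signals a candidate anomaly.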

## 1.2 year-over-year and period-over-period comparison

Figure 2. Year-over-year and period-over-period formulas

The formulas for the year-over-year and period-over-period ratios are shown in Figure 2. They are suitable for periodic data. For example: 1. monitor the period-over-period and year-over-year changes of an app's DAU to detect rises or falls in time; 2. monitor the period-over-period and year-over-year changes of real-time ad clicks and spend to detect changes in time. When such a ratio exceeds a certain threshold (see Part 10 for thresholds), an anomaly is declared.
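A minimal sketch of such a ratio alarm (the 30% threshold is an arbitrary illustration, not a value from the article):

```python
def ratio_alarm(current, reference, threshold=0.3):
    """Compare a metric (e.g. DAU, ad clicks) against its value in the
    reference period (yesterday, last week, last year); alarm when the
    relative change exceeds the threshold."""
    change = (current - reference) / reference
    return abs(change) > threshold, change
```

For example, `ratio_alarm(150, 100)` reports a 50% jump and fires, while `ratio_alarm(100, 100)` does not.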

## 1.3 anomaly detection for time-series indicators (STL + GESD)

This is a single-dimension anomaly detection approach for time-series indicators. The general idea is:

(1) First apply STL time-series decomposition to the indicator to obtain the seasonal, trend and residual components, as shown in Figure 3; (2) run anomaly detection on the trend + residual components with the GESD (generalized extreme studentized deviate) algorithm; (3) to improve robustness to outliers, replace the mean and standard deviation statistics used in GESD with the median and the MAD (median absolute deviation); (4) output anomaly scores: score = (value - median) / MAD, where value is the current value and median is the median of the sequence. A negative score indicates an abnormal drop; a positive score indicates an abnormal rise.

Figure 3. STL decomposition example
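The full pipeline needs an STL implementation, but step (4)'s robust score is easy to sketch with NumPy (the toy data are illustrative; the median/MAD replacement is as described above):

```python
import numpy as np

def mad_scores(values):
    """Robust anomaly score: (value - median) / MAD.
    Negative scores indicate abnormal drops, positive scores abnormal rises."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))   # median absolute deviation
    return (values - median) / mad
```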

## 2.1 single feature and Gaussian distribution

If a variable $x$ follows a Gaussian distribution $x \sim N(\mu, \sigma^2)$, its probability density function is:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

We can use the existing samples $x^{(1)}, \ldots, x^{(m)}$ to estimate the population parameters:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
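These estimates translate directly into code; a minimal sketch using only the standard library (a very low density marks a sample as a candidate anomaly):

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood estimates of mean and variance."""
    m = len(samples)
    mu = sum(samples) / m
    sigma2 = sum((x - mu) ** 2 for x in samples) / m
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density p(x)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```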

## 2.2 multiple uncorrelated features, each following a Gaussian distribution

Suppose the n-dimensional dataset is $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, where each $x^{(i)} \in \mathbb{R}^n$.

If every dimension follows a Gaussian distribution, the mean and variance of each dimension can be estimated. Specifically, for $j \in \{1, \ldots, n\}$:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2$$

For a new data point $x$, its probability can be computed as:

$$p(x) = \prod_{j=1}^{n} p\left(x_j; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$

## 2.3 correlated features following a multivariate Gaussian distribution

Suppose the n-dimensional dataset is $X = \{x^{(1)}, \ldots, x^{(m)}\}$ and the (correlated) variables jointly follow a multivariate Gaussian distribution. The n-dimensional mean vector and covariance matrix can be estimated as:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{T}$$

For a new data point $x$, the probability can be computed as:

$$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
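The density formula can be evaluated with NumPy; a small sketch:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    """Density of an n-dimensional Gaussian with mean mu and covariance cov."""
    x, mu, cov = np.asarray(x, float), np.asarray(mu, float), np.asarray(cov, float)
    n = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff))
```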

## 2.4 Mahalanobis distance

For a dataset D of multidimensional column vectors with mean vector $\bar{x}$, the Mahalanobis distance from any object $x$ in D to $\bar{x}$ is:

$$MDist(x, \bar{x}) = \sqrt{(x-\bar{x})^{T}\, S^{-1}\, (x-\bar{x})}$$

where $S$ is the covariance matrix. The distances can be sorted; objects with excessively large values can be regarded as outliers.
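A sketch with NumPy (the toy dataset in the assertions is illustrative; sorting these distances and inspecting the largest values implements the outlier rule above):

```python
import numpy as np

def mahalanobis(x, data):
    """Mahalanobis distance from x to the mean of `data` (rows = observations)."""
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)        # sample covariance matrix
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```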

## 2.5 box plot

This method can be used when the data do not follow a Gaussian distribution. First compute the first quartile Q1 (25%) and the third quartile Q3 (75%). Let IQR = Q3 - Q1, then compute the outlier boundaries Q3 + λ·IQR and Q1 - λ·IQR. λ is usually taken as 1.5 (playing a role similar to 3σ in the normal distribution), as shown in Figure 4 below:

Figure 4. Schematic diagram of the box plot method
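A minimal NumPy sketch of the rule with λ = 1.5:

```python
import numpy as np

def iqr_outliers(values, lam=1.5):
    """Return the values outside [Q1 - lam*IQR, Q3 + lam*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - lam * iqr, q3 + lam * iqr
    return values[(values < lower) | (values > upper)]
```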

## 3.1 angle-based outlier detection

Figure 5. Point sets and angles

As shown in Figure 5 above, for three points x, y and z we can form the two vectors $\vec{xy}$ and $\vec{xz}$. If, for any pair of distinct points y and z, the angle $\angle yxz$ varies only slightly, then x is an outlier. The angle is easily obtained from the cosine formula:

$$\cos \angle yxz = \frac{\langle \vec{xy}, \vec{xz} \rangle}{\lVert \vec{xy} \rVert \cdot \lVert \vec{xz} \rVert}$$

Let D be the point set. The angle-based outlier factor of x is the variance of the angles at x over all pairs of distinct points in D:

$$ABOF(x) = \operatorname{Var}_{y, z \in D} \, \angle yxz$$

For outliers this variance is small. The naive algorithm has time complexity $O(n^3)$, so it suits scenarios where the amount of data n is small.
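A direct (O(n^3) per full scoring) sketch of the angle-variance score; the toy points are illustrative, and x is assumed not to coincide with any point in the set:

```python
import numpy as np
from itertools import combinations

def angle_variance(x, points):
    """Variance of the angles at x formed by all pairs of other points.
    An outlier sees the rest of the data within a narrow cone, so its
    angle variance is small."""
    x = np.asarray(x, dtype=float)
    angles = []
    for y, z in combinations(points, 2):
        u = np.asarray(y, dtype=float) - x
        v = np.asarray(z, dtype=float) - x
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.var(angles))
```

A far-away point gets a smaller variance than a point inside the cluster.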

## 3.2 outlier detection based on KNN

Let D be a point set and, for any point x, let dist(k, x) be the sum of the distances from x to its k nearest neighbors. The larger dist(k, x) is, the more anomalous the point. A naive implementation has time complexity $O(n^2)$, where n is the size of the dataset.
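A brute-force sketch that computes dist(k, x) for every point at once:

```python
import numpy as np

def knn_scores(points, k=2):
    """For each point, the sum of distances to its k nearest neighbors;
    larger scores mean more isolated, hence more anomalous."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d.sort(axis=1)                     # row i: sorted distances from point i
    return d[:, 1:k + 1].sum(axis=1)   # skip column 0 (distance to itself)
```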

## 4. Linear method (matrix decomposition and PCA dimensionality reduction)

The main idea of the matrix-decomposition-based outlier detection method is to use PCA to find outliers that violate the correlation structure of the data. To find them, a PCA-based algorithm projects the data from the original space into the principal component space, and then back from the principal component space into the original space. For most data points, if only the first principal component is used for projection and reconstruction, the reconstruction error is small; for outliers, the reconstruction error is relatively large. This is because the first principal component reflects the variance of normal points, while the last principal components reflect the variance of anomalous points.

Suppose X is a dataset of n samples with p dimensions, and its covariance matrix is $\Sigma$. Then the covariance matrix can be decomposed as:

$$\Sigma = P D P^{T}$$

where P is a $p \times p$ orthogonal matrix whose columns are the eigenvectors of $\Sigma$, and D is a $p \times p$ diagonal matrix containing the eigenvalues. Graphically, an eigenvector can be viewed as a line in a 2D plane or a plane in a higher-dimensional space; the eigenvalue associated with an eigenvector reflects how stretched the data are along that direction. Usually the eigenvalues in D are sorted from largest to smallest, and the columns of the eigenvector matrix P are rearranged accordingly.

The projection of dataset X onto the principal component space can be written as $Y = XP$. The projection can be restricted to the top-j principal components, giving the matrix:

$$Y^{j} = X P^{j}$$

where $P^{j}$ is the first j columns of P, i.e. a $p \times j$ matrix, and $Y^{j}$ is the first j columns of Y, an $n \times j$ matrix. In the same way, mapping back from the principal component space to the original space gives the reconstructed dataset:

$$R^{j} = Y^{j} \left(P^{j}\right)^{T} = X P^{j} \left(P^{j}\right)^{T}$$

where $R^{j}$ is the dataset reconstructed from the top-j principal components, an $n \times p$ matrix. As shown in Figure 6:

Figure 6. Schematic diagram of matrix transformation

The outlier score of a data point $x_i$ is defined as:

$$score(x_i) = \sum_{j=1}^{p} \left\lVert x_i - r_i^{j} \right\rVert \cdot ev(j), \qquad ev(j) = \frac{\sum_{k=1}^{j} \lambda_k}{\sum_{k=1}^{p} \lambda_k}$$

where $r_i^{j}$ is the reconstruction of $x_i$ from the top-j principal components and the eigenvalues $\lambda_k$ are sorted from largest to smallest. $ev(j)$ represents the proportion of variance explained by the top-j principal components; it is increasing in j, since the larger j is, the more variance is counted in the sum from 1 to j. Under this definition, the first principal component, which captures the largest variance, gets the minimum weight, while the last principal component gets the maximum weight 1. By the nature of PCA, outliers deviate more along the last principal components, so they receive larger outlier scores.
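A compact sketch of the projection-and-reconstruction step (scored here by plain reconstruction error from the top-j components, a simplified variant of the weighted score above; the toy line-shaped data are illustrative):

```python
import numpy as np

def pca_reconstruction_error(X, j=1):
    """Per-sample reconstruction error when projecting onto the top-j
    principal components; a large error suggests an outlier."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # eigenvalues in descending order
    P = eigvecs[:, order[:j]]                  # top-j eigenvectors (P^j)
    recon = Xc @ P @ P.T                       # project, then map back
    return np.linalg.norm(Xc - recon, axis=1)

# Four points on the line y = x plus one off-line point: the off-line
# point has the largest reconstruction error.
errors = pca_reconstruction_error([[0, 0], [1, 1], [2, 2], [3, 3], [0, 3]])
```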

## 5. distribution

That is, compare the distribution of a given feature between the reference traffic and the traffic under test.

## 5.1 relative entropy (KL divergence)

Relative entropy (KL divergence) measures the distance between two probability distributions: it is zero when the two distributions are identical, and it grows as the difference between them increases, so it can be used to compare the similarity of two distributions. If $p(x)$ and $q(x)$ are the values of two probability distributions, the relative entropy is:

$$D(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
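A minimal sketch (it assumes q(x) > 0 wherever p(x) > 0):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x)/q(x)); zero iff the distributions match."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```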

## 5.2 chi square test

The chi-square test uses the test statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

to compare expected results with actual results and obtain the probability of observing the actual outcome, where O denotes the observed value and E the expected value. This statistic measures the discrepancy between expectation and observation. Finally, the decision is made by looking up the chi-square distribution table at the chosen significance level.
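The statistic itself is one line; turning it into a p-value then requires the chi-square distribution table (or a library such as scipy.stats):

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```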

## 6. Trees (isolation forest)

Figure 7. Isolation forest detection results

The isolation forest cuts the data space with random hyperplanes: each cut splits a space into two subspaces, and each subspace is cut again with a random hyperplane, recursively, until every subspace contains only one data point. High-density clusters require many cuts before each point sits alone in a subspace, while points in low-density regions are isolated after only a few cuts. As shown in Figure 7, the black points are anomalies, each settling alone in a subspace after a few cuts; the white points are normal and concentrated in clusters. The anomaly boundary detected by the isolation forest is the red line in Figure 7, which correctly detects all the black outliers.

As shown in Figure 8, iForest cuts four data points: B and C have height 3, A has height 2, and D has height 1. D is isolated first, so it is the most likely to be anomalous.

Figure 8. Iforest cutting process
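With scikit-learn this takes a few lines (the synthetic cluster and the two planted outliers are illustrative, not data from the article):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(0, 0.5, size=(100, 2))        # dense cluster of normal points
outliers = np.array([[4.0, 4.0], [-4.0, 4.0]])    # two isolated points
X = np.vstack([normal, outliers])

clf = IsolationForest(random_state=42).fit(X)
labels = clf.predict(X)                            # -1 = predicted anomaly, 1 = normal
```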

## 7.1 maximal connected subgraph

In an undirected graph G, if there is a path from vertex A to vertex B, then A and B are connected. If graph G contains several subgraphs such that all vertices within each subgraph are connected to one another, while no vertex in one subgraph is connected to any vertex in another, then these subgraphs are called the maximal connected subgraphs of G.

As shown in Figure 9, device is the device ID, mbr is the member ID, and an edge between two nodes indicates that the member has logged in on that device. It is easy to see that device_1, device_2, device_3 and device_4 belong to the same person. This can be used to judge cheating depending on the scenario, and is often used to mine gang cheating.

Figure 9. Maximal connected subgraph result

The precondition of the maximal connected subgraph is that every edge is trustworthy. Applicable scenario: finding all connectivity relations. When the data contain untrustworthy edges, the dirty data must be removed first, otherwise they will degrade the result.
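Maximal connected subgraphs can be extracted with a small union-find; the device/member edges below mirror Figure 9's setup (the IDs are illustrative):

```python
def connected_components(edges):
    """Group nodes into connected components via union-find;
    one component per suspected identity/gang."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Devices linked by shared member logins collapse into one component.
edges = [("device_1", "mbr_a"), ("device_2", "mbr_a"),
         ("device_3", "mbr_b"), ("device_2", "mbr_b"),
         ("device_4", "mbr_c")]
components = connected_components(edges)
```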

## 7.2 label propagation clustering

Based on the topology of the graph, the label propagation clustering algorithm divides the graph into subgraphs so that connections within each subgraph are dense while connections between subgraphs are sparse. The basic idea of label propagation is that a node's label depends on the label information of its neighboring nodes, with the degree of influence determined by node similarity; the labels stabilize through iterative propagation and updating. After label propagation clustering, the nodes in Figure 10 are divided into two subgraphs: nodes 1, 2, 3 and 4 belong to one subgraph, and nodes 5, 6, 7 and 8 to the other.

Figure 10. Graph structure of label propagation clustering algorithm

There can be a few connections between the subgraphs produced by label propagation clustering. Applicable scenario: nodes with high cohesion within groups and low coupling between groups. On the graph in Figure 10, the maximal connected subgraph method yields one subgraph, while label propagation clustering yields two.
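Below is a toy synchronous variant of label propagation (production implementations typically use asynchronous updates with random tie-breaking; here ties pick the smallest label so the result is deterministic). Two triangles joined by a single bridge edge separate into two communities:

```python
def label_propagation(adjacency, iterations=10):
    """Toy synchronous label propagation: every node adopts the most
    frequent label among its neighbors (ties broken by smallest label)."""
    labels = {node: node for node in adjacency}
    for _ in range(iterations):
        new_labels = {}
        for node, neighbors in adjacency.items():
            counts = {}
            for nb in neighbors:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == best)
        labels = new_labels
    return labels

# Triangle {1,2,3} and triangle {4,5,6}, bridged by the edge 3-4.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
```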

## 8. Behavior sequence (Markov chain)

As shown in Figure 11, users have five behavior states on a search engine: page request (P), search (S), organic search result click (W), ad click (O), and page turning (N). There are transition probabilities between states, and a chain of several behavior states can be regarded as a Markov chain.

Figure 11. User behavior state diagram

The state transition matrix is obtained by counting all pairs of adjacent states in normal behavior sequences and computing the probability of each state transitioning to every other state. For each user behavior sequence to be tested, the probability of the sequence under this matrix is easy to compute; the larger the probability, the more the behavior resembles a normal user's.
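A sketch with a hypothetical transition matrix (the probabilities below are made up for illustration; in practice they are counted from normal behavior sequences as described above):

```python
import math

# Hypothetical transition probabilities between the five states
# P (page request), S (search), W (organic result), O (ad click), N (page turn).
transitions = {
    "P": {"S": 0.8, "N": 0.2},
    "S": {"W": 0.6, "O": 0.3, "N": 0.1},
    "W": {"S": 0.5, "P": 0.5},
    "O": {"S": 0.2, "P": 0.8},
    "N": {"S": 0.4, "W": 0.6},
}
initial = {"P": 1.0}

def sequence_log_prob(seq):
    """Log-probability of a behavior sequence; lower = less human-like."""
    p = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        p += math.log(transitions[prev][cur])
    return p
```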

## 9. Supervised model

All of the above methods are unsupervised and relatively simple to implement and understand. However, because some of them use only a few features at a time, many strategies must be maintained to intercept cheating comprehensively; moreover, combining multiple features in those methods depends on human experience. A supervised model can combine many features automatically and generalizes better.

## 9.1 machine learning model: GBDT

Samples: use the cheating samples mined by the unsupervised methods above as training samples. If the cheating samples are still too few, generate more with SMOTE or a GAN. Then train the GBDT model and evaluate its effect on held-out data.
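A minimal scikit-learn sketch (the synthetic "normal" and "cheating" samples below stand in for real mined samples; in practice the features and labels come from the unsupervised stage):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
# Hypothetical features: normal traffic around 0, mined cheating samples around 3.
X = np.vstack([rng.normal(0, 1, size=(200, 4)), rng.normal(3, 1, size=(200, 4))])
y = np.array([0] * 200 + [1] * 200)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```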

## 9.2 wide & deep learning model

Wide & Deep extracts wide features and deep features separately and then merges them for training; the model structure is shown in Figure 12. The wide part is a logistic regression (LR) over high-dimensional features and feature crosses: LR is efficient, scalable and interpretable, and strengthening the feature crosses gives the model memorization in its judgments, but its generalization is weak.

The deep part maps features through free combinations learned by a neural network, which generalizes strongly. In essence, it mines more general patterns underlying the sample features and uses them for judgment, but it carries a risk of over-generalization.

The algorithm combines the two kinds of features to balance memorization and generalization.

In order to further increase the generalization ability of the model, we can use the cheating samples mined by the previous unsupervised method as training samples to train the wide & deep model to identify cheating.

Figure 12. Wide & deep model

## 10.1 common ways to choose a threshold

All the methods above require an anomaly threshold. The following ideas can be used to select a threshold first, and then its reasonableness can be verified against the data.

a. Unsupervised methods: set the threshold using quantiles, or find the inflection point of the distribution curve of historical data;

b. Supervised models: examine the precision-recall curve on the validation set.

## 10.2 transforming non-Gaussian distributions into Gaussian

Some features do not follow a Gaussian distribution, but a functional transformation can bring them closer to Gaussian so that the statistical methods above apply. Common transformation functions: $\log(x + c)$, where c is a non-negative constant, and $x^{c}$, where c is a fraction between 0 and 1.
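The two transformations can be sketched as follows (the values of c are the user's choice; the defaults here are arbitrary illustrations):

```python
import math

def log_transform(x, c=1.0):
    """log(x + c), with c a non-negative constant shifting the data positive."""
    return math.log(x + c)

def power_transform(x, c=0.5):
    """x ** c with 0 < c < 1; compresses a long right tail."""
    return x ** c
```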

## references

[1] Charu C. Aggarwal. Outlier Analysis, Second Edition. Springer, 2016.

[2] Varun Chandola, Arindam Banerjee, Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 2009.

[3] Kalyan Veeramachaneni, Ignacio Arnaldo, et al. AI^2: Training a Big Data Machine to Defend. In Proc. HPSC and IDS, 2016.

[4] Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou. Isolation Forest. ICDM, 2008.

[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, et al. Wide & Deep Learning for Recommender Systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, 2016.

[6] Nitesh V. Chawla, Kevin W. Bowyer, et al. SMOTE: Synthetic Minority Over-sampling Technique. JAIR, 2002.

"More dry goods, more gains"

Alibaba search recommendation system has been upgraded again!

Under the new retail wave, how to digitize the actions of people and goods?

Revealing mother Ali's intelligent matting algorithm

Focus on machine intelligence

Grasp the possibility of the future