

Exploring machine learning models for account security

Posted by forbes at 2020-03-16


Account data is often among the first things an enterprise has to secure. When accounts leak, the underlying cause is usually a lack of account protection.

How to secure accounts, however, is a problem every enterprise has to face. As attackers' methods grow more sophisticated, websites need effective ways to protect accounts. This article uses the Uber team's account security work as an example of how machine learning can help.

The traditional approach is to train your systems and analysts to pick out abnormal behavior from normal user behavior. Once abnormal behavior is detected, it can be blocked.

In practice, however, abnormal logins are very hard to identify, because attackers can use phishing and other means to obtain a legitimate user's password and log in with it. Engineers therefore protect accounts with multiple layers of defense.

This article follows the Uber team's approach to building an account protection system. The system combines traditional controls, namely rate limiting and heuristic rules, with machine learning models. Analysts can deploy rules quickly in response to specific attacks. Rate limiting and machine learning, by contrast, have a broader impact.
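
As a rough illustration of the two traditional layers, the sketch below chains a sliding-window rate limiter with one heuristic rule. It is a minimal sketch under stated assumptions, not Uber's implementation: the class and function names, the per-IP key, the thresholds, and the blocklisted user agent are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_attempts login attempts per key within window_seconds."""

    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # key -> timestamps of recent attempts

    def allow(self, key: str, now: float) -> bool:
        recent = self._attempts[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            return False
        recent.append(now)
        return True


def heuristic_rule_flags(attempt: dict) -> bool:
    """Hypothetical analyst rule: flag logins from a blocklisted user agent."""
    return attempt.get("user_agent") in {"known-bad-bot/1.0"}


def first_layers(attempt: dict, limiter: SlidingWindowRateLimiter, now: float) -> str:
    """Return 'block', 'flag', or 'pass' for the two traditional layers."""
    if not limiter.allow(attempt["ip"], now):
        return "block"
    if heuristic_rule_flags(attempt):
        return "flag"
    return "pass"
```

A login that passes both layers would then be handed to the machine learning models described later. A rule like `heuristic_rule_flags` is cheap to change, which is why rules suit a fast response to one specific attack.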

Before we start, here is a brief overview of the types of machine learning:

Supervised learning learns a function from a given set of training data; when new data arrives, the function is used to predict the result. The training set must contain both inputs and outputs, also called features and targets, and the targets are labeled by people. Common supervised learning tasks include regression and classification.

Unsupervised learning, by contrast, has no human-labeled results. A common class of unsupervised learning algorithms is clustering.

Semi-supervised learning sits between supervised learning and unsupervised learning.

Reinforcement learning learns how to act through observation. Each action affects the environment, and the learner makes its decisions based on the feedback it observes from that environment.

Deep learning is a branch of machine learning that attempts to abstract data at a high level using multiple processing layers built from complex structures or multiple nonlinear transformations.

Machine learning model

For Uber, building a machine learning model is challenging because attackers and defenders are in a constant contest, and fraudsters keep changing their attack patterns to evade the model.

Below are two models Uber built to identify suspicious logins. Once an anomaly is identified, Uber asks the user to verify with two-factor authentication. If the login is particularly suspicious, Uber can also proactively protect the user, for example by resetting the password or notifying the user.
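
The escalation described above can be sketched as a small policy function. This is only an illustration: the suspicion score in [0, 1], the two thresholds, and the action names are assumptions made for the example, not values from Uber.

```python
def choose_actions(suspicion: float,
                   challenge_at: float = 0.5,
                   protect_at: float = 0.9) -> list:
    """Map a model's suspicion score (0.0 to 1.0) to protective actions.

    The thresholds are hypothetical and would need tuning on real data.
    """
    if not 0.0 <= suspicion <= 1.0:
        raise ValueError("suspicion must be between 0.0 and 1.0")
    if suspicion >= protect_at:
        # Particularly suspicious: protect the account proactively.
        return ["reset_password", "notify_user"]
    if suspicion >= challenge_at:
        # Suspicious: ask for a second factor before allowing the login.
        return ["require_two_factor"]
    return ["allow"]
```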

Semi-supervised model

As mentioned above, machine learning includes both supervised and unsupervised methods. Semi-supervised learning (SSL) is a key topic in pattern recognition and machine learning that combines the two: it uses a large amount of unlabeled data together with labeled data for pattern recognition. Its appeal is that it keeps the human labeling effort as small as possible.

One way to detect abnormal logins is to examine characteristics of the attacker's logins, such as the IP address. Attackers vary: some use only a handful of IPs, while others use tens of thousands. Those IPs typically belong to web hosting providers, Tor proxies, or botnets made up of personal devices infected with malware (see Mirai).

Uber uses a clustering model based on IP addresses. PCA reduces the features to two dimensions so that the data is easier to visualize. Each dot in the figure is an IP address, and the upper and lower plots show login activity in 2016 and at the end of 2017 respectively. The clusters look quite different because, over that year, the contest between attack and defense pushed attackers to change their login characteristics.
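
To make the dimensionality reduction step concrete, here is a minimal PCA sketch in pure Python that projects feature rows onto their first two principal components using power iteration with deflation. It only illustrates the technique: the function names are made up for this example, and a production system would more likely rely on a numerical library.

```python
import math
import random


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: return (unit eigenvector, eigenvalue) of the largest eigenvalue."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))
    return v, eigenvalue


def pca_2d(rows):
    """Project rows (lists of floats) onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the variance explained by v so the next
        # power iteration converges to the following component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]
```

Each input row would hold the numeric login features of one IP address, and each output pair can be drawn as one dot in a scatter plot.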

Uber uses a semi-supervised approach to classify suspicious IP addresses. IPs are first labeled, and the features are then adjusted according to how well the algorithm separates good IPs from bad ones.

One point worth noting: it is best to use features that are hard for attackers to control, which makes the model more accurate. Uber selected 10 influential features and then used the DBSCAN clustering algorithm to find clusters of IPs.
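
For readers unfamiliar with DBSCAN, the following is a compact pure-Python sketch of the algorithm. It is a teaching example on toy 2-D points, not Uber's pipeline; the source does not give the `eps` and `min_samples` values Uber used, so the ones shown in the usage note are arbitrary.

```python
import math

NOISE = -1


def _neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = _neighbors(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = _neighbors(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_samples=3)` groups the first three and the next three points into two clusters and marks the last point as noise. DBSCAN does not need the number of clusters in advance, which fits a setting where the number of attacker groups is unknown.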

Once the model returns the clusters, analysts compute a few specific metrics for each cluster and decide whether the cluster is good or bad. This way, analysts do not have to label every IP address. If a new cluster has too few labels to judge its IPs, Uber prompts for two-factor authentication or falls back to manual review.
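
The cluster-level judgment can be sketched as follows. The metric used here (the share of known-bad IPs among the labeled IPs in a cluster), the thresholds, and the verdict names are assumptions chosen for illustration; the source does not say which indicators Uber's analysts compute.

```python
from collections import defaultdict


def judge_clusters(cluster_of_ip: dict, known_labels: dict,
                   min_labeled: int = 2, bad_threshold: float = 0.5) -> dict:
    """Return a verdict ('good', 'bad', or 'review') for every cluster.

    cluster_of_ip maps IP -> cluster id; known_labels maps a subset of
    IPs -> 'good' or 'bad'. Unlabeled IPs inherit their cluster's verdict.
    """
    counts = defaultdict(lambda: {"good": 0, "bad": 0})
    for ip, cluster in cluster_of_ip.items():
        tally = counts[cluster]  # registers the cluster even if no IP is labeled
        label = known_labels.get(ip)
        if label in tally:
            tally[label] += 1
    verdicts = {}
    for cluster, tally in counts.items():
        labeled = tally["good"] + tally["bad"]
        if labeled < min_labeled:
            # Too few labels: fall back to 2FA or manual review.
            verdicts[cluster] = "review"
        elif tally["bad"] / labeled >= bad_threshold:
            verdicts[cluster] = "bad"
        else:
            verdicts[cluster] = "good"
    return verdicts
```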

Unsupervised learning

The semi-supervised method needs samples to be labeled in advance, so the defense is reactive: it can only respond after an attacker has launched an attack. An unsupervised method needs no labels, so it can defend more proactively. Uber trains the model only on the behavior of normal users and treats any behavior inconsistent with it as an anomaly. Unless attackers mount a targeted attack (which is costly for ordinary attackers), they do not know how a normal user behaves, so it is hard for them to bypass the detection model.

For example, if a user has previously been in Brazil, the probability that they next appear in India is not very high. The security team built a deep learning model to learn the relationships between cities. The model takes the places the user has previously traveled to and the user's food orders as input, and predicts where the user will go next.
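
The idea of learning only from normal behavior and flagging whatever does not fit can be shown with a far simpler stand-in than a deep model. The sketch below swaps in first-order transition counts between cities; it is not the deep learning model Uber built, and the class name, the threshold, and the sample trips are invented for the example.

```python
from collections import Counter, defaultdict


class CityTransitionModel:
    """Estimate P(next city | current city) from normal users' trip sequences."""

    def __init__(self) -> None:
        self._counts = defaultdict(Counter)

    def fit(self, trips):
        """trips is an iterable of city-name sequences, one per user."""
        for sequence in trips:
            for current, nxt in zip(sequence, sequence[1:]):
                self._counts[current][nxt] += 1
        return self

    def probability(self, current: str, nxt: str) -> float:
        seen = self._counts.get(current)
        if not seen:
            return 0.0  # city never observed as a starting point
        return seen[nxt] / sum(seen.values())

    def is_anomalous(self, current: str, nxt: str, threshold: float = 0.05) -> bool:
        """Flag a move whose estimated probability falls below the threshold."""
        return self.probability(current, nxt) < threshold


# Hypothetical training data containing only normal behavior.
model = CityTransitionModel().fit([
    ["Sao Paulo", "Rio de Janeiro"],
    ["Sao Paulo", "Rio de Janeiro", "Sao Paulo"],
    ["Sao Paulo", "Brasilia"],
])
```

With this toy data, `model.is_anomalous("Sao Paulo", "Mumbai")` is true because that move never appears in the training trips, while a move from Sao Paulo to Rio de Janeiro is not flagged.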

Deep learning performs much better on large, high-dimensional data sets, which makes it better suited to these problems than traditional machine learning. Neural networks have more parameters than traditional models, so they need more data to train fully. Traditional machine learning also struggles with variable-length data, whereas deep learning offers recurrent neural network layers such as LSTM (long short-term memory) that handle such data well.

The Uber team explored embeddings to represent cities. An embedding is a low-dimensional mapping in which the distance between cities represents how likely users are to travel between them. Latitude and longitude alone cannot capture these travel patterns. The team applied word2vec, an algorithm widely used in natural language processing (NLP), to learn the relationships between cities from information beyond latitude and longitude. The team then trained the model on Uber's internal GPU infrastructure with hundreds of millions of training examples.
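
To see how word2vec carries over from words to cities, it helps to look at its training input. In the skip-gram variant, each item in a sequence is paired with the items around it, so a user's ordered list of cities plays the role of a sentence. The sketch below only generates those (center, context) pairs; the function name and window size are assumptions, the source does not say which word2vec variant Uber used, and the embedding training itself is left out.

```python
def skipgram_pairs(sequence, window=1):
    """Return (center, context) pairs for every item within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs
```

Cities that often appear next to each other in users' histories produce many shared pairs, which is what pulls their vectors close together during training, regardless of their geographic coordinates.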


This article has introduced how semi-supervised and unsupervised learning algorithms can be used to identify abnormal users. Unsupervised learning in particular generalizes more broadly and can identify abnormal users across a variety of devices.

The fight to protect accounts is far from over, however. At the 2017 Tencent Security International Technology Summit, held in August, Yang Yong, general manager of Tencent's security department, noted that the underground cybercrime industry is also investing heavily in technology in this area. It is using AI as well, and even has a neural-network-based CAPTCHA recognition system whose accuracy reaches a striking 95%, covering 80% of the CAPTCHAs on the market. Internet companies therefore have to keep raising their defenses, keep pace, and use the latest technology to keep users safe.

*Reference source: Uber security. Author: vulture. If you reprint this article, please credit freebuf.com