Acknowledgement
VSRC thanks mcvoodoo, a friend in the industry, for contributing this excellent original article.
Preamble
Everyone has been using various technologies and mechanisms to detect security threats, from the early SOC, to SIEM, to big-data-driven UEBA. UEBA analyzes users and entities through machine learning; whether a threat is known or unknown, it covers both real-time and offline detection. It produces an intuitive risk rating with supporting evidence, so that security personnel can respond to anomalies and threats.
Background
Malicious activity is generally detected by writing rules for abnormal behavior and by deploying defensive devices that monitor traffic, such as IDS and WAF. However, the scalability of these systems is a perennial problem: when traffic grows suddenly, they struggle to keep up. Traffic-based detection also has limited visibility. At the switch access layer, detection is impractical because of cost, and there is no context from other network segments to help. A sufficiently clever attacker can bypass these devices entirely.
Endpoint software that monitors data flowing between devices is another approach, but its scalability and visibility are also unsatisfactory.
In fact, if the device and user are trusted, many existing methods cannot detect them at all. The weaknesses of traditional security products are that they cannot detect unknown threats or insider threats, cannot scale, and cannot cope with big data. Attackers can always find ways to bypass traditional techniques such as rule-driven detection, malicious file signatures, and sandboxes. Moreover, as data volumes grow, manual analysis becomes ever slower and response times stretch out. Consider the kill chain: from intrusion to lateral movement to exfiltration, traditional products find it difficult to correlate the steps and respond appropriately, and analysts are easily flooded by false positives.
UEBA is comparatively insightful and extensible. Simply put, UEBA is driven by big data and applies machine learning to security analysis. It can detect advanced, hidden, and insider threats without signatures or rules. The analysis techniques include machine learning, behavior modeling, classification, peer-group analysis, statistical models, and graph analysis. Combined with a scoring mechanism and comparisons across activities, these techniques detect anomalies and threats. UEBA also provides threat visualization and visual analysis across the kill chain.
One defining characteristic of UEBA, therefore, is the ability to process large volumes of data from multiple sources, arriving in different formats and at high rates. Subsequent processing extracts valuable information from structured and unstructured sources. This processing is an extension of data mining and predictive analytics, and a discipline in its own right: knowledge discovery and data mining. Data sources are divided into real-time and offline. Real-time processing continuously monitors and analyzes incoming data, generally without correlating historical or third-party data, because of the impact on performance.
What UEBA detects is "anomaly": a deviation from expected behavior. A deviation is not necessarily a threat; a promotional campaign, for example, also changes behavior. An anomaly merits attention, and a threat judgment is made after evaluation. The threat indicator represents a gradual escalation of attention. For example, the data sources might generate 100 anomalies, which are aggregated into 10 threat features, which in turn yield 1-2 threat indicators. This progressive condensation is what lets UEBA detect both anomalies and threats.
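To make the aggregation funnel concrete, here is a minimal Python sketch, with hypothetical anomaly records and an illustrative threshold (the article does not specify the platform's actual aggregation logic):

```python
from collections import defaultdict

# Hypothetical anomaly records: (entity, anomaly_type, score).
anomalies = [
    ("user_a", "odd_login_hour", 0.4),
    ("user_a", "new_device", 0.5),
    ("user_a", "bulk_download", 0.9),
    ("user_b", "odd_login_hour", 0.3),
]

# Step 1: aggregate raw anomalies into per-entity threat features.
features = defaultdict(list)
for entity, kind, score in anomalies:
    features[entity].append(score)

# Step 2: promote an entity to a threat indicator only when the combined
# evidence crosses a threshold (0.7 is purely illustrative).
indicators = {
    entity: round(sum(scores), 2)
    for entity, scores in features.items()
    if sum(scores) > 0.7
}
print(indicators)  # {'user_a': 1.8} -- far fewer indicators than raw anomalies
```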
In machine learning terms, historical data and third-party data can be used to improve the models, but these datasets are much larger than the real-time stream and therefore slower to process. For that reason, historical data is generally not used in real-time processing; only the incoming real-time data is. After real-time detection, actions need to be triggered, such as blocking an IP, locking an account, killing a process, or releasing a false alarm. These actions are not executed automatically; they are presented for manual decision, and the feedback from those decisions is used to further update and improve the models.
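A minimal sketch of that human-in-the-loop flow (all names here are hypothetical, not the platform's API): the detector proposes an action, an analyst confirms or rejects it, and the verdict is logged as labeled feedback for later model updates.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    entity: str
    action: str   # e.g. "block_ip", "lock_account", "kill_process"
    score: float

feedback_log = []  # labeled decisions, used later to retrain the model

def analyst_review(proposal: ProposedAction, confirmed: bool) -> None:
    """Record the analyst's decision instead of acting automatically."""
    feedback_log.append((proposal, confirmed))
    if confirmed:
        print(f"executing {proposal.action} on {proposal.entity}")
    else:
        print(f"released as false alarm: {proposal.entity}")

analyst_review(ProposedAction("10.0.0.5", "block_ip", 0.92), confirmed=True)
```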
Offline processing can find more subtle anomalies and threats. Real-time processing operates under tight latency constraints; offline processing is much more relaxed in this respect. Real-time data is filtered, while the complete data is stored offline, so offline analysis can use more attributes, spanning time, geography, and other dimensions.
Overall framework of the system
In the overall view, the bottom layer is the infrastructure layer; to control cost, various forms of virtualization can be used. Above it sits the software layer, which typically includes Hadoop, Spark, Storm, and the like. Hadoop stores and processes large data sets across distributed clusters. Storm is a distributed real-time computation engine that processes data streams record by record. Spark is a large-scale data processing engine that collects events together for batch processing.
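As a rough illustration of the batch/stream split (assuming PySpark; the path, Kafka broker, and topic are placeholders, and Storm would fill the streaming role equally well):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ueba-demo").getOrCreate()

# Batch path (Spark): process a full day of collected events at once.
events = spark.read.json("hdfs:///logs/2024-01-01/")  # placeholder path
daily_logins = (events.filter(F.col("type") == "login")
                      .groupBy("user").count())

# Streaming path: consume events record by record as they arrive.
stream = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
               .option("subscribe", "events")
               .load())
```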
Then comes the intelligence layer, whose main functions are the security semantic layer and the machine learning layer. The semantic layer extracts, transforms, and loads the data and supplies it for downstream consumption; the machine learning layer is a consumer of the semantic layer.
Above the intelligence layer is the application layer, which analyzes the output of machine learning.
This diagram is a conceptual view of the system. The data receiving module is a logical component responsible for receiving data from the data sources through various communication APIs. ETL prepares and preprocesses the data from the receiving module, for example by adding metadata, so that downstream components can consume it effectively.
Once ETL has prepared the data, it is streamed to real-time analysis and also delivered, via a batch path, to offline analysis. Real-time data is streamed record by record, while offline data is transferred in batches over a fixed time window, so the offline analyzer can additionally draw on historical data, the real-time analyzer's results, and intermediate process data.
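A minimal sketch of that dual-path hand-off (the queue and buffer here are stand-ins for whatever transport a real deployment uses):

```python
import json
import time

realtime_queue = []  # consumed record by record by the real-time analyzer
batch_buffer = []    # flushed to offline storage on a fixed time window

def etl(raw_event: bytes) -> None:
    """Parse one raw event, attach metadata, and fan it out to both paths."""
    event = json.loads(raw_event)
    event["ingest_ts"] = time.time()  # metadata added for downstream consumers
    realtime_queue.append(event)      # streamed immediately
    batch_buffer.append(event)        # held until the batch window closes

etl(b'{"user": "alice", "type": "login"}')
```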
The figure above shows the overall architecture. The data sources feed the receiver. Log data, such as user login and access events, comes from operating systems and security systems (firewalls, security software). Application data sources, such as HR and CRM systems, use push, pull, or hybrid mechanisms depending on the situation. The last category is network data sources, such as traffic, including data obtained from the network operating system.
The data sources provide data to the receiver. The receiver exposes various APIs and connectors and must support optional filtering. The main technologies here are Flume and REST. Flume is an open-source distributed service for collecting, aggregating, and transmitting large volumes of log data. REST provides HTTP interfaces for accessing the data stores.
The data then goes to the semantic processor, which parses the data fields and can also enrich the data, for example by associating IPs with identities. The technical implementation here is Redis. The semantic processor also needs filters to drop events that require no processing, such as internal data backups between two IPs, which are filtered out if they have no security relevance. Other configurable properties are also important: data parsing configuration, associating users with IPs, associating data attributes with external attributes, and adjustable filters.
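For instance, the IP-to-identity enrichment and the backup-traffic filter might look like this (assuming the redis-py client and a running Redis instance; the key scheme and filter rule are hypothetical):

```python
import redis  # assumes the redis-py package and a local Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The IP-to-identity mapping would be maintained elsewhere (e.g. from VPN
# or DHCP logs); one entry is seeded here for illustration.
r.set("ip2user:10.1.2.3", "alice")

INTERNAL_BACKUP_PAIRS = {("10.9.0.1", "10.9.0.2")}  # hypothetical filter rule

def enrich_and_filter(event: dict):
    """Attach an identity to the event; drop security-irrelevant traffic."""
    if (event.get("src_ip"), event.get("dst_ip")) in INTERNAL_BACKUP_PAIRS:
        return None  # filtered out: internal backup between two IPs
    event["user"] = r.get("ip2user:" + event["src_ip"])
    return event

print(enrich_and_filter({"src_ip": "10.1.2.3", "dst_ip": "10.5.5.5"}))
```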
After processing, the data is handed to the distribution module for real-time and offline analysis.
Storm or Spark Streaming can be used for the real-time path. There is a further division of labor here, which will be explained in detail later. Different machine learning models run at this stage and generate security-related scores.
The scored indicators are provided to the UI, which includes visual maps, threat alarms, and so on; actions can also be output directly. The monitored data is persisted in a database. When security personnel need to investigate, the data is retrieved from the database, and in the case of a false alarm, the analysis result is fed back to the database.
During incident investigation, security personnel may need to obtain data from multiple channels, so an access layer is provided here, comprising APIs for the various databases and user interfaces.
The offline infrastructure includes SQL access to SQL repositories, time-series databases for timestamped data, and graph databases. Graphs show the associations between entities and anomalies, interactions between users, time sequences, anomalous nodes, and so on, with additional annotations. Graph data is therefore an important tool for analysis.
Offline batch analysis can draw on data from the time-series store, the graph store, the SQL store, and third parties. Model management covers model registration and storage: registering and storing model type definitions, and storing model states.
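A minimal sketch of what model registration and state storage might look like (the interface is hypothetical; the article names the responsibilities but not an API):

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistry:
    """Holds model type definitions and per-instance model state."""
    type_definitions: dict = field(default_factory=dict)
    states: dict = field(default_factory=dict)

    def register_type(self, name: str, definition: dict) -> None:
        self.type_definitions[name] = definition

    def save_state(self, model_id: str, state: bytes) -> None:
        self.states[model_id] = state  # e.g. serialized baseline parameters

registry = ModelRegistry()
registry.register_type("login_baseline", {"window": "24h", "algo": "ewma"})
registry.save_state("login_baseline:user_a", b"...serialized state...")
```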
There are also some auxiliary modules. One requirement is that models be shareable with other deployments, for example across a multinational company with infrastructure in different locations; the security graph can be shared in the same way. The bottom layer is Hadoop. In addition, a control layer is needed to monitor the operation of the platform itself.
Real-time processing is divided into two modules, representing the anomaly detection and threat detection stages respectively. The output of anomaly analysis feeds the threat analysis module. In practice, the two stages can be the same module, executed in phases with different models.
Anomaly detection output goes to the anomaly writer, whose job is to store the anomaly information in the database and synchronize it to the time-series database, HBase, and the graph database. Once an event is determined to be anomalous, the event correlation graph is updated; it is recommended to aggregate these updates by frequency, for example once a day. Threat analysis output is stored in the same way.
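A toy sketch of that fan-out, with in-memory stand-ins for the three stores the article names (a time-series database, HBase, and a graph database):

```python
# In-memory stand-ins; a real deployment would use client libraries for a
# time-series DB, HBase, and a graph DB.
tsdb, kv_store, graph_edges = [], {}, []

def write_anomaly(anomaly: dict) -> None:
    """Fan one anomaly record out to all three stores, keeping them in sync."""
    tsdb.append((anomaly["ts"], anomaly))                   # time-ordered view
    kv_store[anomaly["id"]] = anomaly                       # durable record (HBase role)
    graph_edges.append((anomaly["entity"], anomaly["id"]))  # entity->anomaly edge

write_anomaly({"id": "a1", "ts": 1700000000, "entity": "user_a",
               "kind": "odd_login"})
```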
User and Entity Behavior Analytics (UEBA)
UEBA detects anomalies and threats through behavior baselines for the various interacting entities; detection is done by comparison against the baseline. The platform adapts the baselines as new data arrives and supports multiple machine learning models.
The figure above shows the process of building a behavioral baseline. For example, person A uses server S1 to access the source code server as part of daily work, and occasionally accesses server S3. The platform therefore forms a baseline from person A's network access activities. The same applies to person B.
In fact, baselines can be generated not only for users but for any type of entity, including users, groups, devices, device groups, and applications. In the example above, a time-based baseline could also be generated for server S3. Baselines are continuously updated as new events arrive (both real-time and batch), i.e., they are adaptive. If person B starts to access server S1 frequently and this access is judged legitimate, his baseline is automatically updated.
Incoming event data is compared against the entity's baseline. The threshold for change can be defined statically or dynamically, and exceeding it is considered anomalous. The comparison can rest on a variety of techniques, such as time-series analysis of the number of logins per hour, or machine learning and graph analysis executed by the various models.
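As a concrete, simplified example of the comparison step (a statistical baseline over hourly login counts; the toy data and the 3-sigma threshold are illustrative, and the real platform supports many model types):

```python
import statistics

# Historical hourly login counts for one entity (toy data).
history = [3, 4, 5, 3, 4, 6, 5, 4, 3, 5, 4, 4]

mean = statistics.mean(history)
stdev = statistics.pstdev(history)

def is_anomalous(observed: int, k: float = 3.0) -> bool:
    """Flag an observation more than k standard deviations above baseline.

    k is a static threshold here; it could also be adjusted dynamically
    as the baseline adapts.
    """
    return observed > mean + k * stdev

print(is_anomalous(5))   # False: within the normal baseline
print(is_anomalous(40))  # True: a large deviation worth investigating
```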
That is the overall framework. Later chapters will cover the details of the individual components, including the data access and preparation engine, the processing engines, real-time/offline configuration, the machine learning models, and the various applications and interactions.