Hacking Book | Free Online Hacking Learning


Building an Open Source SIEM Platform for Enterprise Security Construction (Middle)

Posted by chiappelli at 2020-02-27


SIEM (Security Information and Event Management), as the name implies, is a management system for security information and events. Commercial SIEM systems are not cheap, and most enterprises cannot afford them. Based on the author's experience, this article introduces how to use open-source software to analyze data offline and identify attack behavior through attack modeling.

Review system architecture

Taking the database as an example: the MySQL query logs are collected with Logstash, backed up to an HDFS cluster in near real time, and analyzed offline for attack behavior with Hadoop scripts.

Database log collection

There are three common ways to collect database logs:

Mirror mode

Most database audit products support this mode: mirror the database traffic, decode the database protocol, reassemble the SQL statements, and extract the SQL logs.

Proxy mode

The typical example is a DB proxy; Baidu, Sohu, Meituan, Jingdong (JD) and others have open-sourced related products. The front end accesses the real database servers at the back end through the DB proxy, so SQL logs can be collected directly on the DB proxy.

Client mode

A client (agent) is installed on the database server to collect SQL logs; a typical approach is to collect the logs with Logstash. This article explains the client mode; the other modes are essentially similar.

Logstash configuration


Download Logstash from https://www.elastic.co/downloads/logstash (version 5.2.1 at the time of writing).

Enable the MySQL query log

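The general query log is disabled by default. A minimal sketch of enabling it in my.cnf (MySQL 5.1+; the log path here simply matches the Logstash input used in this article):

```ini
[mysqld]
# Log every statement the server receives to a file
general_log      = 1
general_log_file = /var/log/mysql/mysql.log
log_output       = FILE
```

It can also be toggled at runtime with SET GLOBAL general_log = 'ON'; without a restart. Keep in mind the performance caveat discussed at the end of this article.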

Configure logstash

localhost:work maidou$ cat mysql.conf

input {
    file {
        type => "mysql_sql_file"
        path => "/var/log/mysql/mysql.log"
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}

output {
    kafka {
        # Logstash 5.x renamed broker_list to bootstrap_servers and
        # compression_codec to compression_type
        bootstrap_servers => "localhost:9092"
        topic_id => "test"
        compression_type => "snappy" # one of ["none", "gzip", "snappy", "lz4"], default: "none"
    }
}


Run logstash

bin/logstash -f mysql.conf

Examples of logs

2017-02-16T23:29:00.813Z localhost 170216 19:10:15         37 Connect   root@localhost on

2017-02-16T23:29:00.813Z localhost                 37 Quit

2017-02-16T23:29:00.813Z localhost                 38 Connect   root@localhost on

2017-02-16T23:29:00.813Z localhost                 38 Query     SHOW VARIABLES LIKE 'pid_file'

Splitting fields

The simplest approach is to skip field splitting altogether. If you want to automatically split out fields such as database name and time, refer to:

Grok syntax


Grok syntax debugging
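If you do want to split fields without grok, a rough sketch in Python; the regex and field names here are my own assumptions based on the general-log sample above:

```python
import re

# Rough shape of a MySQL general query log entry, e.g.:
#   "170216 19:10:15    37 Query    SHOW VARIABLES LIKE 'pid_file'"
# The timestamp is printed only when it changes, so it is optional here.
LOG_RE = re.compile(
    r"^(?P<time>\d{6}\s+\d{1,2}:\d{2}:\d{2})?\s*"
    r"(?P<thread_id>\d+)\s+"
    r"(?P<command>Connect|Query|Quit|Init DB)\s*"
    r"(?P<argument>.*)$"
)

def parse_line(line):
    """Return a dict of fields, or None if the line is not a log entry."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None
```

For example, parse_line("170216 19:10:15    37 Query    SELECT 1") yields command "Query" and argument "SELECT 1".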


Common attack features

Build a test range with the commonly used wavsep; see my other article, a guide to building a wavsep-based range: http://www.freebuf.com/sectool/125940.html

Scan the links with sqlmap

Analyze the attack characteristics; two are given here, and readers are encouraged to summarize more attack features themselves.

Feature 1


Enumerating data with a UNION query produces a large number of NULL fields.
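For illustration, a hypothetical log line from union-based enumeration, checked with a short Python snippet; the regex keys on three or more consecutive "null," fields, the same pattern used in the detection rule later in this article:

```python
import re

# Hypothetical log line produced by union-based enumeration
line = ("46 Query SELECT username FROM users WHERE id=1 "
        "UNION ALL SELECT NULL,NULL,NULL,NULL,NULL-- ")

# Feature 1: three or more consecutive "null," fields
union_null = re.compile(r"(null,){3,}", re.IGNORECASE)

print(bool(union_null.search(line)))  # True for the line above
```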

Features 2 and 3

When enumerating the database structure, information_schema is used; in addition, some scanners use "GROUP BY x)a".

2017-02-16T23:29:00.998Z localhost    46 Query SELECT username, password FROM users WHERE username='textvalue' AND (SELECT 7473 FROM(SELECT COUNT(*),CONCAT(0x7171716271,(SELECT (CASE WHEN (8199= 8199) THEN 1 ELSE 0 END)),0x717a627871,FLOOR(RAND(0)*2))x FROM INFORMATION_SCHEMA.PLUGINS GROUP BY x)a)-- LFpQ' AND password='textvalue2'

Hadoop offline processing

Hadoop is based on the map and reduce model.

A simplified understanding is:

cat data.txt | ./map | ./reduce

In the most simplified case, we can develop only the map program and match attack behavior by processing the log data line by line in the map.

The mapper below is developed in Perl; a Python version would be similar.

#!/usr/bin/perl -w
use strict;

# Detection rule covering the features above: union null padding,
# information_schema access, and the "GROUP BY x)a)" scanner pattern
my $rule = "(null,){3,}|information_schema|GROUP BY x\\)a\\)";

while ( my $line = <STDIN> ) {
    chomp $line;
    # Print the matching log line as an alert (case-insensitive match)
    if ( $line =~ /$rule/i ) {
        print "$line\n";
    }
}
Just run it under Hadoop.
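"Run it under Hadoop" here means Hadoop Streaming, which turns any program that reads stdin and writes stdout into a mapper. A sketch of the invocation (the jar location and HDFS paths are assumptions for your cluster):

```shell
# Map-only streaming job (reduces set to 0): every line the Perl
# mapper prints becomes an alert record in the output directory.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input /siem/mysql_logs \
    -output /siem/mysql_alerts \
    -mapper map.pl \
    -file map.pl
```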

Production environment

The rules in a production environment will be much more complex than this and need to be supplemented continuously; these are only examples.

If only the map is written, there will be a large number of duplicate alarms; a reduce needs to be developed for aggregation.
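A minimal sketch of such a reduce in Python, assuming the mapper simply prints each matching log line. Hadoop Streaming sorts mapper output before the reduce stage, so identical alert lines arrive adjacent to each other:

```python
import sys
from itertools import groupby

def reduce_alerts(lines):
    """Collapse duplicate alert lines into (line, count) pairs.

    Because streaming input is sorted by key, identical lines are
    adjacent and groupby is sufficient for the aggregation.
    """
    stripped = (line.rstrip("\n") for line in lines)
    return [(key, sum(1 for _ in group)) for key, group in groupby(stripped)]

if __name__ == "__main__":
    # Emit one "alert<TAB>count" record per distinct alert line
    for alert, count in reduce_alerts(sys.stdin):
        print("%s\t%d" % (alert, count))
```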

For incident response, you need to know which database was injected and which account was used; these fields need to be added when Logstash splits the log.

For incident response, it is also better to know the link (URL) corresponding to the SQL injection, which requires correlating the web access logs with the SQL logs. A relatively mature scheme is to learn a time-based correlation matrix with machine learning.

Collecting SQL data directly on the client requires MySQL to enable the query log, which has a large performance impact on the server. The large companies I know of mainly access databases through a DB proxy, so it is recommended to collect the data on the DB proxy.

Rule-based recognition of SQL injection has a bottleneck: although it is an improvement over detection at the web-log and traffic levels, SQL semantic analysis is the inevitable next step.


Mining attacks based on sandboxes and algorithms will be covered in the next part.