
Attack Detection Based on Machine Learning (2)

Posted by harmelink at 2020-03-14

In the last article, we covered the basic structure of an LSTM and its forward-propagation process. In this installment, we return to our original goal: using an LSTM to detect attacks.

We know that one advantage of recurrent neural networks is that they can retain the sequence information in structured data and thus understand it better. So can we do the same with our attack traffic?

Let's revisit our example from Attack Detection Based on Machine Learning (1):

The idea we rely on is simple: a request is just text! Not human language like "I am a boy", but computer language. Either way, though, the language has to follow certain rules and conventions. A URL, for example, has the complete syntax:

protocol + username/password + domain name + port + path
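
As a quick sanity check, Python's built-in urlparse splits a URL into exactly these components (the sample URL here is made up for illustration):

from urllib.parse import urlparse

# a made-up URL carrying every component listed above
u = urlparse('http://user:pass@example.com:8080/admin/login.php?id=1')
print(u.scheme, u.username, u.password, u.hostname, u.port, u.path, u.query)
# -> http user pass example.com 8080 /admin/login.php id=1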

Or take a parameter id whose normal values are 1, 2, 3, 4: if a union select 1 suddenly shows up, something must be wrong (or at least out of line with the usual rules).

Based on this idea, we can treat the request URL as a piece of text and classify it accordingly, as the sketch below shows.
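
Concretely, "treating a URL as text" here means splitting the request into a character sequence, which is exactly what the getw2v function further down does (the sample URL is invented):

url = '/index.php?id=1 union select 1'
tokens = list(url)      # character-level "words"
print(tokens[:10])      # ['/', 'i', 'n', 'd', 'e', 'x', '.', 'p', 'h', 'p']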

Enough talk; on to the code.

First, import our common libraries:

import os
import sys
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from urllib.parse import urlparse
from sklearn import preprocessing
from sklearn.utils import shuffle
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from gensim.models import Word2Vec

Next, we define a small helper for the label, which we will store in a pandas Series:

def getlabel(x):
    # 0 = normal request, 1 = malicious request
    if x == 0:
        return 0
    elif x == 1:
        return 1

Then comes the data preprocessing for the LSTM. Generally speaking, we can't feed raw text like "I am a boy" straight into an LSTM; no model could stand that (haha).

The usual approach is word embedding, which includes n-grams, word2vec, doc2vec, and so on. Their purpose is to turn text into numerical feature vectors.

For example, after embedding, "I am a boy" should look something like this:

I   -> x1
am  -> x2
a   -> x3
boy -> x4

Note that x1, x2, x3, x4 are all different; each may be a 1 * 4 numeric vector, or more generally 1 * n.
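
Here is a minimal, self-contained sketch of this mapping with gensim (the toy two-sentence corpus and vector_size=4 are assumptions for illustration only; gensim releases before 4.0 call this parameter size):

from gensim.models import Word2Vec

# toy corpus: two "sentences" given as token lists
sentences = [['I', 'am', 'a', 'boy'], ['I', 'am', 'a', 'girl']]
model = Word2Vec(sentences, vector_size=4, min_count=1)

print(model.wv['boy'])   # a 1 * 4 numeric vector, different for every word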

In the code, we use the gensim library to train the word2vec embedding:

def getw2v(url_list, label_list):
    stop = []    # optional stop-character list; empty here
    w2v_list = []
    # split every URL into a sequence of single characters
    for i in range(0, url_list.size):
        tmp = []
        name = url_list[i]
        for j in range(0, len(name)):
            tmp.append(name[j])
        w2v_list.append(tmp)
    # train word2vec on the character sequences (vectors default to 100 dims)
    model = Word2Vec(w2v_list, min_count=5)
    model.wv.save_word2vec_format('word2vec.txt', binary=False)
    label_vect = []
    wv_vect = []
    for i in range(0, url_list.size):
        name = url_list[i]
        tmp = []
        # map each character to its 1 * 100 vector, keeping at most 50 of them
        for j in range(0, len(name)):
            if name[j] in stop:
                continue
            if name[j] not in model.wv:
                continue    # characters rarer than min_count have no vector
            tmp.append(model.wv[name[j]])
            if j >= 49:
                break
        # pad short requests with 1 * 100 zero vectors up to length 50
        if len(tmp) < 50:
            for k in range(0, 50 - len(tmp)):
                tmp.append([0] * 100)
        vect = np.vstack(tmp)
        wv_vect.append(vect)
        label_vect.append(label_list[i])
    wv_vect = np.array(wv_vect)
    label_vect = np.array(label_vect)
    return wv_vect, label_vect

The code above trains word2vec so that every single letter or character is represented by a 1 * 100 vector, and saves the trained vectors to a text file.

After that, we map each character of every URL request to its 1 * 100 vector. Since the LSTM needs fixed-length input, we keep only the first 50 characters of each request and discard the rest; requests shorter than 50 characters are padded with 1 * 100 zero vectors until they reach length 50 (see the sketch below).
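
The truncate-and-pad policy, pulled out on its own (to_fixed_length is a hypothetical helper written just for this illustration):

import numpy as np

def to_fixed_length(char_vectors, max_len=50, dim=100):
    # char_vectors: a list of 1 * 100 vectors, one per character
    vecs = char_vectors[:max_len]       # keep the first 50 characters
    while len(vecs) < max_len:
        vecs.append(np.zeros(dim))      # zero-pad the remainder
    return np.vstack(vecs)              # shape (50, 100)

print(to_fixed_length([np.ones(100)] * 3).shape)   # (50, 100)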

The next step is to read our dataset and process it accordingly:

normal_data = pd.read_csv('normal.csv')
abnormal_data = pd.read_csv('risk.csv')

# label the two classes: 0 = normal, 1 = malicious
normal_data['label'] = normal_data['url'].map(lambda x: getlabel(0)).astype(int)
abnormal_data['label'] = abnormal_data['url'].map(lambda x: getlabel(1)).astype(int)

# drop the metadata columns we don't need; the original column list was cut
# off after 'destination', so 'destination_ip' and the end of the statement
# are assumed here to keep the code runnable
abnormal_data = abnormal_data.drop(['id', 'risk_type', 'request_time',
                                    'http_status', 'http_user_agent', 'host',
                                    'cookie_uid', 'source_ip',
                                    'destination_ip'], axis=1)

train_data = pd.concat([normal_data, abnormal_data], axis=0)
train_data = shuffle(train_data)

w2v_word_list, label_list = getw2v(train_data['url'].values,
                                   train_data['label'].values)

Then we split the data into a training set and a test set however we like, build the model, and train it:

x_train = w2v_word_list[0:8000]
y_train = label_list[0:8000]
x_test = w2v_word_list[8000:]
y_test = label_list[8000:]

model = Sequential()
# input shape: 50 characters per request, 100 dims per character
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=(50, 100)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('training now......')
model.fit(x_train, y_train, epochs=50, batch_size=32)   # nb_epoch is epochs in Keras 2

print('evaluating now......')
score, acc = model.evaluate(x_test, y_test)
print(score, acc)
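
Once trained, scoring a new request is just a matter of running the same preprocessing and calling predict. A minimal sketch (the sample is taken from x_test for convenience; a fresh URL would first go through the getw2v-style mapping):

sample = x_test[:1]                   # one (50, 100) request matrix
prob = model.predict(sample)[0][0]    # sigmoid output in [0, 1]
print('malicious' if prob > 0.5 else 'normal', prob)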

The final results are as follows