In the previous article, we walked through the basic structure of an LSTM and its forward-propagation process. In this installment, we return to our goal and use an LSTM to detect attacks.
We know that one advantage of recurrent neural networks is that they retain sequence information from structured data and can therefore understand it better. Can we do the same for our attack payloads?
Let's take another look at our examples from Attack Detection Based on Machine Learning (1):
- http://www.aaa.com/ccc/?id=1/**/aNd/**/1>0
- http://www.aaa.com/ccc/?search=<script>alert(document.cookie)</script>
- http://www.aaa.com/ccc/?dict=../../../etc/passwd
The idea is to treat each request as a piece of text. It just isn't human language like "I am a boy" but computer language. Either way, it has to comply with certain rules and specifications. For example, the complete syntax of a URL is:
protocol + username/password + domain name + port + path + query parameters
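To make this concrete, here is a quick look at how Python's urlparse (imported later in this article) decomposes the first sample URL:
from urllib.parse import urlparse

parts = urlparse('http://www.aaa.com/ccc/?id=1/**/aNd/**/1>0')
print(parts.scheme)  # 'http'
print(parts.netloc)  # 'www.aaa.com'
print(parts.path)    # '/ccc/'
print(parts.query)   # 'id=1/**/aNd/**/1>0'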
Likewise, if a parameter id normally takes values like 1, 2, 3, 4 and a union select 1 suddenly shows up, that must be a problem (it does not conform to the expected rules).
Based on this idea, we can treat the request URL as text and classify it as such.
Enough talk, let's get to the code.
First, import our common libraries:
import os
import sys
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from urllib.parse import urlparse
import math
from sklearn import preprocessing
from sklearn.utils import shuffle
import gensim
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from gensim.models import Word2Vec
Next, we define a small helper for the label, which we will store in a pandas Series:
def getlabel(x):
    # 0 marks normal traffic, 1 marks attack traffic
    if x == 0:
        return 0
    elif x == 1:
        return 1
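A trivial sanity check of the helper:
print(getlabel(0), getlabel(1))  # -> 0 1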
Then comes the data preprocessing for the LSTM. Generally speaking, we cannot feed raw text such as "I am a boy" straight into an LSTM.
The usual fix is word embedding, using methods such as n-gram, word2vec, doc2vec, and so on. Their purpose is to turn text into numerical feature vectors.
For example, after embedding, "I am a boy" becomes four vectors x1, x2, x3, x4, one per word. Note that x1, x2, x3, x4 are all different; each may be a 1 * 4 or, more generally, a 1 * n numerical vector.
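A minimal sketch of this with gensim (toy corpus and vector size chosen purely for illustration; gensim 4.x calls the dimension parameter vector_size):
from gensim.models import Word2Vec

sentences = [['I', 'am', 'a', 'boy'], ['I', 'am', 'a', 'girl']]
toy = Word2Vec(sentences, vector_size=4, min_count=1)
print(toy.wv['boy'])        # a 1 * 4 numerical vector
print(toy.wv['boy'].shape)  # (4,)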
In the code, we use the gensim library to train the word2vec embedding:
def getw2v(url_list, label_list):
    stop = []  # characters to skip, empty for now
    w2v_list = []
    # split every URL into a sequence of single characters
    for i in range(0, url_list.size):
        tmp = []
        name = url_list[i]
        for j in range(0, len(name)):
            tmp.append(name[j])
        w2v_list.append(tmp)
    # train word2vec on the character sequences (vectors are 100-dimensional by default)
    model = Word2Vec(w2v_list, min_count=5)
    model.wv.save_word2vec_format('word2vec.txt', binary=False)

    label_vect = []
    wv_vect = []
    for i in range(0, url_list.size):
        name = url_list[i]
        tmp = []
        for j in range(0, len(name)):
            # skip stop characters and characters pruned away by min_count
            if name[j] in stop or name[j] not in model.wv:
                continue
            tmp.append(model.wv[name[j]])
            if j >= 49:  # keep at most the first 50 characters
                break
        # pad short requests with 1 * 100 zero vectors up to length 50
        if len(tmp) < 50:
            for k in range(0, 50 - len(tmp)):
                tmp.append([0] * 100)
        vect = np.vstack(tmp)  # stack into a 50 * 100 matrix
        wv_vect.append(vect)
        label_vect.append(label_list[i])
    wv_vect = np.array(wv_vect)
    label_vect = np.array(label_vect)
    return wv_vect, label_vect
The code above trains word2vec so that each single letter or character is expressed as a 1 * 100 vector, and stores the trained vectors in a txt file.
After that, we map each character of every URL request to its 1 * 100 vector. Since the LSTM needs fixed-length input, we keep only the first 50 characters of each request and discard the rest; requests shorter than 50 characters are padded with 1 * 100 zero vectors until the length reaches 50.
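As an aside, this truncate-and-pad step could also be handled by Keras's pad_sequences (which is why sequence was imported earlier); a sketch, assuming seq_list is a hypothetical list of per-request vector sequences:
from keras.preprocessing import sequence

padded = sequence.pad_sequences(seq_list, maxlen=50, dtype='float32',
                                padding='post', truncating='post')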
The next step is to read our dataset and process it accordingly:
normal_data = pd.read_csv('normal.csv')
abnormal_data = pd.read_csv('risk.csv')
normal_data['label'] = normal_data['url'].map(lambda x: getlabel(0)).astype(int)
abnormal_data['label'] = abnormal_data['url'].map(lambda x: getlabel(1)).astype(int)
# drop the metadata columns so only url and label remain
# ('destination_ip' is an assumed column name)
abnormal_data = abnormal_data.drop(['id', 'risk_type', 'request_time', 'http_status',
                                    'http_user_agent', 'host', 'cookie_uid',
                                    'source_ip', 'destination_ip'], axis=1)
train_data = pd.concat([normal_data, abnormal_data], axis=0)
train_data = shuffle(train_data)
w2v_word_list, label_list = getw2v(train_data['url'].values, train_data['label'].values)
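A quick shape check (the shapes follow from the 50-character, 100-dimension setup above):
print(w2v_word_list.shape)  # (num_requests, 50, 100)
print(label_list.shape)     # (num_requests,)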
Then we split the data into a training set and a test set however we like:
x_train = w2v_word_list[0:8000]
y_train = label_list[0:8000]
x_test = w2v_word_list[8000:]
y_test = label_list[8000:]
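The fixed index 8000 assumes a particular dataset size; an equivalent ratio-based split with scikit-learn would be:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    w2v_word_list, label_list, test_size=0.2, random_state=42)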
model = Sequential()
# input: 50 time steps, each a 100-dimensional character vector
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2, input_shape=(50, 100)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print('training now......')
model.fit(x_train, y_train, epochs=50, batch_size=32)
print('evaluating now......')
score, acc = model.evaluate(x_test, y_test)
print(score, acc)
The final results are as follows:
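Finally, once the model is trained, a new request can be scored with the same preprocessing; a sketch, where url_to_matrix is a hypothetical helper mirroring getw2v and the character vectors are reloaded from the file saved earlier:
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('word2vec.txt', binary=False)

def url_to_matrix(url, wv, maxlen=50, dim=100):
    # first 50 characters, 1 * 100 vectors, zero padding, as in getw2v
    vecs = [wv[c] for c in url[:maxlen] if c in wv]
    vecs += [[0.0] * dim] * (maxlen - len(vecs))
    return np.array(vecs)

x_new = url_to_matrix('http://www.aaa.com/ccc/?id=1/**/aNd/**/1>0', wv)
print(model.predict(x_new[np.newaxis, :, :]))  # probability the request is an attack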