## Foreword

Before moving on to the next article, I want to introduce LSTM and the basic concept and structure of recurrent neural networks; otherwise it would just be code from start to finish, which is rather dull, haha. The next article will be: Attack Detection Based on Machine Learning (II): Implementation Based on LSTM

The author does not come from a strong ML/DL background, especially not NLP, where I have no deep theoretical grounding (CV is fine, haha), so here I only want to talk about how LSTM and its predecessor RNN work, and I welcome any corrections~

## What is a recurrent neural network

A recurrent neural network (RNN) is a kind of network structure different from the traditional fully connected network or convolutional network. Its main characteristics are as follows.

First, consider the case of a single layer with a single neuron, given an input x:

x = [x1, x2, ..., xn]

where n means this data sample has n feature dimensions. When it passes through a neuron c, it takes the form:

z = w · x + b

If there is only one neuron in a single layer, the output could then go through a softmax function; here we simply use a sigmoid instead:

y = sigmoid(z) = 1 / (1 + e^(-z))
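This single neuron can be sketched in a few lines of Python (the weights and inputs here are made up purely for illustration, not taken from the post):

```python
import math

def sigmoid(z):
    # squash any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    # weighted sum over the n feature dimensions, plus a bias, then sigmoid
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# a zero pre-activation lands exactly in the middle of the output range
print(neuron([1.0, -1.0], [0.5, 0.5], 0.0))  # → 0.5
```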

## So what if it's multi-layer?

We know that in both a fully connected network and a convolutional network, propagation is strictly forward: the output of one hidden layer serves as the input of the next, and we do nothing else. Intuitively, it is a straight-line flow of information. The power of an RNN is that it not only propagates forward, it also propagates "sideways", across time steps.

## So why does it do that?

Very simply: in order to preserve information about the earlier elements of a sequence. That may be a little abstract, so consider an NLP task where one of the data samples is: I like play dota2

After tokenization, the input actually becomes ['I', 'like', 'play', 'dota2']

After embedding, the tokens are fed into the network one by one, so they become entirely independent pieces of data. But clearly our sample has an obvious contextual order, and simple forward propagation loses this context. In language, context cannot be separated out: "I like play dota2" and "dota2 like play I" mean two different things.

A clearer illustration: when we want to preserve this kind of sequence relationship so that our model can predict the next word, we judge [I am a man] to be more logical and more in line with common sense than [I am watch a movie], because word order tells us that a subject plus linking verb should be followed by a predicative, not another verb.

In short, the goal of an RNN is to process sequence data at every layer and to store this sequential information. Here the author sees some similarity to the residual connections in ResNet, which likewise let a layer receive information from an earlier hidden layer.

PS: the first time I read about RNNs, my thought was that they are very similar to some block-cipher modes of operation in cryptography, such as ECB and CFB

## So next: how to implement an RNN

The implementation works as follows. We still take a single neuron as the example, but this time the input data is sequential, such as the text sequence: I am a boy. We treat its input as ordered, that is:

- First comes I.
- Next comes am.
- Third comes a.
- Last comes boy.

After embedding the text data, every input token, such as I, becomes a numerical vector with n feature dimensions; it may look like this:

a1 = [x1, x2, x3, x4]

where the subscript of a indicates the order of appearance, and [x1, x2, x3, x4] means each embedded word is represented by a 1×4 vector.
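A toy embedding lookup makes this concrete (the vectors below are made up for illustration; real embeddings are learned):

```python
# hypothetical embedding table: each token of "I am a boy" maps
# to a fixed 1x4 numerical vector
embeddings = {
    'I':   [0.1, 0.2, 0.3, 0.4],
    'am':  [0.5, 0.1, 0.0, 0.2],
    'a':   [0.3, 0.3, 0.1, 0.0],
    'boy': [0.2, 0.4, 0.4, 0.1],
}

tokens = 'I am a boy'.split()
sequence = [embeddings[t] for t in tokens]  # a1, a2, a3, a4, in order
print(len(sequence), len(sequence[0]))      # → 4 4
```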

So when passing through the neuron, for example when a1 goes through, it looks like this:

output1 = sigmoid(a1 · w(1) + b)

where w(1) is the weight corresponding to a1.

It's very simple: as before, multiply by the weight, add a bias b, and pass the result through the sigmoid function to get this neuron's output.

## Now for the essence of RNN: how do we keep the sequence information?

Let's see what happens when a2 goes through this same neuron. It looks like this:

output2 = sigmoid(a2 · w(2) + h1 · wh + b)

As you can see, the computation of output2 is no longer just weight times input plus the bias b; it has a new term:

h1 · wh

Where does h1 come from? It is computed from the previous input a1! With this structure, when computing a2, the influence of the previous input is taken into account (for "I am a boy", computing am considers the influence of the preceding token I)!

Similarly, the computation for a3 is:

output3 = sigmoid(a3 · w(3) + h2 · wh + b)
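The recurrence above can be sketched in Python. Two assumptions in this toy version: the weights are scalars, and w_x is shared across all steps (as in a standard RNN, whereas the formulas above label a weight per position); all values are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn(xs, w_x, w_h, b, h0=0.0):
    # h carries information from every earlier element of the sequence:
    # h_t = sigmoid(w_x * x_t + w_h * h_{t-1} + b)
    h = h0
    for x in xs:
        h = sigmoid(w_x * x + w_h * h + b)
    return h

# the same values in a different order give a different final state,
# which is exactly the context sensitivity plain forward nets lack
print(rnn([1.0, 0.0, 2.0], w_x=0.7, w_h=0.5, b=0.0))
print(rnn([2.0, 0.0, 1.0], w_x=0.7, w_h=0.5, b=0.0))
```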

Let's use a picture to understand it intuitively:

Note that the propagation in the figure is not a left-to-right relationship between separate hidden layers: h1, h2, h3, h4 are all the same node in the same hidden layer!! They are simply fed different data!

So the more rigorous diagram looks like this:

The right side is what we showed above; it is actually the unrolled view of the left side! They are the same neurons in the same layer.

## OK, that's it for RNN. What about LSTM?

LSTM is an advanced form of RNN. Its full name is Long Short-Term Memory network.

Compared with an RNN, it takes a more principled approach to receiving the previous step's information:

instead of accepting all of it wholesale, it accepts only a part of it (a certain amount, a percentage).

Its structure is more complex:

Here we use slides from the deep learning lecture notes of Prof. Hung-yi Lee (Li Hongyi) of National Taiwan University to illustrate:

The first time I saw this picture, I thought: this doesn't look like the RNN above at all... haha

So let's explain what the parameters in this diagram mean:

- c (note: not the blue c', but the plain c next to it): it represents the memory carried over from the previous step, playing the role that h1 (from input I) played in the RNN
- z (bottom): it represents this round's input
- zi: an additional input that feeds the input gate
- g(z) · f(zi): this product is the counterpart of x · w in the RNN. Put another way, since the sigmoid f(zi) is normalized to 0-1, multiplying it with g(z) means only that percentage of g(z) is carried into the next computation
- zf: the input of the forget gate; f(zf) is the percentage of the previous step's memory c that is allowed to influence this step's result
- zo: the input of the output gate; likewise, it goes through a sigmoid normalized to 0-1 and is multiplied by c' to determine what percentage is actually output
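One LSTM step can be sketched with scalars, following the gate naming above (zi, zf, zo). All parameter values here are made up, and the candidate activation is taken as tanh, a common choice the post does not specify:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # p is a dict of made-up scalar parameters: one (w, u, b) triple per gate.
    # Each gate is a sigmoid in (0, 1) -- the "percentage" described above.
    g  = math.tanh(p['wz'] * x + p['uz'] * h_prev + p['bz'])  # candidate g(z)
    zi = sigmoid(p['wi'] * x + p['ui'] * h_prev + p['bi'])    # input gate
    zf = sigmoid(p['wf'] * x + p['uf'] * h_prev + p['bf'])    # forget gate
    zo = sigmoid(p['wo'] * x + p['uo'] * h_prev + p['bo'])    # output gate
    c = zf * c_prev + zi * g  # keep a fraction of old memory, admit a fraction of new input
    h = zo * math.tanh(c)     # emit a fraction of the (squashed) memory
    return h, c

params = {k: 0.1 for k in
          ['wz', 'uz', 'bz', 'wi', 'ui', 'bi',
           'wf', 'uf', 'bf', 'wo', 'uo', 'bo']}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
print(h, c)
```

Note how the forget gate zf scales the old memory c_prev while the input gate zi scales the new candidate: exactly the percentage-based selection described above.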

Haha, the author's writing skills really need work... but the idea itself is clear: on top of the RNN, each computation applies a percentage-based selection to all three values: the previous round's output, the current round's input, and the current round's output.

## Then someone will ask:

Where do these three values z, zo and zf come from? And do these new parameters need to be optimized?

Let's use another slide from the lecture notes to answer the first question:

In the figure, x1 and x2 are this round's input data, a = [x1, x2]

And z, zo, zf are all generated from these two inputs. How? Multiply by the weights and add them up, once again! Haha

That's it!

As for the second question: yes! The weights w1, w2, w3, w4, w5 and w6 above all need to be trained.

Is it roughly understandable now?

Next, it is the same as with the RNN: if you want multiple layers, you just stack them one on top of another!
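A minimal sketch of this stacking, using the scalar RNN as the layer (weights again made up): each layer's hidden-state sequence becomes the next layer's input sequence.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_layer(seq, w_x, w_h, b):
    # run one recurrent layer over the whole sequence and
    # return the hidden state at every step
    h, states = 0.0, []
    for x in seq:
        h = sigmoid(w_x * x + w_h * h + b)
        states.append(h)
    return states

def stacked_rnn(seq, layers):
    # each layer consumes the previous layer's hidden-state sequence
    for (w_x, w_h, b) in layers:
        seq = rnn_layer(seq, w_x, w_h, b)
    return seq

out = stacked_rnn([1.0, 0.0, 2.0], layers=[(0.7, 0.5, 0.0), (1.2, -0.3, 0.1)])
print(len(out))  # → 3  (one hidden state per input step)
```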

## Summary

This article only introduces RNN and LSTM, focusing on their basic structure and the forward-propagation process. More advanced topics, such as bidirectional RNNs and the attention mechanism, are not covered. The next article is:

Attack Detection Based on Machine Learning (II): Implementation Based on LSTM