# Document scanning: deep neural networks in practice on mobile - Youdao Technology Salon blog

Xian Yushu (Senior R&D Engineer, Youdao)

With the success of deep learning in image processing, academia has turned its attention back to neural networks; with AlphaGo making headlines in Go, the whole technology community has focused on the keywords "deep learning" and "neural network". Contrary to the popular impression, neural networks are not a lofty or obscure class of algorithms; compared with some of the more mathematically flavored algorithms in machine learning, they can even be called "simple and crude". Training a neural network and applying it in practice, however, involve many difficulties and a good deal of hard-won experience, and these are the parts that require skill.

The document scanning feature recently added to Youdao Cloud Notes uses neural network algorithms. This article uses those algorithms as a thread to discuss how neural networks work and how they are applied in engineering.

First, what is the document scanning feature? It aims to recognize the region occupied by a document in a photo taken by the user, stretch that region to restore its proportions, recognize the text, and finally produce a clean picture or a formatted text note. This requires the following steps:

1. Identify the document region

Separate the document from the background and determine its four corners;

2. Stretch the document region to restore the aspect ratio

Using the coordinates of the four corners and the principles of perspective, compute the document's original aspect ratio and stretch the document region into a rectangle. This is the only step with an analytic algorithm;

3. Color enhancement

Depending on the type of document, choose a suitable color enhancement method so that the document picture looks clean;

4. Layout recognition

Understand the layout of the document picture and locate the text regions in it;

5. OCR

Convert the "characters" that exist only as pixels into encoded characters;

6. Generate notes

Using the layout of the document picture, generate a formatted note from the OCR results.
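
Step 2 is the one step with a closed-form solution, so it lends itself to a small sketch. The Python below is only an illustration under stated assumptions: it maps four detected corners onto a rectangle whose size is supplied by the caller, by solving for a 3x3 homography. The names `solve_linear`, `homography`, and `apply_h`, the corner coordinates, and the 200 x 300 target size are all hypothetical, and the recovery of the original aspect ratio from perspective is not shown.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """3x3 homography (bottom-right entry fixed to 1) mapping src[i] to dst[i]."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), and likewise for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, x, y):
    """Apply homography h to the point (x, y)."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corners of a document in a photo (clockwise from top-left)
quad = [(10, 20), (210, 40), (230, 330), (5, 300)]
# Assumed target rectangle; a real pipeline would derive its aspect ratio
rect = [(0, 0), (200, 0), (200, 300), (0, 300)]
H = homography(quad, rect)
```

A full warp would invert this mapping and, for every pixel of the output rectangle, sample the photo at the corresponding source position.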

In the steps above, "stretch the document region" and "generate notes" have analytic algorithms or clear rules and need no machine learning. The remaining steps all involve machine learning. Of these, "identify the document region" and "OCR" are the two steps we implement with deep neural networks.

We chose deep neural networks for these two steps because other algorithms struggle to meet our needs:

• The scenes are complex, and shallow learning generalizes poorly;

At the same time, the usual difficulties of deep neural networks are less severe in these two steps:

• Both belong to the image and time-series domains in which deep neural networks excel;
• Large amounts of data are available, and the data can be labeled unambiguously.

In what follows, we focus on the neural network algorithm used in the "identify the document region" step.

2、 Algorithm

The neural network algorithm used for document region recognition is FCN [1]. Before introducing FCN, we briefly introduce its foundation, the convolutional neural network (assuming the reader has a basic understanding of artificial neural networks).

（1） Convolutional neural networks (CNN)

The idea behind the convolutional neural network (CNN) goes back as far as 1962 [2], and the most widely used structure is probably LeCun's from 1998 [3]. Like an ordinary neural network, a CNN consists of an input layer, an output layer, and several hidden layers.

Each layer of a CNN is not one-dimensional but has three dimensions (height, width, number of channels). For example, if the input is an RGB picture, the input layer has dimensions (picture height, picture width, 3).

Compared with an ordinary neural network, a CNN has the following characteristics:

• A node in layer n is not connected to every node in layer n-1, but only to the nodes of layer n-1 near its own spatial position;
• Within a layer, all nodes share the same weights;
• Every few layers there is a pooling layer, whose job is to shrink the height and width of the layer (usually by half). Two pooling methods are common: local maximum (max) and local mean (mean).

As pooling layers are added, the height and width of the hidden layers in a CNN keep shrinking. Once they have shrunk far enough (usually to single digits), the CNN attaches a traditional fully connected network on top, and the network structure is complete.
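
The three characteristics above can be made concrete with a minimal pure-Python sketch. It is an illustration only: the function names `conv2d` and `max_pool2x2` are hypothetical, the example uses a single channel, and the kernel is chosen by hand rather than learned. Each output node reads only a small neighborhood (local connectivity), the same kernel is reused at every position (weight sharing), and pooling halves the height and width.

```python
def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation form, single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # Every output node sees only a kh x kw neighborhood and reuses one kernel.
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2x2(fmap):
    """Max pooling over non-overlapping 2x2 windows (halves height and width)."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]


# A 6x6 picture with a vertical edge between its left and right halves
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-chosen horizontal-gradient kernel (a learned kernel would replace this)
kernel = [[-1, 0, 1] for _ in range(3)]

feature_map = conv2d(image, kernel)   # 4x4, responds where the edge is
pooled = max_pool2x2(feature_map)     # 2x2
```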

CNNs are effective because they exploit constraints inherent in images. The first characteristic corresponds to the local correlation of images (a point in the upper right corner has little to do with a point in the lower left corner); the second corresponds to translation invariance (a shape in the upper right corner is still the same shape when moved to the lower left corner); the third corresponds to scale invariance (little information is lost when an image is scaled). Adding these constraints is much like discovering the "conservation of momentum" in physics: a conservation law makes the motion of objects predictable, and added constraints make the recognition process controllable, reduce the demand for training data, and make overfitting less likely.

（2） FCN (fully convolutional networks)

FCN is an algorithm built on CNN. Unlike CNN, FCN addresses recognition problems whose target is not an image-level label but a pixel-level label. For example:

• Image segmentation divides the image into several categories according to semantics, with each pixel assigned a classification result;
• Edge detection separates the edge and non-edge parts of the image, with each pixel labeled "edge" or "non-edge" (this is the kind of problem we face);
• Video segmentation applies image segmentation to consecutive video frames.

In a CNN, pooling layers shrink the height and width of the hidden layers, whereas FCN must produce labels at full height and width. How can this contradiction be resolved?

One way is to drop the pooling layers, so that every hidden layer has the full height and width. This has two drawbacks. First, the computation is very heavy, especially in the higher layers of a CNN, where the number of channels reaches the hundreds or thousands. Second, without pooling the convolutions always operate on local regions, so the recognition result makes no use of global information.

The other way is transposed convolution, which can be understood as a pooling layer run in reverse, or as an upsampling layer: it enlarges a hidden layer by interpolation. This is the approach FCN adopts. Of course, the last hidden layer of a CNN is so small in height and width that it holds essentially only global information; upsampling that layer alone would lose local detail. FCN therefore also upsamples several hidden layers from the middle of the CNN. Because these middle layers have been shrunk less and retain more local detail, their upsampled results also contain more local information. Finally, the upsampled results are combined to form the output, which balances global and local information.

The structure of the whole FCN is shown in the figure above. FCN removes the fully connected layers at the top of the CNN. Each transposed convolution layer is preceded by a classifier; the classifier outputs are upsampled (by transposed convolution) and then summed.

The figure above shows actual upsampling results from our experiments. The lower hidden layers retain a great deal of picture detail, while the higher hidden layers capture the global distribution better. Combining them yields a result that loses neither global nor local information.
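
The fusion of coarse and fine outputs can be sketched in a few lines of Python. This is a simplification under stated assumptions: nearest-neighbor replication stands in for the learned transposed convolution, the score maps are square, single-channel, and made up for illustration, and the names `upsample_nearest` and `fuse` are hypothetical.

```python
def upsample_nearest(score, factor):
    """Enlarge a 2-D score map by an integer factor (nearest neighbor)."""
    return [[score[i // factor][j // factor]
             for j in range(len(score[0]) * factor)]
            for i in range(len(score) * factor)]


def fuse(score_maps, full_size):
    """Upsample each square score map to full_size x full_size and sum them."""
    fused = [[0.0] * full_size for _ in range(full_size)]
    for score in score_maps:
        up = upsample_nearest(score, full_size // len(score))
        for i in range(full_size):
            for j in range(full_size):
                fused[i][j] += up[i][j]
    return fused


# Made-up classifier outputs at three depths of a network
coarse = [[1.0]]                                   # 1x1: global information only
middle = [[0.0, 1.0], [0.0, 1.0]]                  # 2x2: rough location
fine = [[0.0, 0.0, 0.5, 0.0] for _ in range(4)]    # 4x4: local detail

fused = fuse([coarse, middle, fine], full_size=4)
```

In the fused map the strongest response sits where the rough location from the middle layer and the detail from the fine layer agree.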

（3） Transposed convolution

Comparing with the diagrams provided by conv_arithmetic, the figure above is simply the convolution diagram flipped upside down. In the actual computation, the value of an input-layer node (weighted by the convolution kernel) is added to every output-layer node connected to it.

In terms of dimensions, if the convolution kernel has height $h$ and width $w$, the input layer has $c$ channels, and the output layer has $o$ channels, then one forward convolution operation has $h \times w \times c$ input nodes and $o$ output nodes, while one transposed convolution operation has $c$ input nodes and $h \times w \times o$ output nodes.
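
The "add each weighted input value to every related output node" description translates directly into a scatter-add loop. The sketch below is a minimal single-channel illustration in pure Python, not the implementation used in the product; the function name `conv_transpose2d` and the example values are hypothetical.

```python
def conv_transpose2d(inp, kernel, stride):
    """Single-channel transposed convolution by scatter-add."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(inp) - 1) * stride + kh
    out_w = (len(inp[0]) - 1) * stride + kw
    out = [[0.0] * out_w for _ in range(out_h)]
    for i, row in enumerate(inp):
        for j, value in enumerate(row):
            # One input node contributes to a kh x kw patch of output nodes.
            for di in range(kh):
                for dj in range(kw):
                    out[i * stride + di][j * stride + dj] += value * kernel[di][dj]
    return out


# Stride 2 with a 2x2 kernel of ones: each input value fills its own 2x2 patch
doubled = conv_transpose2d([[1, 2], [3, 4]], [[1, 1], [1, 1]], stride=2)

# Stride 1: patches overlap, and overlapping contributions are summed
overlap = conv_transpose2d([[1, 2]], [[1, 1]], stride=1)
```

With a stride of 2 the output is twice as high and wide as the input, which is how a transposed convolution undoes the halving performed by a pooling layer.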

（4） Improved cross entropy loss function

In edge detection, every pixel belongs to one of two categories, "edge" or "non-edge", so every pixel can be regarded as a training sample. This raises a problem: a picture usually has far fewer edge pixels than non-edge pixels, so the two categories differ greatly in sample count. In pattern recognition, class imbalance leads to many uncontrollable results and should be avoided as far as possible.

Normally we would handle this by oversampling the minority category, or by generating synthetic data based on the spatial distribution of the original samples. Here, however, a single picture contains many samples, so neither method applies. How, then, can the disparity in sample counts be handled?

In 2015, a paper at ICCV [4] proposed an edge detection model called HED, which tackles this problem by changing the definition of the loss function. Our algorithm uses the same method.

First, a brief review of the cross entropy loss function commonly used with CNNs. For a two-class problem, cross entropy is defined as:

$$l = -\sum_{k=1}^{n} \left[ q_k \log p_k + (1 - q_k) \log (1 - p_k) \right]$$

Here $l$ is the loss value, $n$ is the number of samples, $k$ is the sample index, and $q_k$ is the label, which is 0 or 1. $p_k$ is the probability, computed by the classifier, that sample $k$ belongs to category 1; it lies between 0 and 1.

Although this function looks complicated, taking the exponential ($L = \exp(-l)$) shows that it is simply the probability that every sample is predicted correctly. For example, if the labels of the sample set are (1, 1, 0, 1, 1, 0, ...):

$$L = p_1 \, p_2 \, (1 - p_3) \, p_4 \, p_5 \, (1 - p_6) \cdots$$

Here $L$ is the likelihood function, i.e. the probability that all samples are predicted correctly.

HED uses a weighted cross entropy function. For example, when the samples with label 0 are the scarce ones, the weighted cross entropy function is defined as:

$$l = -\sum_{k=1}^{n} \left[ q_k \log p_k + W (1 - q_k) \log (1 - p_k) \right]$$

Here $W$ is the weight, which must be greater than 1. Let $W = 2$ and consider the likelihood function again:

$$L = p_1 \, p_2 \, (1 - p_3)^2 \, p_4 \, p_5 \, (1 - p_6)^2 \cdots$$

The samples of category 0 now appear twice in the likelihood function, and their share increases accordingly. So although we cannot actually enlarge the minority category, modifying the loss function achieves essentially the same effect.
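
The equivalence between weighting and duplicating samples can be checked numerically. The sketch below assumes the summed (not averaged) form of binary cross entropy; the function name `weighted_cross_entropy` and the probabilities are made up for illustration.

```python
import math


def weighted_cross_entropy(labels, probs, w=1.0):
    """Summed binary cross entropy; samples with label 0 are weighted by w."""
    loss = 0.0
    for q, p in zip(labels, probs):
        if q == 1:
            loss -= math.log(p)
        else:
            loss -= w * math.log(1.0 - p)
    return loss


labels = [1, 1, 0, 1, 1, 0]
probs = [0.9, 0.8, 0.3, 0.7, 0.6, 0.2]   # hypothetical classifier outputs

# Weight 2 on the scarce category 0 ...
weighted = weighted_cross_entropy(labels, probs, w=2.0)

# ... versus physically duplicating every category-0 sample with weight 1
dup_labels = labels + [q for q in labels if q == 0]
dup_probs = probs + [p for q, p in zip(labels, probs) if q == 0]
duplicated = weighted_cross_entropy(dup_labels, dup_probs, w=1.0)

likelihood = math.exp(-weighted)   # category-0 factors appear squared
```

The two losses agree up to floating-point rounding, which is why the weight can stand in for oversampling when individual pixels cannot be resampled.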

3、 Data section

That concludes the introduction to the neural network algorithm used for document region recognition. Next, we discuss the dataset we built to train this network.

（1） Data filtering

To train the neural network model, we labeled a dataset of about 50,000 samples. However, it contains a lot of bad data and needs further filtering.

Filtering about 50,000 samples by hand is too expensive. Fortunately, judging from experience with factors such as the number of free parameters in the network, our network does not need such a large dataset, and the data is plentiful enough that we can tolerate some good samples being filtered out by mistake.

On that premise, we manually annotated a small training set (500 images) and trained an SVM classifier to filter the data automatically. The classifier only judges whether an image contains a complete document, and its classification performance is not especially strong. We deliberately favored precision over recall: the classifier may wrongly classify an image with a document as one without, but it must not classify an image without a document as one with a document.

With this classifier we filtered the roughly 50,000 samples down to about 9,000. After additional manual screening, about 8,000 samples remained, forming a dataset of assured quality.
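
One common way to obtain a precision-oriented filter of this kind is to raise the decision threshold on the classifier's score until precision on a labeled validation set is high enough. The sketch below illustrates that idea only; it is not necessarily how the SVM described above was tuned, and the scores, labels, function names, and the 0.95 target are all hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when samples with score >= threshold are kept."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Lowest threshold whose precision reaches min_precision, or None."""
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision:
            return t, p, r
    return None


# Hypothetical classifier scores; label 1 = picture contains a full document
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0]

threshold, precision, recall = pick_threshold(scores, labels, min_precision=0.95)
```

On this toy data the chosen threshold keeps only document pictures while discarding two of the five good ones, which mirrors the trade-off accepted in the text: losing some good data is fine as long as bad data does not get through.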

4、 Implementation

We trained the model with the TensorFlow framework [5]. Our ultimate goal is to run document region recognition on mobile devices, which differ from desktops in several ways:

• Mobile devices have less computing power than desktops;
• Because of bandwidth and power constraints, mobile GPUs are weaker than desktop GPUs;
• The mobile world is split between iOS and Android, which offer different optimized APIs for compute-intensive work, so code is hard to share between them;
• Mobile apps are sensitive to file size.

Because of these differences, we cannot migrate the model to mobile directly; it has to be optimized to run efficiently. There are two lines of optimization:

• Choose a suitable neural network framework and use hardware acceleration as much as possible;
• Compress the model, reducing its computational cost and file size without losing accuracy.

（1） Selection of neural network framework

Popular neural network frameworks today include TensorFlow, Caffe [6], MXNet [7], and others, and most of them have a mobile counterpart. Using these mobile frameworks directly is the most convenient choice. For example, since we trained the model with TensorFlow, we can use the mobile version of TensorFlow directly and avoid the trouble of converting the model.

Sometimes we do not need a large, full-featured neural network framework, or we need higher efficiency. In that case we can consider a lower-level library and build what we need on top of it. Examples include Eigen [8], a general-purpose matrix library, and NNPACK [9], an efficient low-level library for neural networks. If OpenCV [10] is already integrated in the code, its computation APIs are also worth considering.

If runtime efficiency requirements are very high, we can also consider heterogeneous computing frameworks on mobile, drawing on the GPU and DSP in addition to the CPU. Candidates include Metal [11] on iOS, the cross-platform OpenGL [12] and Vulkan [13], and RenderScript [14] on Android.

（2） Model compression

The simplest way to compress a model is to tune the adjustable hyperparameters of the network. Examples of hyperparameters here are the total number of layers, the number of channels in each layer, and the kernel width of each convolution. At the start of training we choose somewhat redundant hyperparameters, to make sure no hyperparameter is so small that it becomes the bottleneck for the network's performance. When compressing the model we "squeeze out" this redundancy, gradually reducing one hyperparameter at a time as long as recognition accuracy does not drop significantly. During this tuning we found that the total number of layers has a large effect on recognition quality, whereas reducing the number of channels per layer has comparatively little effect.
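
To see why the channel count is such an effective knob, it helps to count parameters. The sketch below is a back-of-the-envelope illustration with hypothetical layer sizes, not the actual network: a convolution with a square kernel of width `kernel`, `c_in` input channels, and `c_out` output channels has `kernel * kernel * c_in * c_out` weights plus `c_out` biases.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one convolution layer with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out


def model_params(kernel, channels):
    """Total parameters of a plain stack of convolutions.

    channels[0] is the number of input channels; every later entry is the
    number of output channels of one convolution layer.
    """
    return sum(conv_params(kernel, c_in, c_out)
               for c_in, c_out in zip(channels, channels[1:]))


# Hypothetical four-layer networks on RGB input, 3x3 kernels
wide = model_params(3, [3, 64, 64, 64, 64])
narrow = model_params(3, [3, 32, 32, 32, 32])

# Approximate file size if every parameter is stored as a 32-bit float
wide_bytes = wide * 4
narrow_bytes = narrow * 4
```

Halving the channels cuts the parameter count by roughly a factor of four, because most layers scale with the product of input and output channels.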

Beyond simple hyperparameter tuning, there are model structures designed specifically for mobile that compress the model significantly. Examples include SVD-based networks [15], SqueezeNet [16], and MobileNets [17]. We will not go into detail here.

（3） Final effect

After customizing the neural network framework and compressing the model, our model is about 1 MB in size and recognizes a picture within 100 ms on mainstream phones (iPhone 6, Xiaomi Mi 4, or better), with essentially no loss of recognition accuracy. The port can fairly be called a success.

5、 Summary

Two or three years ago, neural networks were generally seen as something that ran only on servers with strong computing power, with little connection to mobile phones. Over the past two or three years, however, new trends have emerged. First, as neural network algorithms have matured, some researchers have turned to reducing their computational cost, so models can now be compressed. Second, the computing power of mobile chips has grown rapidly, especially that of GPUs and DSPs. With demand falling and capability rising, mobile phones can now meet the computing needs of neural networks.

The "document scanning based on neural networks" feature was built on the shoulders of countless predecessors. In that sense our generation of developers is fortunate: we can achieve things we did not dare imagine in the past, and will achieve more in the future.

References

1. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).
2. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106-154.
3. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
4. Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1395-1403).
5. https://www.tensorflow.org/
6. http://caffe.berkeleyvision.org/
7. http://mxnet.io/
8. http://eigen.tuxfamily.org/index.php?title=Main_Page
9. https://github.com/Maratyszcza/NNPACK
10. http://opencv.org/
11. https://developer.apple.com/metal/
12. https://www.opengl.org/
13. https://www.khronos.org/vulkan/
14. https://developer.android.com/guide/topics/renderscript/compute.html
15. Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., & Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (pp. 1269-1277).
16. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
17. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.