Sentiment analysis is a common application of natural language processing (NLP), in particular classification methods that aim to extract the emotional content of text. Seen this way, sentiment analysis is a method for quantifying qualitative data with some kind of sentiment scoring scheme. Although sentiment is largely subjective, quantifying it has many useful applications, such as businesses analysing consumer feedback about a product, or detecting abusive content in online comments.

The simplest sentiment analysis method relies on the positive or negative polarity of individual words. Each word in a sentence receives a score: +1 for positive words and -1 for negative words. We then sum the scores of all words in the sentence to obtain a final sentiment score. Obviously this method has many limitations, the most important being that it ignores context. For example, in this simple model the phrase "not good" is classified as neutral, because "not" scores -1 and "good" scores +1, even though most people would read "not good" as negative.

Another common approach is to treat text as a "bag of words". Each text is represented as a 1xN vector, where N is the size of the vocabulary. Each position in the vector corresponds to a word, and its value is the frequency of that word in the text. For example, the phrase "bag of bag of words" can be encoded as [2, 2, 1]. These vectors can be fed into machine learning classification algorithms (such as logistic regression or support vector machines) to predict the sentiment of unseen data. Note that this is a supervised learning approach and requires data with known sentiment labels as a training set. Although it improves on the previous model, it still ignores context, and the size of the feature vectors grows with the vocabulary.
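As a minimal sketch of the bag-of-words encoding described above (using scikit-learn's CountVectorizer purely for illustration; the toy phrase is the one from the text):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy illustration of the bag-of-words encoding described above.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(["bag of bag of words"])

print(vectorizer.get_feature_names_out())  # ['bag' 'of' 'words']
print(counts.toarray())                    # [[2 2 1]]
```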
Word2Vec and Doc2Vec

Google recently developed a method called word2vec that captures contextual information while also compressing the size of the data. Word2vec is actually two different methods: continuous bag of words (CBOW) and skip-gram. CBOW aims to predict the probability of the current word given its context. Skip-gram is the opposite: it predicts the probability of the context given the current word (as shown in Figure 1). Both methods use an artificial neural network as the classification algorithm. Initially each word is assigned a random N-dimensional vector; during training the algorithm learns the optimal vector for each word using CBOW or skip-gram.
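A minimal sketch of training both variants with gensim (the toy corpus and parameter values are illustrative only; parameter names follow recent gensim releases, where the `sg` flag selects the variant):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "terrible"],
             ["a", "great", "film"]]

# sg=0 selects CBOW (predict the current word from its context),
# sg=1 selects skip-gram (predict the context from the current word).
cbow_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(cbow_model.wv["movie"][:5])  # first 5 dimensions of the learned vector
```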
These word vectors now capture contextual information, and we can use basic algebra to discover relationships between words (for example, "king" - "man" + "woman" = "queen"). The word vectors can be used in place of the bag of words to predict the sentiment of unseen data. The advantage of this model is that it not only takes context into account but also compresses the data: typically each word is represented by a vector of around 300 dimensions instead of a vocabulary-sized vector of 100,000 or more entries as in the previous model. Because the neural network extracts these features for us, very little feature engineering is needed by hand. However, since texts vary in length, we may need to take the average of all word vectors in a text as the input to the classification algorithm in order to classify a whole document.
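With gensim, such analogy queries can be run through `most_similar`; a sketch, assuming the pre-trained Google News vectors discussed in a later section are available on disk (the file name is an assumption):

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors (assumed file name for the Google News vectors).
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "king" - "man" + "woman" should rank "queen" near the top.
print(word_vectors.most_similar(positive=["king", "woman"],
                                negative=["man"], topn=3))
```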
However, even when the word vectors are averaged as above, we still ignore the effect of word order on sentiment. Quoc Le and Tomas Mikolov proposed the doc2vec method as a way of summarising texts of variable length. This method is almost identical to word2vec except for the addition of a paragraph vector. Like word2vec, the model comes in two flavours: distributed memory (DM) and distributed bag of words (DBOW). DM tries to predict the probability of a word given its context and the paragraph vector. While training on a sentence or document, the paragraph ID stays fixed and the document shares a single paragraph vector. DBOW predicts the probability of a random set of words in the paragraph given only the paragraph vector (as shown in Figure 2).
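A minimal doc2vec sketch with gensim (the toy documents and parameters are illustrative; accessor names follow recent gensim releases, where the `dm` flag selects the variant):

```python
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Toy corpus: each document gets a unique tag (the "paragraph ID").
docs = [TaggedDocument(words=["this", "movie", "was", "great"], tags=["doc_0"]),
        TaggedDocument(words=["a", "truly", "terrible", "film"], tags=["doc_1"])]

# dm=1 selects the distributed-memory (DM) model,
# dm=0 selects the distributed bag-of-words (DBOW) model.
dm_model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, dm=1)
dbow_model = Doc2Vec(docs, vector_size=100, min_count=1, dm=0)

print(dm_model.dv["doc_0"][:5])  # first dimensions of the learned paragraph vector
```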
Once trained, these paragraph vectors can be fed into a sentiment classifier without the need to aggregate individual words. This method is currently state of the art: when used to classify the IMDB movie review data set, the model achieves an error rate of only 7.42%. Of course, none of this matters if we cannot actually put it into practice. Fortunately, optimised versions of word2vec and doc2vec are available in gensim (a Python library).
In this section we show how to use word vectors in a sentiment classification project. The gensim library ships with the Anaconda distribution and can also be installed via pip. From there, you can train word vectors on your own corpus (a text data set) or import pre-trained word vectors from a text- or binary-format file.
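A sketch of both options with gensim (the corpus, file names, and parameter values are illustrative assumptions):

```python
from gensim.models import Word2Vec, KeyedVectors

# Option 1: train word vectors on your own corpus (a list of tokenized texts).
corpus = [["some", "tokenized", "sentence"], ["another", "tokenized", "sentence"]]
model = Word2Vec(corpus, vector_size=300, window=5, min_count=1)
model.wv.save_word2vec_format("my_vectors.txt", binary=False)   # text format
model.wv.save_word2vec_format("my_vectors.bin", binary=True)    # binary format

# Option 2: import previously trained vectors from a text- or binary-format file.
vectors = KeyedVectors.load_word2vec_format("my_vectors.bin", binary=True)
print(vectors["sentence"][:5])
```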
I find it very useful to build models on Google's pre-trained word vectors, which were trained on Google News data (about 100 billion words). Note that the uncompressed file is about 3.5 GB. Using Google's word vectors we can see some interesting relationships between words. Interestingly, we can also find grammatical relationships, such as identifying superlatives or verb tenses: "biggest" - "big" + "small" = "smallest", "ate" - "eat" + "speak" = "spoke". These examples show that word2vec recognises important relationships between words, which makes it very useful for many NLP projects, including our sentiment analysis case.

Before applying it to sentiment analysis, let us test word2vec's ability to separate categories of words. We will use a sample set of words from three categories, food, sports, and weather, which can be downloaded from the Enchanted Learning website. Because the vectors are 300-dimensional, we use the t-SNE dimensionality reduction algorithm from scikit-learn to project them into two dimensions for visualisation. First we obtain the word vectors for each word, then we use t-SNE and matplotlib to visualise the categories; a sketch of this step follows below. As the resulting figure shows, word2vec does a good job of separating unrelated words while clustering related ones together.
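A sketch of the visualisation step (the short word lists below are placeholders for the full Enchanted Learning lists, and the Google News file name is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Small illustrative word lists; the full lists come from the Enchanted Learning site.
categories = {
    "food":    ["bread", "cheese", "apple", "soup", "rice"],
    "sports":  ["soccer", "tennis", "hockey", "golf", "boxing"],
    "weather": ["rain", "snow", "storm", "sunshine", "fog"],
}

words, labels = [], []
for label, word_list in categories.items():
    for w in word_list:
        if w in word_vectors:            # keep only words present in the vocabulary
            words.append(w)
            labels.append(label)

# Stack the 300-dimensional vectors and reduce them to 2-D with t-SNE.
X = np.vstack([word_vectors[w] for w in words])
X_2d = TSNE(n_components=2, random_state=0, perplexity=5).fit_transform(X)

for label, color in zip(categories, ["red", "green", "blue"]):
    idx = [i for i, l in enumerate(labels) if l == label]
    plt.scatter(X_2d[idx, 0], X_2d[idx, 1], c=color, label=label)
plt.legend()
plt.show()
```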
Sentiment analysis of emoticon tweets

Now we will analyse the sentiment of tweets labelled with emoticons. We use emoticons to give our data fuzzy labels: a smiley face :-) indicates positive sentiment and a frowny face :-( indicates negative sentiment. The total of 400,000 tweets is split into a positive group and a negative group. We randomly sample from these two sets to build a training set and a test set in an 80/20 ratio. We then train a word2vec model on the training data, and the input to the classifier is the average of all word vectors in a tweet. We can use scikit-learn to build many kinds of machine learning models.

First we import the data and build the word2vec model; then, to obtain the average of all word vectors in a tweet, we use a small helper function that turns each tweet into a single input vector (a sketch of the whole pipeline follows below). Scaling the features is part of data standardisation: we transform the data so that it follows a zero-mean Gaussian distribution, meaning values above the mean are positive and values below it are negative. Many machine learning models require the data to be scaled beforehand to work well, especially models with many features such as text classifiers. Finally, we build the test set vectors and scale them in the same way.

Next we validate the classifier by computing the prediction accuracy and the ROC curve on the test set. An ROC curve plots the true positive rate against the false positive rate as a model parameter is varied; in our case we vary the cut-off probability threshold of the classifier. Generally speaking, the larger the area under the ROC curve (AUC), the better the model. You can find more about ROC curves at https://en.wikipedia.org/wiki/Receiver_operating_characteristic. Here we use logistic regression trained with stochastic gradient descent as the classifier, and we build the ROC curve with matplotlib and scikit-learn's metrics module; the resulting curve is shown in the figure.

Without engineering any features and with minimal text preprocessing, this simple linear model built with scikit-learn achieves a prediction accuracy of 73%. Interestingly, removing punctuation affects the accuracy, which suggests that the word2vec model can extract information from the symbols in a document. Processing individual words, training for longer, doing more preprocessing, and tuning the model parameters can all improve accuracy. I found that an artificial neural network (ANN) model improved accuracy by 5%. Note that scikit-learn does not provide an ANN classifier implementation, so I used my own custom library; the resulting classification accuracy is 77%. For any machine learning project, choosing the right model is usually more of an art than a science. If you want to use my custom library, you can find it on my GitHub page, but it is messy and not regularly maintained! If you want to contribute, please feel free to fork my project.
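A sketch of the pipeline described above, not the author's exact code. It assumes `x_train`/`x_test` are lists of tokenized tweets and `y_train`/`y_test` the corresponding 1 (positive) / 0 (negative) labels; parameter values are illustrative, and the `"log_loss"` loss name follows recent scikit-learn releases (older versions call it `"log"`):

```python
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.preprocessing import scale
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_curve, auc

n_dim = 300

# Build a word2vec model on the training tweets.
tweet_w2v = Word2Vec(x_train, vector_size=n_dim, min_count=10)

def build_tweet_vector(tokens, size):
    """Average the vectors of all in-vocabulary words in a tweet."""
    vec = np.zeros(size)
    count = 0
    for word in tokens:
        if word in tweet_w2v.wv:
            vec += tweet_w2v.wv[word]
            count += 1
    return vec / count if count else vec

# Build and scale (zero-mean, unit-variance) the train and test matrices.
train_vecs = scale(np.vstack([build_tweet_vector(t, n_dim) for t in x_train]))
test_vecs = scale(np.vstack([build_tweet_vector(t, n_dim) for t in x_test]))

# Logistic regression trained with stochastic gradient descent.
clf = SGDClassifier(loss="log_loss", penalty="l1")
clf.fit(train_vecs, y_train)
print("Test accuracy: %.3f" % clf.score(test_vecs, y_test))

# ROC curve from the predicted probabilities of the positive class.
probs = clf.predict_proba(test_vecs)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="AUC = %.3f" % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], "--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```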
Conclusion

I hope this has shown the utility and convenience of word2vec and doc2vec. With a very simple algorithm we can obtain rich word vectors and paragraph vectors that can be applied to all kinds of NLP applications. Even better, Google has released its pre-trained word vectors, which are based on a huge data set that would be hard for anyone else to obtain. If you want to train your own vectors on a big data set, there is now a word2vec implementation in Apache Spark MLlib (https://spark.apache.org/mllib/).
Original author: Michael Czerny. Translated by fibears.