Hacking Book | Free Online Hacking Learning


application of similarity between idf and cosine (1): automatic keyword extraction

Posted by harmelink at 2020-03-16

This title seems to be very complicated. Actually, I want to talk about a very simple problem.

There is a long article. I want to extract its key words by computer without any human intervention. How can I do it correctly?

This problem involves data mining, text processing, information retrieval and many other computer cutting-edge fields, but unexpectedly, there is a very simple classical algorithm, which can give quite satisfactory results. It's so simple that it doesn't need advanced mathematics. Ordinary people can understand it in only 10 minutes. This is the TF-IDF algorithm I want to introduce today.

Let's start with an example. Suppose there is a long article "bee farming in China", we are going to extract its key words by computer.

One easy way to think about it is to find the most frequently used words. If a word is important, it should appear many times in this article. Therefore, we conduct "term frequency" (abbreviated to TF) statistics.

As a result, you must have guessed that the most frequently used words are "de", "yes" and "Zai". They are called stop words, which means words that are not helpful for finding results and must be filtered out.

Suppose we filter them all out, and only consider the remaining words with practical meaning. In this way, we will encounter another problem. We may find that the three words "China", "bee" and "breeding" appear as often. Does this mean that, as keywords, they are of the same importance?

Obviously not. Because "China" is a very common word, relatively speaking, "bees" and "farming" are not so common. If these three words appear as often in an article, it is reasonable to think that "bee" and "breeding" are more important than "China", that is to say, in terms of keyword ranking, "bee" and "breeding" should be in front of "China".

Therefore, we need an importance adjustment coefficient to measure whether a word is a common word. If a word is rare, but it appears many times in this article, it is likely to reflect the characteristics of this article, which is exactly the key words we need.

To express in statistical language is to assign a weight of "importance" to each word on the basis of word frequency. The most common words ("de", "yes", "de") are given the minimum weight, the more common words ("China") are given the smaller weight, and the less common words ("bee", "farming") are given the larger weight. This weight is called "inverse document frequency" (IDF for short), and its size is inversely proportional to the common degree of a word.

When you know the word frequency (TF) and the inverse document frequency (IDF), multiply the two values to get the TF-IDF value of a word. The more important a word is to an article, the more TF-IDF it has. So, the first few words are the key words of this article.

Here are the details of the algorithm.

The first step is to calculate word frequency.

Considering the length of the article, in order to facilitate the comparison of different articles, "word frequency" is standardized.


The second step is to calculate the inverse document frequency.

At this time, a corpus is needed to simulate the language environment.

The more common a word is, the larger the denominator is, and the smaller the inverse document frequency is, the closer to zero. The reason for adding 1 to the denominator is to avoid that the denominator is 0 (that is, all documents do not contain the word). Log means to take the logarithm of the obtained value.

The third step is to calculate TF-IDF.

As you can see, TF-IDF is directly proportional to the number of occurrences of a word in the document and inversely proportional to the number of occurrences of that word in the entire language. Therefore, the algorithm of automatically extracting keywords is very clear, which is to calculate the TF-IDF value of each word in the document, and then arrange them in descending order to take the first few words.

Take "bee culture in China" as an example. If the length of this article is 1000 words, "China", "bee" and "culture" appear 20 times respectively, then the word frequency (TF) of these three words is 0.02. Then, Google found that there are 25 billion pages containing the word "de", assuming that this is the total number of Chinese pages. There are 6.23 billion pages including "China", 48.4 million pages including "bee", and 97.3 million pages including "breeding". Then their inverse document frequency (IDF) and TF-IDF are as follows:

It can be seen from the above table that the TF-IDF value of "bee" is the highest, that of "breeding" is the second, and that of "China" is the lowest. (if the TF-IDF of the word "de" is also calculated, it will be a value very close to 0.) So, if you choose only one word, "bee" is the key word of this article.

In addition to automatically extracting keywords, TF-IDF algorithm can be used in many other places. For example, in information retrieval, for each document, TF-IDF of a set of search terms ("China", "bee", "aquaculture") can be calculated separately, and the TF-IDF of the whole document can be obtained by adding them. The document with the highest value is the document most relevant to the search term.

TF-IDF algorithm is simple and fast, and the results are in line with the actual situation. The disadvantage is that simply measuring the importance of a word by "word frequency" is not comprehensive enough. Sometimes important words may not appear many times. Moreover, this algorithm can't reflect the position information of words. The words in front of the position and the words in the back of the position are regarded as the same importance, which is not correct. One solution is to give more weight to the first paragraph and the first sentence of each paragraph

Next time, I'll use TF-IDF with cosine similarity to measure the similarity between documents.