

Doing Things with HowNet in the Era of Deep Learning

Posted by herskovits at 2020-03-17

At the end of December 2017, Academician Zhang Bo of Tsinghua University delivered an excellent invited report entitled "On the Eve of an AI Scientific Breakthrough, What Should Professors See?". He believes that processing knowledge is what humans are good at, while processing data is what computers are good at; if the two can be combined, we will be able to build systems more intelligent than humans. He therefore proposed that AI's future scientific breakthrough will be to build AI systems driven by both knowledge and data.

I fully agree with Professor Zhang Bo's view. Over the past year we have also made some attempts in this direction: we integrated HowNet's sememe annotations into deep learning models for NLP and obtained some interesting results, which I organize and share here.

What is HowNet?

HowNet is a large-scale language knowledge base annotated over the past 30 years by Mr. Dong Zhendong and his son Mr. Dong Qiang, covering mainly Chinese (and English) words and concepts [1]. HowNet adopts a reductionist view: every word sense can be described by a set of smaller semantic units. Such a semantic unit is called a sememe; as the name implies, it is atomic meaning, i.e. the most basic, minimal semantic unit that should not be subdivided further. In the course of annotation, HowNet gradually built a refined sememe inventory (about 2,000 sememes) and, based on it, has annotated the sememes of hundreds of thousands of word senses.

For example, the word "顶点" (vertex/apex) has two representative senses in HowNet, annotated as follows: each "xx|yy" is a sememe, with English on the left of the "|" and Chinese on the right; semantic relations such as host and modifier are also annotated between the sememes, so that the meaning of each sense is represented precisely.
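To make the annotation scheme concrete, here is a minimal sketch of a HowNet-style lexicon, assuming a simplified view in which each word maps to a list of senses and each sense to an unordered set of "English|Chinese" sememes. The sememe names below are invented for illustration and are not HowNet's actual annotation; the real format also records relations (host, modifier, etc.) between sememes, which this sketch omits.

```python
# Hypothetical, simplified HowNet-style lexicon: the sememes shown are
# illustrative placeholders, not real HowNet annotations.
lexicon = {
    "vertex": [
        {"location|位置", "boundary|界限"},  # sense 1: highest point
        {"dot|点", "image|图像"},            # sense 2: corner of a geometric shape
    ],
}

def sememes_of(word, lex):
    """Union of the sememes across all senses of a word."""
    out = set()
    for sense in lex.get(word, []):
        out |= sense
    return out
```

A sense is treated here as a plain set, which matches the simplification used later in our representation-learning work, where relational structure among sememes is ignored.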

Knowledge base resources have always played an important role in NLP. The most famous in the English-speaking world is WordNet, which encodes the semantics of words and senses through sets of synonyms (synsets). HowNet takes a path different from WordNet's, and can be regarded as one of the most distinctive contributions Chinese scholars have made to NLP. Around 2000, HowNet aroused great research enthusiasm in the NLP community, which explored its value in word similarity computation, text classification, information retrieval, and other applications [2, 3], mirroring the international exploration of WordNet's applications at the time.

What use is HowNet in the era of deep learning?

In the era of deep learning, people have found that large-scale text data alone can yield good semantic representations of words. For example, word representation learning methods exemplified by word2vec [4] use low-dimensional (typically a few hundred dimensions), dense, real-valued vectors to represent the semantics of each word or sense. These are known as distributed representations, or embeddings, and are learned automatically from word contexts in large-scale text. With such vectors we can conveniently compute word and sense similarity, often with better results than traditional methods based on language knowledge bases. Partly for this reason, academic attention to HowNet and WordNet has declined markedly in recent years, as shown in the following two figures.
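The similarity computation mentioned above is usually the cosine of the angle between two embedding vectors. A minimal sketch, with hand-written toy vectors standing in for embeddings that would really be learned from large-scale text:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "embeddings"; real word2vec vectors have hundreds of
# dimensions and are trained, not written by hand.
emb = {
    "king":  [0.8, 0.1, 0.7, 0.2],
    "queen": [0.7, 0.2, 0.8, 0.3],
    "apple": [0.1, 0.9, 0.0, 0.6],
}
```

With these toy vectors, `cosine(emb["king"], emb["queen"])` is much larger than `cosine(emb["king"], emb["apple"])`, which is exactly the property that makes embeddings useful for lexical similarity tasks.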

So, are language knowledge bases such as WordNet and HowNet useless in the era of deep learning? Not at all. In fact, as early as the year after word2vec was proposed, our work [5], the ACL 2015 best student paper [6], and others found that integrating WordNet knowledge into word representation learning can effectively improve the resulting word embeddings.

Although most NLP deep learning models currently leave no place for language knowledge bases, characteristics of deep learning such as data hunger and black-box behavior mean that its development is hitting bottlenecks that are hard to break through. Recalling Academician Zhang Bo's point, we firmly believe that AI's future scientific breakthrough will be to build AI systems driven by both knowledge and data. Seen in this light, the key questions for NLP deep learning models are what knowledge to use and how to use it.

In terms of natural language understanding, HowNet is closer to the nature of language. Words in natural language are typical symbolic information carrying rich semantics. A word is the smallest unit of language use, but not the smallest unit of meaning. HowNet's sememe annotation scheme is precisely an important means of breaking through the word barrier and reaching the rich semantics behind words.

In terms of integration with learning models, HowNet has unique advantages. In WordNet, synonym thesauri, and similar knowledge bases, the meaning of each word is reflected only indirectly, through synonym sets and glosses; the specific meaning of each sense lacks a precise, fine-grained description and explicit quantitative information, and so cannot be readily used by computers. HowNet, by contrast, describes word-sense semantics directly and precisely through a unified sememe inventory; each sememe has a clear, fixed meaning and can be fed into machine learning models directly as a semantic label.

Perhaps because HowNet adopts a paid-licensing policy and is mainly oriented toward the Chinese-speaking world, it has gradually faded from view in recent years. However, as my understanding of HowNet has deepened, and with our recent successful attempts to integrate it with deep learning models, I have come to believe that HowNet's knowledge system and its underlying ideas will shine brilliantly in the era of deep learning.

Our attempt

Recently we explored three tasks, word representation learning, sememe recommendation for new words, and lexicon expansion, and verified the effectiveness of integrating HowNet with deep learning models.

1. Word representation learning incorporating sememe knowledge

We considered integrating sememe knowledge into word representation learning models. As early as 2016, Professor Sun Maosong of our group carried out research along this line; the work, "Word and Sense Vector Representation with the Help of a Knowledge Base: A Case Study of HowNet", was published at the China National Conference on Computational Linguistics (CCL 2016) and in the Journal of Chinese Information Processing [7]. Our ACL 2017 work is a further attempt in this direction. In it, we organize HowNet's sememe annotations into the word-sense-sememe structure shown in the figure below. Note that, to simplify the model, we ignore the relational structure among sememes and treat each sense's sememe annotation as an unordered set.

Building on the skip-gram model in word2vec, we propose the SAT (Sememe Attention over Target) model. Whereas skip-gram considers only context information, SAT also uses sememe information to help the model "understand" words better. Concretely, the model disambiguates the target word by its context words: an attention mechanism computes the context's weight over each sense of the word, and the word vector is then the attention-weighted average of the sense embeddings. Experiments on word similarity and word analogy show that incorporating sememe information into word representation learning effectively improves the quality of word vectors.
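The attention step described above can be sketched as follows. This is a toy illustration of the general idea, not the trained SAT model: each sense embedding is weighted by the softmax of its dot product with a context vector, and the weighted average serves as the target word's representation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend_senses(context_vec, sense_embs):
    """SAT-style sketch: score each sense embedding against the context
    vector, softmax the scores into attention weights, and return the
    weighted average of the sense embeddings."""
    scores = [sum(c * s for c, s in zip(context_vec, emb)) for emb in sense_embs]
    weights = softmax(scores)
    dim = len(sense_embs[0])
    return [sum(w * emb[i] for w, emb in zip(weights, sense_embs))
            for i in range(dim)]
```

When the context clearly favors one sense, the output vector collapses toward that sense's embedding, which is exactly the soft word-sense disambiguation the model relies on.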

2. Sememe recommendation for new words based on word representations

Having verified the complementarity between distributed representation learning and the sememe knowledge base, we further asked whether word representation models could recommend sememes for new words and thereby assist knowledge-base annotation. To this end, we explored matrix factorization and collaborative filtering methods.

The matrix factorization method first learns word vectors from large-scale text, builds a word-sememe matrix from the annotations of existing words, and factorizes it to obtain sememe vectors that live in the same space as the word vectors; given a new word, its word vector is then used to recommend sememes. The collaborative filtering method uses word vectors to automatically find the words most similar to the new word, and recommends the sememes of those neighbors. Experiments show that combining matrix factorization and collaborative filtering can effectively recommend sememes for new words, and can even surface annotation inconsistencies in the HowNet knowledge base. This technique should help improve both the efficiency and the quality of HowNet annotation.
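The collaborative filtering side of this pipeline can be sketched in a few lines. This is a simplified toy, assuming hand-written vectors and invented sememe labels: find the k nearest neighbors of the new word in embedding space, then score each neighbor's sememes by similarity and return them ranked.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def recommend_sememes(new_vec, emb, annotations, k=2):
    """Collaborative-filtering sketch: take the k words most similar to
    the new word, add up their sememes weighted by similarity, and return
    the sememes ranked by total score (highest first)."""
    neighbours = sorted(emb, key=lambda w: cosine(new_vec, emb[w]),
                        reverse=True)[:k]
    scores = {}
    for w in neighbours:
        sim = cosine(new_vec, emb[w])
        for s in annotations.get(w, ()):
            scores[s] = scores.get(s, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: two animal words, one vehicle word (labels are illustrative).
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
annotations = {"cat": {"animal|动物"}, "dog": {"animal|动物"},
               "car": {"vehicle|交通工具"}}
```

For a new word whose vector lies near "cat" and "dog", the top recommendation is "animal|动物", matching the intuition that similar words share sememes.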

3. Lexicon expansion based on word representations and sememe knowledge

Recently we tried using word representation learning and the HowNet knowledge base to expand a lexicon. The lexicon expansion task aims to automatically find more related words given the words already in a lexicon, and can be cast as a word classification problem. We studied the well-known Chinese version of the LIWC (Linguistic Inquiry and Word Count) lexicon, widely used in social science research, in which each word is labeled with hierarchical psychological categories. We learn distributed word vectors from large-scale text, train classifiers using the LIWC words as training data, and build a sememe attention mechanism from HowNet's sememe annotations. Experiments show that introducing sememe information significantly improves hierarchical word classification.
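The classification view of lexicon expansion can be illustrated with a deliberately simple stand-in classifier. The actual work trains classifiers with sememe attention; the sketch below, with invented toy vectors, only shows the task shape: a candidate word joins a category when its embedding is close enough to the centroid of the category's seed words.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def expand_lexicon(seed_words, emb, candidates, threshold=0.9):
    """Nearest-centroid sketch of lexicon expansion: accept a candidate
    when its vector's cosine similarity to the seed centroid exceeds the
    threshold."""
    c = centroid([emb[w] for w in seed_words])
    return [w for w in candidates if cosine(emb[w], c) >= threshold]

# Toy embeddings: three "positive emotion" words and one unrelated word.
emb = {"happy": [1.0, 0.1], "glad": [0.9, 0.2],
       "joyful": [0.95, 0.1], "table": [0.0, 1.0]}
```

Here `expand_lexicon(["happy", "glad"], emb, ["joyful", "table"])` keeps "joyful" and rejects "table", which is the behavior a trained classifier should reproduce at scale.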

P.S. It is worth mentioning that these three works were completed mainly by undergraduates (Niu Yilin, Yuan Xingchi, Zeng Xiangkai). The models are very simple, yet all three papers were accepted on first submission, by ACL, IJCAI, and AAAI respectively, which also reflects the international community's recognition of this technical route.

Outlook

The three tasks above only preliminarily verify the important role the HowNet language knowledge base can play in the era of deep learning. At the end of "Word and Sense Vector Representation with the Help of a Knowledge Base: A Case Study of HowNet" [7], Professor Sun Maosong offers an insightful discussion of this technical route.

How can we integrate human knowledge, as represented by the HowNet language knowledge base, with data-driven models, as represented by deep learning? Many important open questions remain to be explored and answered. I believe the following directions hold great value for exploration:

In short, the HowNet knowledge base is a treasure that has been overlooked since the arrival of deep learning, and it may hold a key to breaking through many bottlenecks of NLP deep learning models. There is much we can do with HowNet in the era of deep learning: a broad world with great potential!