At the end of December 2017, Academician Zhang Bo of Tsinghua University gave an excellent invited talk entitled "On the Eve of a Scientific Breakthrough in AI, What Should Professors Do?". He argued that processing knowledge is what humans are good at, while processing data is what computers are good at; if the two can be combined, we will be able to build systems more intelligent than humans. He therefore proposed that the future scientific breakthrough of AI lies in building AI systems driven by both knowledge and data.
I fully agree with Professor Zhang Bo's view. Over the past year, we have made some attempts in this direction: we integrated the sememe annotations of HowNet into deep learning models for NLP and obtained some interesting results, which I organize and share here.
What is HowNet
HowNet is a large-scale language knowledge base built and annotated over the past 30 years by Mr. Dong Zhendong and his son Mr. Dong Qiang, covering mainly Chinese (and English) words and concepts [1]. HowNet takes a reductionist view, holding that every word sense can be described by smaller semantic units. Such a unit is called a sememe: as the name suggests, it is an atom of meaning, the most basic semantic unit that cannot usefully be subdivided further. In the course of annotation, HowNet has gradually established a refined inventory of about 2,000 sememes, and has used it to annotate the sememe information of hundreds of thousands of words and word senses.
For example, the word "vertex" has two representative senses in HowNet, annotated as follows. Each "xx|yy" pair denotes one sememe, with English on the left of the "|" and Chinese on the right; in addition, complex semantic relations such as host and modifier are annotated between the sememes, so that the semantics of each word sense can be represented precisely.
Knowledge-base resources have always played an important role in NLP. In the English-speaking world, the most famous is WordNet, which annotates the semantics of words and senses in the form of synonym sets (synsets). HowNet takes a distinctly different approach from WordNet, and can fairly be called the most distinctive and outstanding contribution of Chinese scholars to NLP. Around the year 2000, HowNet aroused great research enthusiasm in the NLP community, with work exploring its value in word similarity computation, text classification, information retrieval, and so on [2, 3], paralleling the contemporaneous international exploration of WordNet applications.
What is the use of HowNet in the era of deep learning
In the era of deep learning, people have found that large-scale text data alone can yield good semantic representations of words. Word representation learning methods exemplified by word2vec [4] use low-dimensional (typically a few hundred dimensions), dense, real-valued vectors to represent the semantics of each word; such vectors are known as distributed representations, or embeddings, and are learned automatically from word contexts in large-scale text. With these vectors, word similarity can be computed conveniently, often with better results than traditional methods based on language knowledge bases. For this reason, academic attention to HowNet and WordNet has declined markedly in recent years, as shown in the two figures below.
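To make this concrete, here is a minimal sketch of similarity computation with word vectors. The words, dimensions, and values are toy stand-ins invented for illustration; real word2vec embeddings are learned from large corpora and have a few hundred dimensions:

```python
import numpy as np

# Toy 4-dimensional "embeddings" standing in for real word2vec vectors.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9, 0.4]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words end up with higher cosine similarity.
sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_related > sim_unrelated)  # prints True
```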
So are language knowledge bases such as WordNet and HowNet useless in the era of deep learning? Not at all. In fact, as early as one year after word2vec was proposed, our work [5], the ACL 2015 best student paper [6], and others found that integrating WordNet knowledge into word representation learning can effectively improve the quality of word representations.
Although most NLP deep learning models currently leave no place for language knowledge bases, characteristics of deep learning such as data hunger and black-box opacity are driving its development toward a bottleneck that is hard to break through. Recalling Academician Zhang Bo's view, we firmly believe that the future scientific breakthrough of AI lies in building AI systems driven by both knowledge and data. Seen in this light, the key questions for NLP deep learning models are what knowledge to use and how to use it.
For natural language understanding, HowNet comes closer to the nature of language. Words in natural language are typical symbolic information carrying rich semantics; a word is the smallest unit of language use, but not the smallest unit of meaning. HowNet's sememe annotation system is precisely a means of breaking through the word barrier to reach the rich semantic information behind words.
HowNet also has unmatched advantages for integration into learning models. In knowledge bases such as WordNet and synonym thesauri, the meaning of each word is reflected only indirectly, through synonym sets and glosses; the specific meaning of a word lacks a precise, fine-grained description and explicit, quantifiable information, and so cannot be exploited well by computers. HowNet, by contrast, describes word-sense semantics directly and precisely through a unified sememe annotation scheme, and each sememe has a clear, fixed meaning, so sememes can be fed into machine learning models directly as semantic labels.
Perhaps because HowNet adopts a paid licensing policy and is oriented mainly toward the Chinese-language world, it has gradually faded from view in recent years. However, as my understanding of HowNet has deepened, and with our recent successful attempts to integrate it with deep learning models, I have come to believe that HowNet's linguistic knowledge system and philosophy will shine brightly in the era of deep learning.
Our attempts
Recently, we have explored word representation learning, sememe prediction for new words, and lexicon expansion, verifying in each case the effectiveness of integrating HowNet with deep learning models.
1. Word representation learning incorporating sememe knowledge
We consider integrating sememe knowledge into word representation learning models. As early as 2016, Professor Sun Maosong of our group initiated this line of research; the work, titled "Embeddings for Words and Word Senses Based on a Human-Annotated Knowledge Base: A Case Study of HowNet," was published at the China National Conference on Computational Linguistics (CCL 2016) and in the Journal of Chinese Information Processing [7]. Our ACL 2017 work is a further attempt in this direction. In it, we model HowNet's sememe annotations as the word-sense-sememe structure shown in the figure below. Note that, to simplify the model, we ignore the structural information among the sememes of a sense, treating each sense's sememe annotation as an unordered set.
Building on the skip-gram model in word2vec, we propose the SAT (Sememe Attention over Target) model. Whereas skip-gram considers only context information, SAT also exploits sememe information to help the model "understand" words better. Concretely, the model disambiguates the target word by its context: an attention mechanism computes a weight for each sense of the target word given the context words, and the word vector is then represented as the weighted average of its sense embeddings. Experiments on word similarity computation and word analogy show that incorporating sememe information into word representation learning effectively improves word vectors.
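The sense-attention step can be sketched roughly as follows, under simplifying assumptions: toy vectors, a single context vector, and plain dot-product attention. The real SAT model learns sense and context embeddings jointly during skip-gram training; this sketch only illustrates how attention weights over senses produce a context-dependent word vector:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attended_word_vector(context_vec, sense_vecs):
    """Weight each sense of the target word by its dot-product affinity
    with the context, then return the weighted average of sense embeddings."""
    scores = np.array([context_vec @ s for s in sense_vecs])
    weights = softmax(scores)
    return weights @ np.stack(sense_vecs)

# Toy example: a word with two senses; the context points to sense 0.
context = np.array([1.0, 0.0, 0.0])
senses = [np.array([0.9, 0.1, 0.0]),   # sense aligned with the context
          np.array([0.0, 0.2, 0.9])]   # unrelated sense
vec = attended_word_vector(context, senses)
```

The resulting `vec` lies closer to the first sense embedding, i.e. the context has softly disambiguated the word.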
2. Sememe prediction for new words based on word representations
Having verified the complementarity between distributed representation learning and sememe knowledge bases, we further asked whether word representation learning models could be used to recommend sememes for new words and thus assist knowledge-base annotation. To realize sememe recommendation, we explored matrix factorization and collaborative filtering methods.
The matrix factorization method first learns word vectors from large-scale text, then builds a word-sememe matrix from the sememe annotations of existing words, and factorizes it to obtain sememe vectors aligned with the word vectors; given a new word, its word vector is used to recommend sememes. The collaborative filtering method uses word vectors to automatically find the words most similar to the given new word, then recommends the sememes of those similar words. Experiments show that combining matrix factorization and collaborative filtering recommends sememes for new words effectively, and can to some extent uncover inconsistencies in the HowNet knowledge base. This technique should help improve the efficiency and quality of HowNet annotation.
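The collaborative-filtering idea can be sketched as follows. The words, vectors, and sememe sets below are invented for illustration and are not taken from HowNet; the real system uses embeddings trained on large corpora and HowNet's actual annotations:

```python
import numpy as np
from collections import Counter

# Toy embeddings and sememe annotations for a few already-annotated words.
embeddings = {
    "laptop":   np.array([0.9, 0.1]),
    "computer": np.array([0.85, 0.2]),
    "banana":   np.array([0.1, 0.9]),
}
sememes = {
    "laptop":   {"computer", "portable"},
    "computer": {"computer", "machine"},
    "banana":   {"fruit", "food"},
}

def recommend_sememes(new_vec, k=2, top=3):
    """Score each sememe by the similarity of the annotated words that
    carry it to the new word, using the k nearest annotated neighbors."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    neighbors = sorted(((cos(new_vec, v), w) for w, v in embeddings.items()),
                       reverse=True)[:k]
    scores = Counter()
    for sim, word in neighbors:
        for s in sememes[word]:
            scores[s] += sim
    return [s for s, _ in scores.most_common(top)]

# A hypothetical new word whose vector lies near "laptop"/"computer":
ranked = recommend_sememes(np.array([0.88, 0.15]))
```

Because both nearest neighbors carry the sememe "computer", it accumulates the highest score and is recommended first.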
3. Lexicon expansion based on word representations and sememe knowledge
Recently, we have tried to use word representation learning and the HowNet knowledge base to expand lexicons. The lexicon expansion task aims to automatically find more related words given the words already in a lexicon, and can be treated as a word classification problem. We chose the well-known Chinese version of the LIWC (Linguistic Inquiry and Word Count) lexicon from social psychology for this study; each word in Chinese LIWC is labeled with hierarchical psychological categories. We learn distributed vector representations of words from large-scale text, train classifiers using the LIWC words as training data, and build a sememe attention mechanism from the sememe annotations provided by HowNet. Experiments show that introducing sememe information significantly improves hierarchical word classification.
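The hierarchical classification setup can be sketched roughly as below, with a nearest-centroid classifier standing in for the trained classifiers. The words, vectors, and two-level category labels are toy examples, not the actual Chinese LIWC categories:

```python
import numpy as np

# Toy word vectors with two-level labels, e.g. ("affect", "pos").
train = {
    "happy": (np.array([0.9, 0.1]), ("affect", "pos")),
    "glad":  (np.array([0.8, 0.2]), ("affect", "pos")),
    "sad":   (np.array([0.1, 0.9]), ("affect", "neg")),
    "angry": (np.array([0.2, 0.8]), ("affect", "neg")),
}

def centroids_at_depth(depth):
    """Mean vector of the training words sharing each label prefix."""
    groups = {}
    for vec, path in train.values():
        groups.setdefault(path[:depth + 1], []).append(vec)
    return {p: np.mean(vs, axis=0) for p, vs in groups.items()}

def classify(vec, depth=2):
    """Pick the nearest centroid at each level, restricted to children
    of the label chosen at the previous level."""
    path = ()
    for d in range(depth):
        cands = {p: c for p, c in centroids_at_depth(d).items() if p[:d] == path}
        path = min(cands, key=lambda p: np.linalg.norm(vec - cands[p]))
    return path

# A new word close to "happy"/"glad" lands in the affect/pos category.
label = classify(np.array([0.85, 0.15]))
```

Restricting each level's candidates to children of the previous choice is what makes the classification hierarchical rather than flat.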
P.S. It is worth mentioning that these three works were completed mainly by undergraduate students (Niu Yilin, Yuan Xingchi, and Zeng Xiangkai). The models are quite simple, yet all three papers were accepted at ACL, IJCAI, and AAAI on their first submission, which also reflects the international community's recognition of this technical route.
Future outlook
The three studies above only preliminarily verify the important role that the HowNet language knowledge base can play in certain tasks in the era of deep learning. At the end of "Embeddings for Words and Word Senses Based on a Human-Annotated Knowledge Base: A Case Study of HowNet" [7], Professor Sun Maosong gives a brilliant discussion of this technical route.
How should we integrate human knowledge, as represented by the HowNet language knowledge base, with data-driven models, as represented by deep learning? Many important open questions remain to be explored and answered. I believe the following directions are of great value:
- Current work remains at the lexical level, so the application of HowNet knowledge is still quite limited. How to effectively integrate the HowNet sememe knowledge base into language models such as RNNs and LSTMs, and to verify its effectiveness in application tasks such as question answering and machine translation, is of important research value. Whether the structural information of sememe annotations should be taken into account also deserves exploration.
- After decades of careful annotation, the HowNet knowledge base has reached a considerable scale, but in a fast-changing information age its coverage of open-domain vocabulary remains insufficient. We need to explore more accurate automatic sememe prediction techniques for new words, so that computers can assist human experts in annotating the knowledge base more promptly and efficiently. Moreover, given HowNet's large scale and long annotation history, labeling inconsistencies are inevitable and can seriously affect the models built on it; algorithms that assist human experts with consistency checking and quality control of the knowledge base are therefore also needed.
- HowNet's sememe inventory is the crystallization of the experts' continual reflection and summarization during annotation. But the sememe system is neither immutable nor perfect: it should evolve over time and expand as our understanding of language develops. We need to explore a combination of data-driven and expert-driven means to optimize and extend the sememe system, so that it better serves the needs of natural language processing.
In short, the HowNet knowledge base is a treasure that has been overlooked since the arrival of the deep learning era. It may well hold a key to breaking through many bottlenecks of NLP deep learning models. There is a broad world of things to do with HowNet in the era of deep learning, and great potential to realize!
References
- Official introduction of HowNet.
- Liu Qun, Li Sujian. Lexical semantic similarity computation based on HowNet. Chinese Computational Linguistics 7(2), 2002: 59-76.
- Zhu Yanlan, Min Jin, Zhou Yaqian, Huang Xuanjing, Wu Lide. Semantic orientation computing of words based on HowNet. Journal of Chinese Information Processing 20(1), 2006: 16-22.
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119. 2013.
- Chen, Xinxiong, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In EMNLP, pp. 1025-1035. 2014.
- Rothe, Sascha, and Hinrich Schütze. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In ACL, 2015.
- Sun Maosong, Chen Xinxiong. Embeddings for Words and Word Senses Based on a Human-Annotated Knowledge Base: A Case Study of HowNet. Journal of Chinese Information Processing 30(6), 2016: 1-6.
- Yilin Niu, Ruobing Xie, Zhiyuan Liu, Maosong Sun. Improved Word Representation Learning with Sememes. In ACL, 2017.
- Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, Maosong Sun. Lexical Sememe Prediction via Word Embeddings and Matrix Factorization. In IJCAI, 2017.
- Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, Maosong Sun. Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention. In AAAI, 2018.