

Introduction to a knowledge graph-based question answering system on the NLPCC2016 KBQA dataset

Posted by verstraete at 2020-02-25

By Guo Yazhi

Beijing University of Chemical Technology

Research directions: NLP, knowledge graphs, dialogue / Q&A systems

I think that when learning something, running experiments directly is one of the most effective ways to improve. After reading many theoretical introductions and abstract descriptions, my understanding was still vague. Working directly with the dataset and combining the experiments with the theory gives a deeper understanding. This article also records my process of learning KBQA, and I hope it helps students who are just getting started.

I have been doing KBQA-related work recently, and after forming a general understanding and some initial ideas, I hope to learn more through experiments.

At present, KBQA focuses on simple knowledge base Q&A: given a question, extract a triple, generate a SPARQL statement, and query the knowledge graph to return the answer.
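As a sketch of the last step of this pipeline, a (entity, attribute) pair predicted from the question can be turned into a SPARQL query by template. The prefix-free property names below are illustrative placeholders, not the actual schema of the NLPCC knowledge base:

```python
def build_sparql(entity: str, attribute: str) -> str:
    """Assemble a simple SPARQL query for a (entity, attribute, ?answer)
    triple. Property naming here is a hypothetical placeholder."""
    return (
        "SELECT ?answer WHERE { "
        f'?s <name> "{entity}" . '
        f"?s <{attribute}> ?answer . "
        "}"
    )

query = build_sparql("Barack Obama", "wife")
```

Running this query against the knowledge graph endpoint would return the attribute value as the answer.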

Finally, the NLPCC2016 KBQA dataset is selected, and the baseline model is BERT.

NLPCC is the Conference on Natural Language Processing and Chinese Computing, an annual academic meeting organized by the Technical Committee on Chinese Information Technology of the China Computer Federation (CCF). It focuses on academic and application innovation in natural language processing and Chinese computing.

The dataset used here comes from the NLPCC-ICCPOL 2016 KBQA task, which contains 14,609 Q&A pairs for training and 9,870 for testing. It also provides a knowledge base with 6,502,738 entities, 587,875 attributes, and 43,063,796 triples.

Each line of the knowledge base file stores one fact as a triple (entity, attribute, attribute value). The statistics of each file are as follows:

The sample knowledge base is as follows:
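Parsing such a line into a triple is straightforward; a minimal sketch (the `|||` field separator is an assumption about the dump format and may need adjusting to the actual file):

```python
def parse_kb_line(line, sep="|||"):
    """Split one knowledge-base line into an (entity, attribute, value)
    triple. The '|||' separator is an assumed format, not confirmed."""
    parts = [p.strip() for p in line.split(sep)]
    if len(parts) != 3:
        return None  # skip malformed lines
    return tuple(parts)

triple = parse_kb_line("Barack Obama ||| wife ||| Michelle Obama")
```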

The original data contains only question-answer pairs, without triples. The Q&A data I used comes from the preprocessing released by the first-place team in the competition:


Triples are constructed by searching the knowledge base in reverse from the answer, then filtering candidate entities against the question; even after filtering, a small amount of noisy data remains. The triples are then used to build datasets for tasks such as entity recognition and attribute selection.

Examples of Q & A are as follows:

Ambiguity among knowledge base entities

Taking "Barack Obama" as an example, the questions and answers concerning the entity are as follows:

Query the triples containing the entity in the knowledge base, and the result is as follows (partial):

First, there are multiple "Barack Obama" entities in the knowledge base, possibly the result of fusing multiple data sources, so information alignment cannot be fully guaranteed. Looking at the "wife" attribute, some triples give "Michelle LaVaughn Obama" and others "Michelle Obama", while the answer in the Q&A pair is "Michelle Obama". So even when the model retrieves a correct triple:

Although both the entity and the attribute are matched correctly, the final answer may still be judged wrong.

Entity ambiguity in questions

Take "doctor visiting" as an example, the questions and answers related to the entity are as follows:

Query the triples containing the entity in the knowledge base, and the result is as follows (partial):

The question asks: "When was Doctor Visiting created?", which relates to a "time" attribute. This work was created by many people in different periods, and from the question alone we cannot tell which artist's creation time is being asked about.

Therefore, the entity involved in the question is ambiguous. As before, even when the model retrieves what we consider the correct entity and attribute, the result may still be judged a wrong answer.

When an entity is associated with too many triples in the knowledge base, the effectiveness and efficiency of the retrieval model also become a challenge.

In a knowledge base with about 43 million triples, a single entity can retrieve a large number of related triples (tens or even hundreds); combined with the two ambiguity problems above, recognition effectiveness and efficiency become serious problems.

The above two issues have little effect on the entity recognition and attribute extraction experiments, but greatly affect the final step of retrieving the answer triple from the knowledge base after entity linking.

Cleaning the training data, test data, and knowledge base

Filter the attributes: remove noise symbols such as '-', '•', and spaces, and lowercase each line with lower().
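A minimal sketch of this cleaning step (the exact list of noise symbols is an assumption based on the examples above):

```python
def clean_text(text):
    """Strip noise symbols and lowercase, matching the cleaning step
    described above. The noise-symbol list is illustrative."""
    for ch in ("-", "\u2022", " "):  # '-', '•', space
        text = text.replace(ch, "")
    return text.lower()

cleaned = clean_text("Out-of-Print \u2022 Books")
```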

Respectively saved as: train_clean.csv, test_clean.csv, nlpcc-iccpol-2016-clean.kbqa.kb.

Structure development set

The original training set has 14,609 examples. After shuffling, 2,609 are taken as the development set and the rest as the training set, as follows.
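The shuffle-and-split step can be sketched as below; the seed is arbitrary, chosen only so the split is reproducible:

```python
import random

def split_train_dev(samples, dev_size=2609, seed=42):
    """Shuffle and carve out a development set of `dev_size` examples,
    returning (train, dev). The seed is an arbitrary choice."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples[dev_size:], samples[:dev_size]

train, dev = split_train_dev(range(14609))
```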

They are respectively saved as: train_clean.csv, dev_clean.csv, test_clean.csv.

Construct training set, development set and test set of entity recognition

To construct the entity recognition dataset, we label each question using the entity of its reverse-annotated triple. Since we extract a single entity from each question, instead of BIO tagging we use simple 0/1 tagging: 0 marks non-entity characters and 1 marks entity characters.

At the same time, we must ensure that the entity appears intact in the question; examples where it does not are simply deleted. An example of such an error is as follows:

The filtered dataset information is as follows:

An example of the filtered dataset is as follows:

Save them as: entity_train.csv, entity_dev.csv, and entity_test.csv.
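The 0/1 tagging and the integrity filter described above can be sketched together: the function returns None when the entity does not appear intact, so such examples can be dropped.

```python
def tag_entity(question, entity):
    """Produce 0/1 character-level labels marking where the gold entity
    occurs in the question. Returns None when the entity does not appear
    intact, so the example can be filtered out."""
    start = question.find(entity)
    if start == -1:
        return None
    labels = [0] * len(question)
    for i in range(start, start + len(entity)):
        labels[i] = 1
    return labels

labels = tag_entity("who wrote hamlet", "hamlet")
```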

The experimental results of the BERT + BiLSTM + CRF model are as follows, where accuracy is the proportion of the 9,556 test questions for which the predicted entity exactly matches the gold entity.

Examples of entities that do not match exactly are shown below: some are recognition errors, some are synonyms, and some are caused by noise.
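The exact-match accuracy used above reduces to a simple comparison of predicted and gold entity strings; a minimal sketch:

```python
def exact_match_accuracy(predicted, gold):
    """Strict exact-match accuracy over questions: a prediction counts
    only if the extracted entity string equals the gold entity."""
    assert len(predicted) == len(gold)
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)

acc = exact_match_accuracy(["obama", "paris", "mars"],
                           ["obama", "paris", "moon"])
```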

Construct training set, development set and test set for attribute extraction

1. Construct the overall attribute set of the test set by extracting and deduplicating, obtaining a relation list of 4,373 attributes;

2. Each sample consists of "question + attribute + label"; pairs using the attribute from the original data are labeled 1;

3. Randomly select five attributes from the relation list as negative samples (labeled 0).
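Steps 2-3 above can be sketched as follows; the fixed seed is only for reproducibility, and the sampling excludes the gold attribute so negatives are never accidentally positive:

```python
import random

def build_relation_samples(question, gold_attr, relation_list,
                           n_neg=5, seed=0):
    """Build one positive and `n_neg` negative (question, attribute,
    label) samples, following steps 2-3 above."""
    rng = random.Random(seed)
    candidates = [a for a in relation_list if a != gold_attr]
    negatives = rng.sample(candidates, n_neg)  # without replacement
    samples = [(question, gold_attr, 1)]
    samples += [(question, a, 0) for a in negatives]
    return samples

samples = build_relation_samples(
    "who is obama's wife", "wife",
    ["wife", "birthplace", "party", "height", "spouse", "children", "age"],
)
```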

The dataset size is as follows:

The data set samples are as follows:

They are saved as: relation_train.csv, relation_dev.csv, and relation_test.csv.

The data constructed above is used for training, with results reported on this test set. The BERT-based results are as follows, where accuracy is the true accuracy.
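At inference time, a pairwise classifier like this one is typically applied by scoring every candidate attribute against the question and taking the argmax. A sketch with a hypothetical word-overlap scorer standing in for the BERT pair classifier's probability:

```python
def select_attribute(question, candidates, score_fn):
    """Pick the candidate attribute with the highest model score.
    `score_fn` stands in for the BERT pair classifier's probability."""
    return max(candidates, key=lambda attr: score_fn(question, attr))

def overlap_score(question, attr):
    """Toy scorer: word overlap between question and attribute.
    Illustrative only; the real model is a BERT classifier."""
    return len(set(question.split()) & set(attr.split()))

best = select_attribute("what is the birth date of obama",
                        ["birth date", "wife", "party"],
                        overlap_score)
```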

Test examples misclassified by the model are shown below; it can be seen that the ability for deep semantic matching is still lacking.

After that, I will open-source the relevant code and preprocessed data on my GitHub:

