Performance Comparison of Chinese Word Segmenters


Abstract: This post covers configuring Chinese word segmenters on Solr and summarizes their performance tests, comparing mmseg4j, IKAnalyzer, and ansj on segmentation quality, index-creation performance, and search efficiency. It assumes the reader already has a working knowledge of Solr; for Solr's baseline performance figures, see the earlier Solr post.

Premise:

Solr provides a complete data-retrieval stack. Test machine: four-core CPU, 16 GB RAM, Gigabit network.

Requirements:

1. Index creation in Solr must be reasonably fast.

2. Chinese segmentation and search must be fast.

3. Chinese segmentation must be reasonably accurate.

Note:

The following compares different Chinese word segmenters configured on Solr.

1. Chinese word segmentation

1.1 Overview of Chinese word segmenters

| Segmenter | Last update | Speed (as claimed online) | Extensibility and notes |
|-----------|-------------|---------------------------|-------------------------|
| mmseg4j | 2013 | complex: 600K words/s (1200 KB/s); simple: 1,000K words/s (1900 KB/s) | Sogou thesaurus or custom dictionaries; modes: complex / simple / MaxWord |
| IKAnalyzer | 2012 | IK2012: 1,600K words/s (3000 KB/s) | User dictionary extensions, custom stop words; modes: smart / fine-grained |
| ansj | 2014 | BaseAnalysis: 3,000K words/s; NlpAnalysis: 400K words/s | User-defined dictionaries, part-of-speech tagging, new-word discovery |
| Paoding | 2008 | 1,000K words/s | Unlimited user-defined thesauri |

Note:

A Chinese segmenter may not be compatible with the latest Lucene: configuring one can raise a "TokenStream contract violation" error. For mmseg4j, the fix is to modify the source of com.chenlb.mmseg4j.analysis.MMSegTokenizer, add a super.reset() call inside reset(), then recompile and replace the original jar.
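A minimal sketch of that patch, assuming the class layout of mmseg4j's published source; only the super.reset() call is the actual change, and the mmSeg field name is an assumption:

```java
// Hypothetical sketch of the patch described above: everything except the
// super.reset() call is paraphrased from mmseg4j's source layout.
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;

import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.Seg;

public class MMSegTokenizer extends Tokenizer {
    private final MMSeg mmSeg; // the underlying segmenter (assumed field name)

    public MMSegTokenizer(Seg seg, Reader input) {
        super(input); // Lucene 4.x-style Tokenizer constructor
        this.mmSeg = new MMSeg(input, seg);
    }

    @Override
    public void reset() throws IOException {
        super.reset();      // the fix: newer Lucene enforces the TokenStream contract
        mmSeg.reset(input); // re-bind the segmenter to the (possibly new) input Reader
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Real implementation omitted: it pulls the next Word from mmSeg and
        // fills the term/offset attributes, returning true until end of input.
        return false;
    }
}
```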

1.2  mmseg4j

Index-creation effect:

FieldValue content:

Beijing Times reported on January 23, 2009 that a strong cold front from central Siberia brought high winds and a sharp temperature drop to the city; the day's high was -7 degrees Celsius, accompanied by a force 6 to 7 northerly wind.

Added to the thesaurus:

Jinghua (京华) and a few other custom words.

Type → Result:

- textMaxWord: … | Celsius | degree | at the same time | with | 6 | to | 7 | force | northerly wind
- textComplex: … | northerly wind
- textSimple: … | northerly wind
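To reproduce the mode comparison outside Solr, the three Seg implementations can be driven directly through mmseg4j's API; a minimal sketch, where the sample sentence is a placeholder and the default dictionary path is assumed:

```java
import java.io.IOException;
import java.io.StringReader;

import com.chenlb.mmseg4j.ComplexSeg;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MMSeg;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.Seg;
import com.chenlb.mmseg4j.SimpleSeg;
import com.chenlb.mmseg4j.Word;

public class MMSegModeDemo {
    public static void main(String[] args) throws IOException {
        Dictionary dic = Dictionary.getInstance(); // loads bundled and custom .dic files
        String text = "当天最高气温零下7摄氏度";      // placeholder sentence
        // Run the same text through all three modes for a side-by-side look.
        for (Seg seg : new Seg[] { new SimpleSeg(dic), new ComplexSeg(dic), new MaxWordSeg(dic) }) {
            MMSeg mmSeg = new MMSeg(new StringReader(text), seg);
            StringBuilder out = new StringBuilder();
            for (Word word = mmSeg.next(); word != null; word = mmSeg.next()) {
                out.append(word.getString()).append(" | ");
            }
            System.out.println(seg.getClass().getSimpleName() + ": " + out);
        }
    }
}
```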

Index-creation efficiency:

17 fields of various types. Starting from the fields of the Solr post, an empty string-type field was switched to the new type and filled with text (raw plain text about 400 B; the SolrInputDocument object about 1130 B).

Each text is composed of 20 words drawn from the lexicon; each word is about 3 characters and each sentence about 60 characters.

Total data volume is 20 million documents, with the same configuration as section 2.2.

| Field type | Creation time (s) | Index size (GB) | Network (MB/s) | Rate (10K docs/s) |
|------------|-------------------|-----------------|----------------|-------------------|
| textMaxWord | 3115 | 4.95 | 6 | 0.64 (380K words/s) |
| textComplex | 4860 | 4.3 | 5 | 0.41 (250K words/s) |
| textSimple | 3027 | 4.32 | 6.5 | 0.66 (400K words/s) |
| string | 2350 | 9.08 | 8 | 0.85 (570K words/s) |

Speed: with the same configuration as section 1.2 of the Solr post, creating a segmented index is slower than creating an unsegmented one.

Size: the segmented index is smaller than the unsegmented one. Testing showed that configuring the segmented field with autoGeneratePhraseQueries="false" has little impact on index size.

Data-search efficiency:

The text content is composed of 20 words from the lexicon (each word about 3 characters, each sentence about 60 characters); total data volume is 20 million documents.

| Field type | Keyword | Search time (ms) | Results (docs) |
|------------|---------|------------------|----------------|
| textMaxWord | No, no, No | 180 | 2556 |
| textComplex | No, no, No | 59 | 2648 |
| textSimple | No, no, No | 62 | 2622 |
| string | *No, no, No* | 20,000 | 2689 |
| textMaxWord | One country, two systems | 22 | 2620 |
| textComplex | One country, two systems | 12 | 2687 |
| textSimple | One country, two systems | 10 | 2670 |
| string | *One country, two systems* | 15,500 | 2657 |
| textMaxWord | some | 24 | 15,999 |
| textComplex | some | 11 | 2687 |
| textSimple | some | 9 | 2665 |
| string | *some* | 14,200 | 15,758 |
| textMaxWord | Toss and turn | 15 | 2622 |
| textComplex | Toss and turn | 5 | 2632 |
| textSimple | Toss and turn | 9 | 2676 |
| string | *Toss and turn* | 15,600 | 2665 |

(For the string field, the keyword is wrapped in * wildcards, since an unsegmented field can only be searched fuzzily.)

Supplement:

For non-Chinese tokens (numbers, English) and traditional characters, it is enough to add the new words to the dictionary.

mmseg4j cannot split "easy" out of "all starts with easy"; the phrase comes through segmentation unsplit.

textMaxWord is the segmentation type generally recommended online.

1.3  IKAnalyzer

Index-creation effect:

The fieldValue content and thesaurus additions are the same as in 1.2.

The segmented field is configured with autoGeneratePhraseQueries="false".

Type → Result:

- Fine-grained: … | below zero | below zero | below seven | Celsius | degree | 7 | force | of | northerly wind | northerly wind | northerly wind
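For reference, the fine-grained versus smart behavior can be checked outside Solr through IKAnalyzer's core segmenter; a minimal sketch assuming the IK Analyzer 2012 API, with a placeholder sample sentence:

```java
import java.io.IOException;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKDemo {
    public static void main(String[] args) throws IOException {
        String text = "当天最高气温零下7摄氏度"; // placeholder sentence
        // false = fine-grained segmentation (the mode tested here); true = smart mode
        IKSegmenter ik = new IKSegmenter(new StringReader(text), false);
        StringBuilder out = new StringBuilder();
        for (Lexeme lex = ik.next(); lex != null; lex = ik.next()) {
            out.append(lex.getLexemeText()).append(" | ");
        }
        System.out.println(out);
    }
}
```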

Index-creation efficiency:

| Field type | Creation time (s) | Index size (GB) | Network (MB/s) | Rate (10K docs/s) |
|------------|-------------------|-----------------|----------------|-------------------|
| Fine-grained | 3584 | 5.06 | 6 | 0.56 (330K words/s) |

Speed: compared with 1.2, index creation is slightly slower than with mmseg4j.

Size: the segmented index is slightly larger than mmseg4j's.

Data-search efficiency:

| Field type | Keyword | Search time (ms) | Results (docs) |
|------------|---------|------------------|----------------|
| Fine-grained | No, no, No | 400 | 5,949,255 |
| Fine-grained | One country, two systems | 500 | 6,558,449 |
| Fine-grained | some | 300 | 5,312,103 |
| Fine-grained | Toss and turn | 15 | 10,588 |

Supplement:

mmseg4j's textMaxWord splits the idiom 一不做二不休 into two tokens: 一不做 and 二不休. IKAnalyzer's fine-grained mode splits it into overlapping tokens: the full idiom plus its sub-words. Since autoGeneratePhraseQueries="false" is used here as well, a search for the idiom matches any of those sub-tokens, which is why IKAnalyzer returns far more results than mmseg4j.

1.4 Ansj

Index-creation effect:

The fieldValue content is the same as in 1.2; nothing was added to the thesaurus.

```xml
<fieldType name="text_ansj" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" conf="ansj.conf" rmPunc="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" analysisType="1" rmPunc="true"/>
  </analyzer>
</fieldType>
```

Result:

… | Celsius | Celsius | degree | , | at the same time | accompanied by | with | 6 | to | 7 | force | of | northerly | northerly | wind | 。

After segmentation the custom word "Jinghua" (京华) no longer came out as one token; reportedly ansj had a bug that broke unfamiliar words down into single characters.
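For reference, ansj's analyzers can also be called directly as static methods, which is how speed claims like those in section 1.1 are usually benchmarked; a minimal sketch using class names from the ansj_seg project's public API, with a placeholder sentence:

```java
import org.ansj.splitWord.analysis.BaseAnalysis;
import org.ansj.splitWord.analysis.ToAnalysis;

public class AnsjDemo {
    public static void main(String[] args) {
        String text = "当天最高气温零下7摄氏度"; // placeholder sentence
        // BaseAnalysis: dictionary-only and fastest; ToAnalysis: the standard mode
        System.out.println(BaseAnalysis.parse(text));
        System.out.println(ToAnalysis.parse(text));
    }
}
```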

Index-creation efficiency:

| Field type | Creation time (s) | Index size (GB) | Network (MB/s) | Rate (10K docs/s) |
|------------|-------------------|-----------------|----------------|-------------------|
| text_ansj | 3815 | 5.76 | 5.2 | 0.52 (310K words/s) |

Speed: compared with 1.2 and 1.3, index creation is slightly slower than with mmseg4j or IKAnalyzer.

Size: the segmented index is slightly larger than with mmseg4j or IKAnalyzer.

Data-search efficiency:

| Keyword | Search time (ms) | Results (docs) |
|---------|------------------|----------------|
| No, no, No | 200 | 2478 |
| One country, two systems | 15 | |
| some | 25 | 15,665 |
| Toss and turn | 6 | 2655 |

1.5 Summary

Searches operate on the segmented terms. If the segmented field is configured with autoGeneratePhraseQueries="false", the query string is itself segmented first and each resulting term is searched independently; the default is true, which keeps the query segments together as a phrase. The setting has no effect on index-creation speed, but it does change the search results (see 1.3). Alternatively, Solr's query parser can be modified so that an input string is segmented first and the segmentation results are then looked up in the index. (The test queries themselves are plain keyword searches; see the sketch at the end of this section.)

Both precise and fuzzy (*) searches operate on terms; a precise search returns all documents containing the segmented terms.

The segmenters recognize words, letters, digits, and so on.

A string field with no segmentation can only be searched with fuzzy * wildcards, matching character by character.

Searching within a segmented index is fast; without segmentation every document must be scanned, which is slow.

When segmentation is required, the speed of the segmenter itself is the main bottleneck.

In short, mmseg4j is the first choice for Chinese word segmentation.
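As referenced in the summary, the searches above are plain keyword queries against the segmented field; a minimal SolrJ sketch, where the URL, core name, and field name are placeholders and the SolrJ API shown is newer than what the original tests likely used:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name; adjust to your deployment.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        // Keyword search on the segmented field: the query string is itself
        // segmented, and the resulting terms are looked up in the inverted index.
        QueryResponse rsp = solr.query(new SolrQuery("content:一国两制"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.close();
    }
}
```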

If you need specific test code, you can contact me.

Welcome to join the I Love Machine Learning QQ group (No. 14): 336582044.

Scan on WeChat to follow the I Love Machine Learning official account.

Weibo: I Love Machine Learning