Abstract: This paper describes how to configure Chinese word-segmentation (tokenizer) plugins on Solr, covering mmseg4j, IKAnalyzer, and Ansj, and summarizes performance tests from three angles: segmentation quality, index-creation performance, and search efficiency. It assumes the reader already has a working knowledge of Solr; for Solr's baseline performance figures, see the earlier Solr blog post.
Premise:
Solr provides a complete data-retrieval solution. Test machine: 4-core CPU, 16 GB RAM, Gigabit Ethernet.
Requirements:
1. Solr index creation must be reasonably fast.
2. Chinese word segmentation and search must be fast.
3. Chinese word segmentation must be reasonably accurate.
Note:
The following compares different Chinese tokenizers configured on Solr.
1. Chinese word segmentation
1.1 Overview of Chinese tokenizers

Name        Last update  Speed (figures reported online; 1W = 10,000)                       Extensibility / notes
mmseg4j     2013         complex 60W words/s (1200 KB/s); simple 100W words/s (1900 KB/s)   Sogou thesaurus or custom dictionary; modes: complex / simple / MaxWord
IKAnalyzer  2012         IK2012 160W words/s (3000 KB/s)                                    user-dictionary extension, custom stop words; modes: smart / fine-grained
Ansj        2014         BaseAnalysis 300W words/s; HlAnalysis 40W words/s                  user-defined dictionary, part-of-speech analysis, new-word discovery
Paoding     2008         100W words/s                                                       unlimited user-defined thesauri
Note:
A Chinese tokenizer may not be compatible with the latest Lucene version; a "TokenStream contract violation" error can occur when the tokenizer is configured. For mmseg4j, you need to modify the source of com.chenlb.mmseg4j.analysis.MMSegTokenizer: add super.reset() inside reset(), then recompile and replace the original jar.
1.2 mmseg4j
Segmentation effect:
Field value content:
Beijing Times reported on January 23, 2009 that a strong cold-air mass from central Siberia brought strong winds and a sharp temperature drop to the city. The day's highest temperature was -7 degrees Celsius, accompanied by a northerly wind of force 6 to 7.
Words added to the thesaurus:
JINGWAH, 뭄내, ぼおえ, received a share
Type          Result (excerpt)
textMaxWord   | celsius | degree | at the same time | with | 6 | to | 7 | level | northerly wind
textComplex   northerly wind
textSimple    northerly wind
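For reference, the three mmseg4j field types compared above are typically declared in schema.xml along these lines (a sketch only; the factory class name, mode values, and dicPath attribute are taken from mmseg4j's documented Solr integration and may differ across versions):

```xml
<!-- Sketch: mmseg4j fieldType definitions for the three modes tested.
     Class name, mode values, and dicPath are assumptions from mmseg4j's docs. -->
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
               mode="max-word" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
               mode="complex" dicPath="dic"/>
  </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
               mode="simple" dicPath="dic"/>
  </analyzer>
</fieldType>
```

The dicPath directory is where custom dictionary files (e.g. the Sogou thesaurus additions mentioned above) would be placed.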
L index creation efficiency:
17 fields of various types. Based on the fields in Solr blog, select an empty string type field to change to a new type, and write the text content (the original plain text size is about 400B, and the solrinputdocument object size is about 1130b).
The text is composed of 20 words in the lexicon. Each word is about 3 words, and each sentence is about 60 words.
The total data volume is 2000 W pieces of data, the same configuration as Section 2.2.
Field type     Creation time (s)   Index size (GB)   Network (MB/s)   Rate (W docs/s)
textMaxWord    3115                4.95              6                0.64 (38W words/s)
textComplex    4860                4.3               5                0.41 (25W words/s)
textSimple     3027                4.32              6.5              0.66 (40W words/s)
string         2350                9.08              8                0.85 (57W words/s)
Speed: under the same configuration as Section 1.2 of the Solr blog, index creation with segmentation is slower than without.
Size: a segmented index is smaller than an unsegmented one. Testing showed that configuring autoGeneratePhraseQueries="false" on the segmented field has little impact on index size.
Search efficiency:
The text consists of 20 words from the lexicon; each word is about 3 characters, each sentence about 60 characters; the total data volume is 20 million documents.
Field type     Keyword                      Search time (ms)   Results
textMaxWord    No, no, no                   180                2556
textComplex    No, no, no                   59                 2648
textSimple     No, no, no                   62                 2622
string         *No, no, no*                 20000              2689
textMaxWord    One country, two systems     22                 2620
textComplex    One country, two systems     12                 2687
textSimple     One country, two systems     10                 2670
string         *One country, two systems*   15500              2657
textMaxWord    some                         24                 15999
textComplex    some                         11                 2687
textSimple     some                         9                  2665
string         *some*                       14200              15758
textMaxWord    Toss and turn                15                 2622
textComplex    Toss and turn                5                  2632
textSimple     Toss and turn                9                  2676
string         *Toss and turn*              15600              2665
Supplement:
For non-Chinese tokens (numbers, English) and traditional characters, it is enough to add the new words to the dictionary.
mmseg4j cannot split "easy" out of "all starts with easy"; the segmentation result keeps "all starts with easy" as a whole.
The textMaxWord type is the one generally recommended online.
1.3 IKAnalyzer
Segmentation effect:
The field value content and the thesaurus additions are the same as in 1.2.
The segmented field is configured with autoGeneratePhraseQueries="false".

Type           Result (excerpt)
Fine-grained   under zero | under zero | under seven | Celsius | degree | 7 | level | of | northerly wind | northerly wind | northerly wind
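The fine-grained setup above can be declared in schema.xml roughly as follows (a sketch; the IKTokenizerFactory class name and useSmart flag are assumptions based on IKAnalyzer's commonly documented Solr integration, where useSmart="false" selects fine-grained mode and useSmart="true" selects smart mode):

```xml
<!-- Sketch: IKAnalyzer fieldType in fine-grained mode.
     Factory class name and useSmart flag are assumptions from IK's docs. -->
<fieldType name="text_ik" class="solr.TextField" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false"/>
  </analyzer>
</fieldType>
```

User dictionaries and stop-word lists are registered separately in IKAnalyzer's own configuration file on the classpath.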
Index creation efficiency:

Field type     Creation time (s)   Index size (GB)   Network (MB/s)   Rate (W docs/s)
Fine-grained   3584                5.06              6                0.56 (33W words/s)
Speed: compared with 1.2, index creation is slightly slower than with mmseg4j.
Size: the segmented index is slightly larger than mmseg4j's.
Search efficiency:

Field type     Keyword                    Search time (ms)   Results
Fine-grained   No, no, no                 400                5949255
Fine-grained   One country, two systems   500                6558449
Fine-grained   some                       300                5312103
Fine-grained   Toss and turn              15                 10588
Supplement:
mmseg4j's textMaxWord splits "one does not do, two does not stop" into two tokens;
IKAnalyzer's fine-grained mode splits the same phrase into several overlapping tokens;
so, even though both use autoGeneratePhraseQueries="false", a search for "one does not do, two does not stop" returns far more results with IKAnalyzer than with mmseg4j.
1.4 Ansj
Segmentation effect:
The field value content is the same as in 1.2; no words were added to the thesaurus.

<fieldType name="text_ansj" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" conf="ansj.conf" rmPunc="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" analysisType="1" rmPunc="true"/>
  </analyzer>
</fieldType>
Result (excerpt):
Celsius | Celsius | degree | , | at the same time | accompanied by | with | 6 | to | 7 | degree | of | northerly | northerly | wind | 。
After segmentation, the word "Jinghua" came out as different characters; according to other users, Ansj had a bug that turns uncommon words into other characters.
Index creation efficiency:

Field type   Creation time (s)   Index size (GB)   Network (MB/s)   Rate (W docs/s)
text_ansj    3815                5.76              5.2              0.52 (31W words/s)
Speed: compared with 1.2 and 1.3, index creation is slightly slower than with mmseg4j or IKAnalyzer.
Size: the segmented index is slightly larger than with mmseg4j or IKAnalyzer.
Search efficiency:

Keyword                    Search time (ms)   Results
No, no, no                 200                2478
One country, two systems   15
some                       25                 15665
Toss and turn              6                  2655
1.5 Summary
Searches run against the segmented tokens. If the segmented field is configured with autoGeneratePhraseQueries="false", the search condition is segmented first, and each resulting token is then matched within the result set; the default is true. autoGeneratePhraseQueries="false" has no effect on index-creation speed or on search results. Alternatively, you can modify Solr's query parser: segment the input string first, then search the index with the segmentation results.
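For illustration, the attribute sits on the fieldType declaration (a sketch; the tokenizer line is an assumption reused from the mmseg4j setup above, and the type name matches the tables in 1.2):

```xml
<!-- Sketch: disabling automatic phrase queries on a segmented field type,
     so a multi-token search term matches individual tokens, not a phrase. -->
<fieldType name="textMaxWord" class="solr.TextField"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word"/>
  </analyzer>
</fieldType>
```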
Both exact search and fuzzy (*) search operate on tokens; an exact search returns all documents containing the token.
A tokenizer can recognize words, letters, digits, and so on.
A string field without segmentation can only be searched with fuzzy (*) queries, which match character by character.
Searching a segmented index is faster; without segmentation all documents must be scanned, which is slower.
When segmentation is required, segmentation speed is the main bottleneck.
In short, mmseg4j is the first choice for Chinese word segmentation.
If you need the specific test code, you can contact me.
Welcome to join "I Love Machine Learning" QQ group 14: 336582044
Scan the QR code on WeChat to follow the "I Love Machine Learning" official account.
Weibo: I Love Machine Learning