Analyzers are used when building the index and also when analyzing the query string.
A Lucene Analyzer is a pipeline: one Tokenizer followed by N TokenFilters (N >= 0). In addition, N CharFilters can be configured in front of the Tokenizer.
The responsibilities of the individual components are as follows:
character filters
Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert “&” characters to the word “and”.
tokenizers
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
token filters
Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
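To see this pipeline in action you can push text through the _analyze API. The following is only a sketch, assuming a reasonably recent ES where _analyze accepts a JSON body (older releases take the same options as query-string parameters); the sample text is invented:

```
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer":   "standard",
  "filter":      ["lowercase"],
  "text":        "<p>The QUICK &amp; brown fox</p>"
}
```

The html_strip char filter removes the markup and decodes the entity, the standard tokenizer splits on word boundaries (dropping the lone &), and the lowercase filter normalizes case, so the result should be roughly the tokens the, quick, brown, fox.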
ES (or rather Lucene) provides roughly the following components; a quick way to try them out with the _analyze API is sketched after the list:
- character filters
- tokenizers
- token filters
    - standard token filter
    - ascii folding token filter
    - length token filter
    - lowercase token filter
    - uppercase token filter
    - ngram token filter
    - edge ngram token filter
    - porter stem token filter (stemming; must come after a lowercase filter/tokenizer)
    - shingle token filter
    - stop token filter
    - word delimiter token filter
    - stemmer token filter
    - stemmer override token filter
    - keyword marker token filter
    - keyword repeat token filter
    - kstem token filter
    - snowball token filter
    - phonetic token filter
    - synonym token filter
    - compound word token filter
    - reverse token filter
    - elision token filter
    - truncate token filter
    - unique token filter
    - pattern capture token filter
    - pattern replace token filter
    - trim token filter
    - limit token count token filter
    - hunspell token filter
    - common grams token filter
    - normalization token filter
    - cjk width token filter
    - cjk bigram token filter
    - delimited payload token filter
    - keep words token filter
    - keep types token filter
    - classic token filter
    - apostrophe token filter
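Individual tokenizers and token filters from the list above can be tried out ad hoc with the same _analyze API, for example to verify the note that the porter stem filter needs a lowercase step in front of it. A sketch (sample text invented, same _analyze assumptions as in the earlier sketch):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter":    ["lowercase", "porter_stem"],
  "text":      "Running FOXES jumped"
}
```

With lowercase in front, porter_stem should reduce the tokens to something like run, fox, jump; without it, the stemmer may not recognize the uppercase forms at all.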
Configuring a custom analyzer in ES is straightforward. Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myGreekLowerCaseFilter]
                char_filter : [my_html]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myGreekLowerCaseFilter :
                type : lowercase
                language : greek
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
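Once an index picks up these settings, the analyzer can be checked with the _analyze API. A sketch, assuming an index named my_index exists so that myAnalyzer2 is registered (older versions pass the analyzer name as a query parameter instead of a JSON body):

```
GET /my_index/_analyze
{
  "analyzer": "myAnalyzer2",
  "text":     "<b>stop1 and SOME Greek TEXT</b>"
}
```

The my_html char filter strips the <b> tags first, myTokenizer1 splits the remainder, myTokenFilter1 drops stop1, and myGreekLowerCaseFilter lowercases whatever is left.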
Another example:
index :
    analysis :
        analyzer :
            standard :
                type : standard
                stopwords : [stop1, stop2]
            myAnalyzer1 :
                type : standard
                stopwords : [stop1, stop2, stop3]
                max_token_length : 500
            # configure a custom analyzer which is
            # exactly like the default standard analyzer
            myAnalyzer2 :
                tokenizer : standard
                filter : [standard, lowercase, stop]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
            myTokenizer2 :
                type : keyword
                buffer_size : 512
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
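Analyzers defined this way are referenced by name from the mapping. A minimal sketch (index type and field names are invented, old-style string mapping syntax):

```
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "myAnalyzer1"
        }
      }
    }
  }
}
```

The same analyzer is then used both at index time and for analyzing query strings against that field, unless a separate search_analyzer is configured.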
ES already ships with a good number of built-in analyzers which, in most cases, can be used as-is without any customization. These built-in analyzers are in fact assembled from the same char filters, tokenizers and token filters listed above. Conversely, if a built-in analyzer does not meet your requirements, it is easy to roll your own with a custom analyzer.
- standard analyzer
    - Standard Tokenizer => Standard Token Filter => Lower Case Token Filter => Stop Token Filter
- simple analyzer
    - Lower Case Tokenizer
- whitespace analyzer
    - Whitespace Tokenizer
- stop analyzer
    - Lower Case Tokenizer => Stop Token Filter
- keyword analyzer
- pattern analyzer
- snowball analyzer
    - standard tokenizer => standard filter => lowercase filter => stop filter => snowball filter
- language analyzers
    - stopwords (the default stop list can be overridden; see the example after this list)
    - excluding words from stemming (via stem_exclusion; see the example after this list)
    - The built-in language analyzers can be reimplemented as custom analyzers (as described below) in order to customize their behaviour.
    - arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
- custom analyzer
    - tokenizer: The logical / registered name of the tokenizer to use.
    - filter: An optional list of logical / registered names of token filters.
    - char_filter: An optional list of logical / registered names of char filters.
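As referenced in the language analyzers item above, built-in language analyzers that support it accept a stopwords list and a stem_exclusion list directly, without being rebuilt from scratch. A sketch (analyzer name and word lists are invented):

```
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords":      ["a", "an", "the"],
          "stem_exclusion": ["organization", "organizations"]
        }
      }
    }
  }
}
```

stopwords replaces the default _english_ stop list, and stem_exclusion keeps the listed words from being altered by the stemmer.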
TIPS
Everything listed above is built into ES. If that is not enough for your use case, look for a plugin that covers it, for example the [icu analysis plugin](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html).
Language analyzers we will use
1. arabic analyzer
The arabic analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        }
      },
      "analyzer": {
        "arabic": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}
2. cjk analyzer
The cjk analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "cjk": {
          "tokenizer": "standard",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop"
          ]
        }
      }
    }
  }
}
Our main target language is Japanese. If the cjk analyzer does not work well enough, consider the Japanese (Kuromoji) Analysis plugin instead.
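To get a feel for what the cjk analyzer does with Japanese, run a short sample through it. A sketch (the behaviour described is the usual CJK bigram behaviour, not a quote from the documentation):

```
GET _analyze
{
  "analyzer": "cjk",
  "text":     "東京都"
}
```

The cjk_bigram filter turns the run of ideographs into overlapping bigrams, so the output should be roughly 東京 and 京都. This sliding-window approach needs no dictionary, which is also why a dictionary-based tokenizer such as Kuromoji usually gives better precision for Japanese.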
3. english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
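For a quick sanity check of the chain above, analyze a sample sentence with the built-in analyzer. A sketch with an invented sentence:

```
GET _analyze
{
  "analyzer": "english",
  "text":     "The quick Brown Fox's jumps"
}
```

The possessive stemmer strips the 's from Fox's, lowercase and the _english_ stop list remove The, and the stemmer reduces the remaining words, so the tokens should be roughly quick, brown, fox, jump.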
4. hindi analyzer
The hindi analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type": "stop",
          "stopwords": "_hindi_"
        },
        "hindi_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "hindi_stemmer": {
          "type": "stemmer",
          "language": "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indic_normalization",
            "hindi_normalization",
            "hindi_stop",
            "hindi_keywords",
            "hindi_stemmer"
          ]
        }
      }
    }
  }
}
5. indonesian analyzer
The indonesian analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "indonesian_stop": {
          "type": "stop",
          "stopwords": "_indonesian_"
        },
        "indonesian_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "indonesian_stemmer": {
          "type": "stemmer",
          "language": "indonesian"
        }
      },
      "analyzer": {
        "indonesian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indonesian_stop",
            "indonesian_keywords",
            "indonesian_stemmer"
          ]
        }
      }
    }
  }
}
6. portuguese analyzer
The portuguese analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "portuguese_stop": {
          "type": "stop",
          "stopwords": "_portuguese_"
        },
        "portuguese_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "portuguese_stemmer": {
          "type": "stemmer",
          "language": "light_portuguese"
        }
      },
      "analyzer": {
        "portuguese": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "portuguese_stop",
            "portuguese_keywords",
            "portuguese_stemmer"
          ]
        }
      }
    }
  }
}
7. thai analyzer
The thai analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        }
      },
      "analyzer": {
        "thai": {
          "tokenizer": "thai",
          "filter": [
            "lowercase",
            "thai_stop"
          ]
        }
      }
    }
  }
}