Analyzers are used when building the index and also when analyzing the query string.
A Lucene Analyzer is a pipeline: one Tokenizer followed by N TokenFilters (N >= 0). In addition, N CharFilters can be configured in front of the Tokenizer.
The responsibilities of the individual components are as follows:
character filters
Character filters are used to preprocess the string of characters before it is passed to the tokenizer. A character filter may be used to strip out HTML markup, or to convert “&” characters to the word “and”.
tokenizers
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
token filters
Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
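To see this pipeline in action you can push text through the _analyze API. The following is only a sketch, assuming a reasonably recent ES where _analyze accepts a JSON body (older releases take the same options as query-string parameters); the sample text is invented:

```
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer":   "standard",
  "filter":      ["lowercase"],
  "text":        "<p>The QUICK &amp; brown fox</p>"
}
```

The html_strip char filter removes the markup and decodes the entity, the standard tokenizer splits on word boundaries (dropping the lone &), and the lowercase filter normalizes case, so the result should be roughly the tokens the, quick, brown, fox.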
ES (or rather Lucene) provides roughly the following components; a quick way to try them out with the _analyze API is sketched after the list:
- character filters
- tokenizers
- token filters
    - standard token filter
    - ascii folding token filter
    - length token filter
    - lowercase token filter
    - uppercase token filter
    - ngram token filter
    - edge ngram token filter
    - porter stem token filter (stemming; must come after a lowercase filter/tokenizer)
    - shingle token filter
    - stop token filter
    - word delimiter token filter
    - stemmer token filter
    - stemmer override token filter
    - keyword marker token filter
    - keyword repeat token filter
    - kstem token filter
    - snowball token filter
    - phonetic token filter
    - synonym token filter
    - compound word token filter
    - reverse token filter
    - elision token filter
    - truncate token filter
    - unique token filter
    - pattern capture token filter
    - pattern replace token filter
    - trim token filter
    - limit token count token filter
    - hunspell token filter
    - common grams token filter
    - normalization token filter
    - cjk width token filter
    - cjk bigram token filter
    - delimited payload token filter
    - keep words token filter
    - keep types token filter
    - classic token filter
    - apostrophe token filter
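Individual tokenizers and token filters from the list above can be tried out ad hoc with the same _analyze API, for example to verify the note that the porter stem filter needs a lowercase step in front of it. A sketch (sample text invented, same _analyze assumptions as in the earlier sketch):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter":    ["lowercase", "porter_stem"],
  "text":      "Running FOXES jumped"
}
```

With lowercase in front, porter_stem should reduce the tokens to something like run, fox, jump; without it, the stemmer may not recognize the uppercase forms at all.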
Configuring a custom analyzer in ES is straightforward. Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myGreekLowerCaseFilter]
                char_filter : [my_html]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myGreekLowerCaseFilter :
                type : lowercase
                language : greek
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
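Once an index picks up these settings, the analyzer can be checked with the _analyze API. A sketch, assuming an index named my_index exists so that myAnalyzer2 is registered (older versions pass the analyzer name as a query parameter instead of a JSON body):

```
GET /my_index/_analyze
{
  "analyzer": "myAnalyzer2",
  "text":     "<b>stop1 and SOME Greek TEXT</b>"
}
```

The my_html char filter strips the <b> tags first, myTokenizer1 splits the remainder, myTokenFilter1 drops stop1, and myGreekLowerCaseFilter lowercases whatever is left.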
Another example:
index :
    analysis :
        analyzer :
            standard :
                type : standard
                stopwords : [stop1, stop2]
            myAnalyzer1 :
                type : standard
                stopwords : [stop1, stop2, stop3]
                max_token_length : 500
            # configure a custom analyzer which is
            # exactly like the default standard analyzer
            myAnalyzer2 :
                tokenizer : standard
                filter : [standard, lowercase, stop]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
            myTokenizer2 :
                type : keyword
                buffer_size : 512
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
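Analyzers defined this way are referenced by name from the mapping. A minimal sketch (index type and field names are invented, old-style string mapping syntax):

```
{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "myAnalyzer1"
        }
      }
    }
  }
}
```

The same analyzer is then used both at index time and for analyzing query strings against that field, unless a separate search_analyzer is configured.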
ES already ships with a good number of built-in analyzers which, in most cases, can be used as-is without any customization. These built-in analyzers are in fact assembled from the same char filters, tokenizers and token filters listed above. Conversely, if a built-in analyzer does not meet your requirements, it is easy to roll your own with a custom analyzer.
- standard analyzer
    - Standard Tokenizer => Standard Token Filter => Lower Case Token Filter => Stop Token Filter
- simple analyzer
    - Lower Case Tokenizer
- whitespace analyzer
    - Whitespace Tokenizer
- stop analyzer
    - Lower Case Tokenizer => Stop Token Filter
- keyword analyzer
- pattern analyzer
- snowball analyzer
    - standard tokenizer => standard filter => lowercase filter => stop filter => snowball filter
- language analyzers
    - stopwords (the default stop list can be overridden; see the example after this list)
    - excluding words from stemming (via stem_exclusion; see the example after this list)
    - The built-in language analyzers can be reimplemented as custom analyzers (as described below) in order to customize their behaviour.
    - arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
- custom analyzer
    - tokenizer: The logical / registered name of the tokenizer to use.
    - filter: An optional list of logical / registered names of token filters.
    - char_filter: An optional list of logical / registered names of char filters.
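As referenced in the language analyzers item above, built-in language analyzers that support it accept a stopwords list and a stem_exclusion list directly, without being rebuilt from scratch. A sketch (analyzer name and word lists are invented):

```
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type":           "english",
          "stopwords":      ["a", "an", "the"],
          "stem_exclusion": ["organization", "organizations"]
        }
      }
    }
  }
}
```

stopwords replaces the default _english_ stop list, and stem_exclusion keeps the listed words from being altered by the stemmer.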
TIPS
Everything listed above is built into ES. If that is not enough for your use case, look for a plugin that covers it, for example the [icu analysis plugin](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-icu-plugin.html).
Language analyzers we will use
1. arabic analyzer
The arabic analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        },
        "arabic_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        }
      },
      "analyzer": {
        "arabic": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}
2. cjk analyzer
The cjk analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "cjk": {
          "tokenizer": "standard",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop"
          ]
        }
      }
    }
  }
}
Our main target language is Japanese. If the cjk analyzer does not work well enough, consider the Japanese (Kuromoji) Analysis plugin instead.
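To get a feel for what the cjk analyzer does with Japanese, run a short sample through it. A sketch (the behaviour described is the usual CJK bigram behaviour, not a quote from the documentation):

```
GET _analyze
{
  "analyzer": "cjk",
  "text":     "東京都"
}
```

The cjk_bigram filter turns the run of ideographs into overlapping bigrams, so the output should be roughly 東京 and 京都. This sliding-window approach needs no dictionary, which is also why a dictionary-based tokenizer such as Kuromoji usually gives better precision for Japanese.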
3. english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
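For a quick sanity check of the chain above, analyze a sample sentence with the built-in analyzer. A sketch with an invented sentence:

```
GET _analyze
{
  "analyzer": "english",
  "text":     "The quick Brown Fox's jumps"
}
```

The possessive stemmer strips the 's from Fox's, lowercase and the _english_ stop list remove The, and the stemmer reduces the remaining words, so the tokens should be roughly quick, brown, fox, jump.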
4. hindi analyzer
The hindi analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type": "stop",
          "stopwords": "_hindi_"
        },
        "hindi_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "hindi_stemmer": {
          "type": "stemmer",
          "language": "hindi"
        }
      },
      "analyzer": {
        "hindi": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indic_normalization",
            "hindi_normalization",
            "hindi_stop",
            "hindi_keywords",
            "hindi_stemmer"
          ]
        }
      }
    }
  }
}
5. indonesian analyzer
The indonesian analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "indonesian_stop": {
          "type": "stop",
          "stopwords": "_indonesian_"
        },
        "indonesian_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "indonesian_stemmer": {
          "type": "stemmer",
          "language": "indonesian"
        }
      },
      "analyzer": {
        "indonesian": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "indonesian_stop",
            "indonesian_keywords",
            "indonesian_stemmer"
          ]
        }
      }
    }
  }
}
6. portuguese analyzer
The portuguese analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "portuguese_stop": {
          "type": "stop",
          "stopwords": "_portuguese_"
        },
        "portuguese_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "portuguese_stemmer": {
          "type": "stemmer",
          "language": "light_portuguese"
        }
      },
      "analyzer": {
        "portuguese": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "portuguese_stop",
            "portuguese_keywords",
            "portuguese_stemmer"
          ]
        }
      }
    }
  }
}
7. thai analyzer
The thai analyzer could be reimplemented as a custom analyzer as follows:
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type": "stop",
          "stopwords": "_thai_"
        }
      },
      "analyzer": {
        "thai": {
          "tokenizer": "thai",
          "filter": [
            "lowercase",
            "thai_stop"
          ]
        }
      }
    }
  }
}