@xtccc 2015-12-08T09:52:16.000000Z

Mapping, Index and Analyzers


+ Mapping
+ Data In, Data Out
+ Analysis and Analyzers



Types are also called mapping types because they’re typically used as containers for different types of documents—documents with different structures. The definition of fields in each type is called a mapping.

The mapping is created automatically when you index a new document, and Elasticsearch detects the type of each field. If you later add a document that contains a new field, Elasticsearch guesses that field's type as well and appends the field to the mapping.
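
To make the idea of a dynamically growing mapping concrete, here is a minimal Python sketch. It is a toy model, not Elasticsearch's actual detection logic: the functions `guess_field_type` and `update_mapping` are hypothetical names, the rules cover only a few JSON value kinds, and the sample documents are made up.

```python
def guess_field_type(value):
    """Guess a mapping type for a JSON value (toy rules, not the real detection)."""
    # bool must be tested before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"unsupported value: {value!r}")


def update_mapping(mapping, document):
    """Append fields of `document` that are not in `mapping` yet; known fields keep their type."""
    for field, value in document.items():
        if field not in mapping:
            mapping[field] = {"type": guess_field_type(value)}
    return mapping


mapping = {}
# First document: both fields are new, so both are added to the mapping
update_mapping(mapping, {"name": "Search Group", "organizer": "Lee"})
# Second document: only the unseen field "members" is appended
update_mapping(mapping, {"name": "Hadoop Users", "members": 42})
print(mapping)
```

The point of the sketch is that the mapping only ever grows: a field seen once keeps the type that was guessed for it the first time.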


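An explicit mapping is defined at the index/type level. By default we do not need to define one, because ES creates a mapping for us behind the scenes, but we can also override the default mapping that ES creates.

With the Put Mapping API we can create a mapping. Mappings can also be added when an index is created.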


    $ curl 'ecs1:9200/get-together/group/_mapping?pretty'
    {
      "get-together" : {
        "mappings" : {
          "group" : {
            "properties" : {
              "name" : {
                "type" : "string"
              },
              "organizer" : {
                "type" : "string"
              }
            }
          }
        }
      }
    }


In Elasticsearch, all data in every field is indexed by default. In other words, every field has its own dedicated inverted index.
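
A rough way to picture "one inverted index per field" is the Python sketch below. It is only an illustration under simplifying assumptions: the analysis step is a plain lowercase-and-split, `build_inverted_indexes` is a hypothetical helper, and the sample documents are made up.

```python
from collections import defaultdict


def build_inverted_indexes(docs):
    """Return {field: {term: sorted list of doc ids}}, one inverted index per field."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            # Toy analysis: lowercase, then split on whitespace
            for term in text.lower().split():
                indexes[field][term].add(doc_id)
    # Convert the nested defaultdicts and sets into plain dicts and sorted lists
    return {
        field: {term: sorted(ids) for term, ids in terms.items()}
        for field, terms in indexes.items()
    }


docs = {
    1: {"name": "Denver Search Group", "organizer": "Lee"},
    2: {"name": "Denver Hadoop Users", "organizer": "Andy"},
}
indexes = build_inverted_indexes(docs)
print(indexes["name"]["denver"])   # ids of the documents whose name contains "denver"
print(indexes["organizer"])        # a separate term dictionary for the organizer field
```

Because each field keeps its own term dictionary, a term that occurs only in `organizer` is never found when searching `name`.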

A document can be expressed as a JSON object.

_index: where the document lives; comparable to a database in a traditional RDBMS.

_type: the class of object that the document represents; comparable to a table in a traditional RDBMS. Each type has its own mapping (schema definition), which defines the data structure of documents of that type.

_id: the unique identifier of the document.


Create an Index


For other index settings, see Index Modules.


Creating a Mapping


    curl -XPOST 'localhost:9200/twitter?pretty' -d '{
      "settings" : {
        "number_of_shards" : 1
      },
      "mappings" : {
        "type_1" : {
          "_source" : { "enabled" : false },
          "properties" : {
            "field_1" : { "type" : "string", "index" : "not_analyzed" }
          }
        }
      }
    }'


    # Create the index first, then add a mapping for the tweet type
    curl -XPUT 'localhost:9200/twitter?pretty'
    curl -XPUT 'localhost:9200/twitter/_mapping/tweet' -d '
    {
      "tweet" : {
        "properties" : {
          "message" : { "type" : "string", "store" : true }
        }
      }
    }'

The example above creates a mapping named tweet in the twitter index. The mapping specifies that the message field should be stored (by default, fields are not stored, just indexed).
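
When mappings are generated from a script rather than typed by hand, it can help to assemble the request pieces programmatically. The sketch below only builds the method, path, and JSON body of the Put Mapping request above and sends nothing over the network. `build_put_mapping_request` is a hypothetical helper, and the URL layout is simply copied from the curl command, so treat it as an assumption tied to that Elasticsearch version.

```python
import json


def build_put_mapping_request(index, doc_type, properties):
    """Return (method, path, body) for a Put Mapping call; nothing is sent."""
    path = f"/{index}/_mapping/{doc_type}"
    # The body nests the field definitions under the type name, as in the curl example
    body = json.dumps({doc_type: {"properties": properties}}, indent=2)
    return "PUT", path, body


method, path, body = build_put_mapping_request(
    "twitter", "tweet", {"message": {"type": "string", "store": True}}
)
print(method, path)
print(body)
```

The resulting body can be passed to curl with `-d` or to any HTTP client.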

Retrieving a Mapping

    curl -XGET 'localhost:9200/twitter/_mapping/tweet?pretty'

Analysis and Analyzers


Analysis is a process that:

  • tokenizes a block of text into individual terms
  • normalizes those terms into a standard form


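Character filter: filters the raw string before tokenization, for example by converting & to and.

Tokenizer: splits the string into individual terms.

Token filter: filters the terms produced by the tokenizer, for example by lowercasing them or removing stop words.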
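
The three stages can be sketched as a small Python pipeline. This is a toy model, not Elasticsearch's implementation: the function names are hypothetical, the tokenizer is a simple regular expression, and the stop-word list is a tiny illustrative one.

```python
import re

STOP_WORDS = {"a", "an", "the", "is"}  # tiny illustrative list, not Elasticsearch's


def char_filter(text):
    """Character filter: rewrite the raw string, here '&' becomes 'and'."""
    return text.replace("&", " and ")


def tokenizer(text):
    """Tokenizer: split the string into terms on runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Token filters: lowercase every term, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order: character filter, tokenizer, token filters."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("Tom & Jerry is a Classic"))
```

The order matters: the character filter sees the whole string, the tokenizer sees the filtered string, and the token filters see only individual terms.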

· Standard analyzer
· Simple analyzer
· Whitespace analyzer
· Language analyzer

Among these, the language analyzers can tokenize text in many languages, including Chinese.


When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.

Full-text queries understand how each field is defined, so they can do the right thing:

When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.

When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.
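
The difference between the two cases can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the field names `title` and `tag`, the `FIELDS` table, and the lowercase-and-split analyzer are all hypothetical, and real queries do far more than a term lookup.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Hypothetical field definitions: one full-text field, one exact-value field
FIELDS = {"title": "analyzed", "tag": "not_analyzed"}


def index_terms(field, value):
    """Terms written to the inverted index for a field value."""
    return analyze(value) if FIELDS[field] == "analyzed" else [value]


def query_terms(field, query):
    """Terms looked up at search time: analyzed only for full-text fields."""
    return analyze(query) if FIELDS[field] == "analyzed" else [query]


def matches(field, value, query):
    """True when at least one query term exists among the indexed terms."""
    indexed = set(index_terms(field, value))
    return any(term in indexed for term in query_terms(field, query))


# Full-text field: the query goes through the same analyzer as the indexed text
print(matches("title", "Quick Brown Fox", "QUICK"))
# Exact-value field: the query is compared as-is with the stored value
print(matches("tag", "Quick Brown Fox", "quick"))
print(matches("tag", "Quick Brown Fox", "Quick Brown Fox"))
```

In this model the full-text lookup succeeds despite the different casing, while the exact-value lookup succeeds only when the query equals the whole original value.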

What are full-text fields and exact-value fields?


Below, we use the standard analyzer to tokenize a sentence:

1. Without specifying an index

    curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'hello 100, this is me!'
    --> returns the following result
    {
      "tokens" : [ {
        "token" : "hello",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "<ALPHANUM>",
        "position" : 1
      }, {
        "token" : "100",
        "start_offset" : 6,
        "end_offset" : 9,
        "type" : "<NUM>",
        "position" : 2
      }, {
        "token" : "this",
        "start_offset" : 11,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 3
      }, {
        "token" : "is",
        "start_offset" : 16,
        "end_offset" : 18,
        "type" : "<ALPHANUM>",
        "position" : 4
      }, {
        "token" : "me",
        "start_offset" : 19,
        "end_offset" : 21,
        "type" : "<ALPHANUM>",
        "position" : 5
      } ]
    }
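
To see where the offsets and positions in this response come from, here is a toy re-implementation in Python. It is not the real standard tokenizer: `toy_standard_analyze` is a hypothetical function that treats every run of ASCII letters or digits as a token, which happens to be enough for this sentence.

```python
import re


def toy_standard_analyze(text):
    """Return token dicts shaped like the _analyze response (toy tokenizer only)."""
    tokens = []
    # Each run of letters/digits is one token; positions start at 1 as in the response
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text), start=1):
        term = match.group()
        tokens.append({
            "token": term.lower(),
            "start_offset": match.start(),   # index of the first character
            "end_offset": match.end(),       # index just past the last character
            "type": "<NUM>" if term.isdigit() else "<ALPHANUM>",
            "position": position,
        })
    return {"tokens": tokens}


result = toy_standard_analyze("hello 100, this is me!")
for token in result["tokens"]:
    print(token)
```

The offsets are character indexes into the original string, which is why the comma and the exclamation mark leave gaps between consecutive tokens.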

2. Analyzing HTML

    curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip&pretty' -d 'this is a <b>test</b>'
    --> returns the following result
    {
      "tokens" : [ {
        "token" : "this is a test",
        "start_offset" : 0,
        "end_offset" : 21,
        "type" : "word",
        "position" : 1
      } ]
    }

3. Analyzing against a specific index

    curl -XGET 'localhost:9200/test/_analyze?text=100+this+is+a+test+!'

Of course, this requires that the test index exists and that it has an analyzer (every index has a default analyzer).
