@xtccc
2015-12-09T10:37:42.000000Z
字数 12807
阅读 2248
ElasticSearch
存储数据的过程,称为indexing。To index a document, we must tell ElasticSearch which type in the index it should go to.
WHAT HAPPENS WHEN YOU INDEX A DOCUMENT?
By default, when you index a document, it’s first sent to one of the primary shards, which is chosen based on a hash of the document’s ID. Then, the document is sent to be indexed in all of that primary shard’s replicas.
The Elasticsearch node that receives your indexing request first selects the shard to index the document to. By default, documents are distributed evenly between shards.
下面我们在megacorp
这个index中创建一个文档,令其type为employee
,ID为1
:
[root@ecs1 elasticsearch-1.7.2]# curl -XPUT localhost:9200/megacorp/employee/1?pretty -d '
> {
> "firstname" : "Tao",
> "lastname" : "Xiao",
> "age" : 30,
> "about" : "Hello, my wife is CCC",
> "interests" : ["coding", "jogging"]
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_version" : 1,
"created" : true
}
在indexing document的过程中,尚不存在的index、type都会被自动创建。
从上面的PUT Request的返回响应来看,它的返回包含了被indexed的文档的index、type、id、version以及created,其中,version
指的是该文档的版本号。每当一个document发生变化时(包括DELETE),它的metadata中的version
就会自增1;而created
为false,则表明这个文档是第一次被创建。
WHAT HAPPENS WHEN YOU SEARCH AN INDEX?
When you search an index, Elasticsearch has to look in a complete set of shards for that index . Those shards can be either primary or replicas because primary and replica shards typically contain the same documents. Elasticsearch distributes the search load between the primary and replica shards of the index you’re searching, making replicas useful for both search performance and fault tolerance.
Elasticsearch uses a round-robin format to forward the request to the cluster’s nodes and shards. As shown in figure 2.9, Elasticsearch then gathers results from those shards, aggregates them into a single reply, and forwards the reply back to the client application.
在全部的文档中查询
curl localhost:9200/_search
在gb这个index中查询
curl localhost:9200/gb/_search
在以g开头和以u开头这两个index中查询
curl localhost:9200/g*,u*/_search
在gb和us这两个index中查询
curl localhost:9200/gb,us/_search
在全局范围的user和twitter这两个index中查询
curl localhost:9200/_all/user,twitter/_search
[root@ecs1 elasticsearch-1.7.2]# curl -i localhost:9200/megacorp/employee/1?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 283
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source":
{
"firstname" : "Tao",
"lastname" : "Xiao",
"age" : 30,
"about" : "Hello, my wife is CCC",
"interests" : ["coding", "jogging"]
}
}
可以看到,response返回了该文档的metadata,并通过_source
域返回了 full JSON document。
如果希望只返回文档,而不返回关于该文档的metadata,则可以如下查询:
curl localhost:9200/megacorp/employee/1/_source?pretty
curl localhost:9200/megacorp/employee/1?_source=firstname,interests
[root@ecs1 tmp]# curl localhost:9200/_mget?pretty -d '
> {
> "docs" : [
> {
> "_index" : "megacorp",
> "_type" : "employee",
> "_id" : 2
> },
> {
> "_index" : "megacorp",
> "_type" : "employee",
> "_id" : 3
> }
> ]
> } '
{
"docs" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_version" : 1,
"found" : true,
"_source": {
"firstname" : "Jack",
"lastname" : "Chen",
"age" : 40,
"about" : "Hello, I am Jack",
"interests" : ["sports", "music"]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "3",
"_version" : 1,
"found" : true,
"_source": {
"firstname" : "Lucy",
"lastname" : "Liu",
"age" : 50,
"about" : "Hello, I am Lucy",
"interests" : ["tv", "talking"]
}
} ]
}
如果要获取的多个文档在同一个index和type中,则可以如下构造请求:
[root@ecs1 tmp]# curl localhost:9200/megacorp/employee/_mget -d '
> {
> "docs" : [
> { "_id" : 2 },
> { "_id" : 3 }
> ]
> }'
在默认情况下,一次查询会返回符合条件的前10个文档,并且这10个文档是按照_score
排序的。我们可以通过参数size
来控制一次返回多少个文档,同时通过参数from
控制返回文档时跳过前面多少个文档。
ES中的文档是不可变的(immutable),因此如果要更新一个文档,只能replace it。例如,这里我们更新坐标为megacorp/employee/1的文档:
[root@ecs1 ~]# curl -XPUT localhost:9200/megacorp/employee/1?pretty -d '
> {
> "age" : 31,
> "about" : "I amm updated"
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_version" : 2,
"created" : false
}
可见:
_version
增1,且created
由 true变为了false通过_update
可以实现,但是本质上依然遵循了retrieve -> change -> reindex
的路线。
通过 partial update,可以增加新的fields,也可以修改已有的fields,如下:
[root@ecs1 ~]# curl localhost:9200/megacorp/employee/8?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 20,
"found" : true,
"_source": {
"name" : "Tao",
"city" : "NJ"
}
}
[root@ecs1 ~]# curl localhost:9200/megacorp/employee/8/_update?pretty -d '
> {
> "doc" : {
> "tags" : ["A", "B"],
> "gender" : "male",
> "city" : "Taixing"
> }
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 21
}
[root@ecs1 ~]# curl localhost:9200/megacorp/employee/8?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 21,
"found" : true,
"_source":{
"name":"Tao",
"city":"Taixing",
"tags":["A","B"],
"gender":"male"
}
}
如果在index document时,希望做到:只有当该文档尚未存在(即同样的 index/type/id)时,才能index新的文档;如果已存在,就不能将旧的文档覆盖。
方法1:增加_create
[root@ecs1 ~]# curl -i -XPUT localhost:9200/megacorp/employee/1/_create -d '
> {
> "age" : 31,
> "about" : "I amm updated"
> }'
HTTP/1.1 409 Conflict
Content-Type: application/json; charset=UTF-8
Content-Length: 109
{"error":"DocumentAlreadyExistsException[[megacorp][4] [employee][5]: document already exists]","status":409}
方法2:增加op_type=create
[root@ecs1 ~]# curl -i -XPUT localhost:9200/megacorp/employee/1?op_type=create -d '
> {
> "age" : 31,
> "about" : "I amm updated"
> }'
HTTP/1.1 409 Conflict
Content-Type: application/json; charset=UTF-8
Content-Length: 109
{"error":"DocumentAlreadyExistsException[[megacorp][6] [employee][7]: document already exists]","status":409}
用动词DELETE
来删除一个指定的文档
[root@ecs1 elasticsearch-1.7.2]# curl -i -XDELETE localhost:9200/megacorp/employee/1?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 103
{
"found" : true,
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_version" : 2
}
[root@ecs1 elasticsearch-1.7.2]# curl -i -XHEAD localhost:9200/megacorp/employee/1?pretty
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
注意这里:删除后该文档的_version
加1了,即使要删除的文档不存在,_version
还是会加1。
用动词HEAD
来检测指定的文档是否存在
[root@ecs1 elasticsearch-1.7.2]# curl -i -XHEAD localhost:9200/megacorp/employee/1?pretty
HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
HEAD Request不返回body,只返回HTTP header
上面,我们指定了ID,也可以不指定ID,而是让ES随机生成一个ID,如下:
[root@ecs1 elasticsearch-1.7.2]# curl -XPOST localhost:9200/megacorp/employee?pretty -d '
> {
> "firstname" : "Tao",
> "lastname" : "Xiao",
> "age" : 30,
> "about" : "Hello, my wife is CCC",
> "interests" : ["coding", "jogging"]
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "AVCE4xivMv8zm4P4wh-e",
"_version" : 1,
"created" : true
}
注意,此时的 -XPUT
要换成 -XPOST
。ES生成的随机ID是一个: 22-character long, URL-safe, Base64-encoded string universally unique identifier, or UUID。
In the database world, two approaches are commonly used to ensure that changes are not lost when making concurrent updates:
Pessimistic concurrency control
Widely used by relational databases, this approach assumes that conflicting changes are likely to happen and so blocks access to a resource in order to prevent conflicts. A typical example is locking a row before reading its data, ensuring that only the thread that placed the lock is able to make changes to the data in that row.
Optimistic concurrency control
Used by Elasticsearch, this approach assumes that conflicts are unlikely to happen and doesn’t block operations from being attempted. However, if the underlying data has been modified between reading and writing, the update will fail. It is then up to the application to decide how it should resolve the conflict. For instance, it could reattempt the update, using the fresh data, or it could report the situation to the user.
参考 Optimistic Concurrency Control
UPDATE
和DELETE
类的API都可以通过version
参数来实现乐观锁。假设我们新增了一个文档:
[root@ecs1 ~]# curl -XPUT localhost:9200/megacorp/employee/7?pretty -d '
> {
> "name" : "Tao",
> "city" : "NJ"
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "7",
"_version" : 1,
"created" : true
}
可见这个文档的version
为1。
下面我们去更新这个文档,为了防止在更新该文档时,它已经被另外一个用户更新了,我们在更新时指定文档的version
为1,如下:
[root@ecs1 ~]# curl -XPUT localhost:9200/megacorp/employee/7?version=1 -d '
> {
> "name" : "Tao",
> "city" : "SH"
> }'
{"_index":"megacorp","_type":"employee","_id":"7","_version":2,"created":false}
更新成功!且文档的version现在变为了2。
如果现在我们仍然指定version=1去更新该文档,则会失败:
[root@ecs1 ~]# curl -XPUT localhost:9200/megacorp/employee/7?version=1 -d '
> {
> "name" : "Tao",
> "city" : "BJ"
> }'
{"error":"VersionConflictEngineException[[megacorp][3] [employee][7]: version conflict, current [2], provided [1]]","status":409}
有时候,我们会将外部存储系统(例如MySQL)的数据导入到ES中,那我们就可以将这些数据的一些属性(例如时间戳)作为ES中文档的Version Number。此时,我们可以通过参数version_type=external
来声明,如下:
[root@ecs1 ~]# curl -XPUT "localhost:9200/megacorp/employee/8?version=10&version_type=external" -d '
> {
> "name" : "Tao",
> "city" : "BJ"
> }'
{"_index":"megacorp","_type":"employee","_id":"8","_version":10,"created":true}
对于external version number,在更新同一个文档时,指定的version number必须大于该文档原来的version number(不必严格等于),且在范围(0, 9.2e+18)内。下面验证:
[root@ecs1 ~]# curl -XPUT "localhost:9200/megacorp/employee/8?version=20&version_type=external" -d '
> {
> "name" : "Tao",
> "city" : "NJ"
> }'
{"_index":"megacorp","_type":"employee","_id":"8","_version":20,"created":false}
如果通过在API中指定version的方式,在更新文档时发现读写不一致,则更新操作会失败。我们可以指定其失败后重试的次数,来实现自动地重试。
这个参数是retry_on_conflict
,其默认值为0。
[root@ecs1 ~]# curl -XPOST localhost:9200/megacorp/employee/7/_update?retry_on_conflict=6 -d '
{
"script" : "ctx._source.city=NY"
}'
{
"_index":"megacorp",
"_type":"employee",
"_id":"7",
"_version":3
}
首先,新增一个文档:
[root@ecs1 elasticsearch-1.7.2]# curl -XPUT localhost:9200/megacorp/employee/8?pretty -d '
> {
> "tags" : ["A", "B"],
> "gender" : "male",
> "city" : "Taixing",
> "age" : 30
> }'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 32,
"created" : true
}
现在,更新一个整型field(age
)的值,将其加1:
curl -XPOST localhost:9200/megacorp/employee/8/_update -d '
> {
> "script" : "ctx._source.age+=1"
> }'
看看这个值变了没有:
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/8?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 33,
"found" : true,
"_source":{
"tags":["A","B"],
"gender":"male",
"city":"Taixing",
"age":31}
}
再对一个数组field(tags
)附加一个元素
[root@ecs1 elasticsearch-1.7.2]# curl -XPOST localhost:9200/megacorp/employee/8/_update -d '
{
"script" : "ctx._source.tags+=new_tag",
> "params" : {
> "new_tag" : "C"
> }
> }'
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/8?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "8",
"_version" : 34,
"found" : true,
"_source":{
"tags":["A","B","C"],
"gender":"male",
"city":"Taixing",
"age":31}
}
我们希望删除age
为31的文档:
[root@ecs1 elasticsearch-1.7.2]# curl -XPOST localhost:9200/megacorp/employee/8/_update -d '
> {
> "script" : "ctx.op = ctx._source.age == target_age ? 'delete' : 'none'",
> "params" : {
> "target_age" : 31
> }
> }'
但是这样会报错:
{
"error":"ElasticsearchIllegalArgumentException[failed to execute script];
nested: GroovyScriptExecutionException[MissingPropertyException[No such property: delete for class: da51a7f9a0e50de83432eb2d5f50321c8bce1178]]; ",
"status":400
}
Why?
考虑这样的一个场景:将一个文档的views
域加1,若这个文档还不存在,则创建该文档,并令其views
域的值为1。
[root@ecs1 ~]# curl -XPOST localhost:9200/megacorp/employee/9/_update -d '
> {
> "script" : "ctx._source.views+=1",
> "upsert" : {
> "views" : 1
> }
> }'
{
"_index":"megacorp",
"_type":"employee",
"_id":"9",
"_version":1
}
[root@ecs1 ~]# curl localhost:9200/megacorp/employee/9?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "9",
"_version" : 1,
"found" : true,
"_source":{"views":1}
}
[root@ecs1 ~]# curl -XPOST localhost:9200/megacorp/employee/9/_update -d '
> {
> "script" : "ctx._source.views+=1",
> "upsert" : {
> "views" : 1
> }
> }'
{
"_index":"megacorp",
"_type":"employee",
"_id":"9",
"_version":2
}
[root@ecs1 ~]# curl localhost:9200/megacorp/employee/9?pretty
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "9",
"_version" : 2,
"found" : true,
"_source":{"views":2}
}
下面,我们要在一个type的范围内对文档进行Query,首先我们在上述的type内加入3个document:
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/1?pretty -d '
> {
> "firstname" : "Tao",
> "lastname" : "Xiao",
> "age" : 30,
> "about" : "Hello, my wife is CCC",
> "interests" : ["coding", "jogging"]
> }'
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/2?pretty -d '
> {
> "firstname" : "Jack",
> "lastname" : "Chen",
> "age" : 40,
> "about" : "Hello, I am Jack",
> "interests" : ["sports", "music"]
> }'
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/3?pretty -d '
> {
> "firstname" : "Lucy",
> "lastname" : "Liu",
> "age" : 50,
> "about" : "Hello, I am Lucy",
> "interests" : ["tv", "talking"]
> }'
在该type范围内search全部的文档:
[root@ecs1 elasticsearch-1.7.2]# curl localhost:9200/megacorp/employee/_search?pretty
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 1.0,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 1.0,
"_source":
{
"firstname" : "Tao",
"lastname" : "Xiao",
"age" : 30,
"about" : "Hello, my wife is CCC",
"interests" : ["coding", "jogging"]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "AVCE4xivMv8zm4P4wh-e",
"_score" : 1.0,
"_source":
{
"firstname" : "Tao",
"lastname" : "Xiao",
"age" : 30,
"about" : "Hello, my wife is CCC",
"interests" : ["coding", "jogging"]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 1.0,
"_source":
{
"firstname" : "Jack",
"lastname" : "Chen",
"age" : 40,
"about" : "Hello, I am Jack",
"interests" : ["sports", "music"]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "3",
"_score" : 1.0,
"_source":
{
"firstname" : "Lucy",
"lastname" : "Liu",
"age" : 50,
"about" : "Hello, I am Lucy",
"interests" : ["tv", "talking"]
}
} ]
}
}
可见,response通过hits
域返回了满足search条件的全部3个文档,以及这些文档自身。
默认情况下,一次search会返回前10个结果。