@suyuening 2017-02-15T01:27:48.000000Z

Elasticsearch Reference, Chinese Edition (v5.2)

Getting Started
- Basic Concepts
  - Near Realtime (NRT)
  - Cluster
  - Node
  - Index
  - Type
  - Document
  - Shards & Replicas

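Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.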

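Here are a few sample use cases that Elasticsearch could be used for:

- You run an online web store where you let your customers search for the products you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You want to collect log or transaction data, and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
- You run a price alerting platform that allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget, and I want to be notified if its price falls below $X from any vendor within the next month". In this case, you can scrape vendor prices, push them into Elasticsearch, and use its reverse-search (Percolator) capability to match price movements against customer queries, then push the alerts out to the customer once matches are found.
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that visualize the aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.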

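For the rest of this tutorial, I will guide you through getting Elasticsearch up and running, taking a peek inside it, and performing basic operations such as indexing, searching, and modifying your data. At the end of this tutorial, you should have a good idea of what Elasticsearch is and how it works, and hopefully be inspired to see how you can use it to build sophisticated search applications or to mine intelligence from your data.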

Basic Concepts

There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will greatly ease the learning process.

Near Realtime (NRT)

Elasticsearch is a near-real-time search platform. This means there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
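
To make the latency concrete, here is a toy in-memory model, not Elasticsearch code. It assumes a simplified picture in which newly indexed documents sit in a buffer and only become visible to search after a periodic "refresh" step; the class and method names are made up for illustration.

```python
class ToyNearRealtimeIndex:
    """Toy model only: documents become searchable after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        # Indexing only places the document in the buffer.
        self._buffer.append(doc)

    def refresh(self):
        # Stand-in for the periodic step (normally about once a second)
        # that makes recently indexed documents searchable.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        return [d for d in self._searchable if term in d.get("title", "")]


idx = ToyNearRealtimeIndex()
idx.index({"title": "quick brown fox"})
before = idx.search("fox")  # nothing is visible yet
idx.refresh()
after = idx.search("fox")   # the document is visible after the refresh
print(len(before), len(after))
```

The gap between `index` and `refresh` is the "slight latency" described above.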

Cluster

A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Make sure that you don’t reuse the same cluster name in different environments; otherwise you might end up with nodes joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.

Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start up a number of nodes on your network, and assuming they can discover each other, they will all automatically form and join a single cluster named elasticsearch.

In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

In a single cluster, you can define as many indexes as you want.

Type

Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.

Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.

Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
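
As a rough illustration of how index, type, and document fit together, the sketch below serializes a document to JSON and builds the `/index/type/id` path that addresses it over the REST API. The customer fields, the `customer` index, the `external` type, and the helper name `document_path` are all hypothetical; no request is actually sent.

```python
import json

# Hypothetical customer document; the field names are illustrative only.
customer = {"name": "John Doe", "age": 42}


def document_path(index, doc_type, doc_id):
    """Build the REST path that addresses one document: /index/type/id."""
    if index != index.lower():
        raise ValueError("index names must be all lowercase")
    return "/{}/{}/{}".format(index, doc_type, doc_id)


path = document_path("customer", "external", 1)
body = json.dumps(customer, sort_keys=True)
print("PUT", path)
print(body)
```

The path makes the point from the paragraph above visible: the document lives in an index, but it is always addressed through a type inside that index.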

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:

- It allows you to horizontally split/scale your content volume.
- It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput.

The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and are transparent to you as the user.
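
Although the distribution is handled for you, the underlying idea can be sketched in a few lines. Elasticsearch documents the routing rule as `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document ID. The sketch below uses `zlib.crc32` purely as a stand-in hash, so the shard numbers it produces will not match a real cluster; `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document ID) to a shard number."""
    # crc32 is a stand-in; it is NOT the hash Elasticsearch actually uses.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
for doc_id, shard in sorted(placement.items()):
    print("document", doc_id, "-> shard", shard)
```

Because the result depends on the number of primary shards, using a different shard count can send the same ID to a different shard, which is one way to see why that number is fixed once an index exists.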

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:

- It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
- It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the fact.
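
The following sketch shows what those two steps might look like as request bodies. It only builds the JSON; it does not talk to a cluster. The setting names `number_of_shards` and `number_of_replicas` are the real index settings, while the helper function names and the example values are made up for illustration.

```python
import json


def create_index_body(number_of_shards, number_of_replicas):
    """Body for `PUT /<index>`: both values are set at creation time."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def update_replicas_body(number_of_replicas):
    """Body for `PUT /<index>/_settings`: only the replica count changes."""
    return {"index": {"number_of_replicas": number_of_replicas}}


create_body = create_index_body(3, 2)
update_body = update_replicas_body(1)
print(json.dumps(create_body, indent=2))
print(json.dumps(update_body))
```

Note that the update body deliberately has no `number_of_shards` key, mirroring the rule that only the replica count can be changed on an existing index.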

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
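
The arithmetic behind these numbers can be written down directly. This is a simplified sketch: it assumes the only allocation constraint is the one stated earlier (a replica never shares a node with its primary, so each primary can have at most one copy per node) and ignores every other allocation rule. The function names are hypothetical.

```python
def total_shards(primaries, replicas):
    """Primary shards plus all of their replica copies."""
    return primaries * (1 + replicas)


def assignable_shards(primaries, replicas, nodes):
    """Shards that can actually be placed on the given number of nodes."""
    # Each primary and its replicas must all live on different nodes.
    copies_per_primary = min(1 + replicas, nodes)
    return primaries * copies_per_primary


print(total_shards(5, 1))          # 10
print(assignable_shards(5, 1, 1))  # 5: the replicas stay unassigned
print(assignable_shards(5, 1, 2))  # 10
```

With the defaults on a single node, only the 5 primaries can be placed, which is why the text says "at least two nodes" for the full 10 shards.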

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
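
As a quick check of that figure, the sketch below recomputes the limit and derives the smallest number of primary shards that could hold a given document count. It assumes documents are spread perfectly evenly and treats the limit as the only constraint, so the result is a hard lower bound rather than a sizing recommendation; `min_primary_shards` is a hypothetical helper.

```python
JAVA_INTEGER_MAX_VALUE = 2**31 - 1                  # 2,147,483,647
MAX_DOCS_PER_SHARD = JAVA_INTEGER_MAX_VALUE - 128   # 2,147,483,519


def min_primary_shards(total_docs):
    """Smallest shard count that keeps every shard within the Lucene limit."""
    # Integer ceiling division avoids floating-point rounding.
    needed = (total_docs + MAX_DOCS_PER_SHARD - 1) // MAX_DOCS_PER_SHARD
    return max(1, needed)


print(MAX_DOCS_PER_SHARD)
print(min_primary_shards(10**9))   # the billion-document example fits in 1
print(min_primary_shards(10**10))  # ten billion documents need at least 5
```

In practice, disk space and search latency usually call for more shards long before this document limit is reached.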

With that out of the way, let’s get started with the fun part…