Elasticsearch is much more than just a search engine; it supports complex aggregations, geo filters, and the list goes on. Best of all, you can run all your queries at a speed you have never seen before. Elasticsearch, like any other open source technology, is very rapidly evolving, but the core fundamentals that power Elasticsearch don’t change.
In this article, we will briefly discuss how Elasticsearch works internally and explain the basic query APIs. All the data in Elasticsearch is internally stored in Apache Lucene as an inverted index. Although data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs.
This Elasticsearch tutorial is an excerpt taken from the book,’Learning Elasticsearch‘ written by Abhishek Andhavarapu.
Inverted index in Elasticsearch
Inverted index will help you understand the limitations and strengths of Elasticsearch compared with the traditional database systems out there. Inverted index at the core is how Elasticsearch is different from other NoSQL stores, such as MongoDB, Cassandra, and so on.
We can compare an inverted index to an old library catalog card system. When you need some information/book in a library, you will use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear.
Document1: Fear leads to anger
Document2: Anger leads to hate
Document3: Hate leads to suffering
In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it’s the book you need. Computer-based information retrieval systems do the same.
Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as a key and list of the documents the term appears in as value.
Once we construct an index, as shown in this table, to find all the documents with the term fear is now just a lookup. Just like when a library gets a new book, the book is added to the card catalog, we keep building an inverted index as we encounter a new web page. The preceding inverted index takes care of simple use cases, such as searching for the single term. But in reality, we query for much more complicated things, and we don’t use the exact words. Now let’s say we encountered a document containing the following:
Yosemite national park may be closed for the weekend due to forecast of substantial rainfall
We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we query for it in the human language, we might query something like weather in yosemite or rain in yosemite. With the current approach, we will not be able to answer this query as there are no common terms between the query and the document, as shown:
To be able to answer queries like this and to improve the search quality, we employ various techniques such as stemming, synonyms discussed in the following sections.
Stemming is the process of reducing a derived word into its root word. For example, rain, raining, rained, rainfall has the common root word “rain”. When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we end up storing rain, raining, rained in the index, and search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of the user finding what he is looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain term rain. We can configure stemming in Elasticsearch using Analyzers.
Similar to rain and raining, weekend and sunday mean the same thing. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, numbers. Stemming and synonyms will not only improve the search quality but also reduce the index size by removing the differences between similar words.
Pen, Pen[s] -> Pen
Eat, Eating -> Eat
As a user, we almost always search for phrases rather than single words. The inverted index in the previous section would work great for individual terms but not for phrases. Continuing the previous example, if we want to query all the documents with a phrase anger leads to in the inverted index, the previous index would not be sufficient. The inverted index for terms anger and leads is shown below:
From the preceding table, the words anger and leads exist both in document1 and document2. To support phrase search along with the document, we also need to record the position of the word in the document. The inverted index with word position is shown here:
1:2, 2:2, 3:2
Now, since we have the information regarding the position of the word, we can search if a document has the terms in the same order as the query.
Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to inverted index; in real life, it’s much more complicated, but the fundamentals remain the same. When the documents are indexed into Elasticsearch, documents are processed into the inverted index.
Scalability and availability in Elasticsearch
Let’s say you want to index a billion documents; having just a single machine might be very challenging. Partitioning data across multiple machines allows Elasticsearch to scale beyond what a single machine do and support high throughput operations. Your data is split into small parts called shards. When you create an index, you need to tell Elasticsearch the number of shards you want for the index and Elasticsearch handles the rest for you. As you have more data, you can scale horizontally by adding more machines. We will go in to more details in the sections below.
There are type of shards in Elasticsearch – primary and replica. The data you index is written to both primary and replica shards. Replica is the exact copy of the primary. In case of the node containing the primary shard goes down, the replica takes over. This process is completely transparent and managed by Elasticsearch. We will discuss this in detail in the Failure Handling section below. Since primary and replicas are the exact copies, a search query can be answered by either the primary or the replica shard. This significantly increases the number of simultaneous requests Elasticsearch can handle at any point in time.
As the index is distributed across multiple shards, a query against an index is executed in parallel across all the shards. The results from each shard are then gathered and sent back to the client. Executing the query in parallel greatly improves the search performance.
Now, we will discuss the relation between node, index and shard.
Relation between node, index, and shard
Shard is often the most confusing topic when I talk about Elasticsearch at conferences or to someone who has never worked on Elasticsearch. In this section, I want to focus on the relation between node, index, and shard. We will use a cluster with three nodes and create the same index with multiple shard configuration, and we will talk through the differences.
Three shards with zero replicas
We will start with an index called esintroduction with three shards and zero replicas. The distribution of the shards in a three node cluster is as follows:
In the above screenshot, shards are represented by the green squares. We will talk about replicas towards the end of this discussion. Since we have three nodes(servers) and three shards, the shards are evenly distributed across all three nodes. Each node will contain one shard. As you index your documents into the esintroduction index, data is spread across the three shards.
Six shards with zero replicas
Now, let’s recreate the same esintroduction index with six shards and zero replicas. Since we have three nodes (servers) and six shards, each node will now contain two shards. The esintroduction index is split between six shards across three nodes.
The distribution of shards for an index with six shards is as follows:
The esintroduction index is spread across three nodes, meaning these three nodes will handle the index/query requests for the index. If these three nodes are not able to keep up with the indexing/search load, we can scale the esintroduction index by adding more nodes. Since the index has six shards, you could add three more nodes, and Elasticsearch automatically rearranges the shards across all six nodes. Now, index/query requests for the esintroduction index will be handled by six nodes instead of three nodes. If this is not clear, do not worry, we will discuss more about this as we progress in the book.
Six shards with one replica
Let’s now recreate the same esintroduction index with six shards and one replica, meaning the index will have 6 primary shards and 6 replica shards, a total of 12 shards. Since we have three nodes (servers) and twelve shards, each node will now contain four shards. The esintroduction index is split between six shards across three nodes. The green squares represent shards in the following figure.
The solid border represents primary shards, and replicas are the dotted squares:
As we discussed before, the index is distributed into multiple shards across multiple nodes. In a distributed environment, a node/server can go down due to various reasons, such as disk failure, network issue, and so on. To ensure availability, each shard, by default, is replicated to a node other than where the primary shard exists. If the node containing the primary shard goes down, the shard replica is promoted to primary, and the data is not lost, and you can continue to operate on the index.
In the preceding figure, the esintroduction index has six shards split across the three nodes. The primary of shard 2 belongs to node elasticsearch 1, and the replica of the shard 2 belongs to node elasticsearch 3. In the case of the elasticsearch 1 node going down, the replica in elasticsearch 3 is promoted to primary. This switch is completely transparent and handled by Elasticsearch.
One of the reasons queries executed on Elasticsearch are so fast is because they are distributed. Multiple shards act as one index. A search query on an index is executed in parallel across all the shards.
Let’s take an example: in the following figure, we have a cluster with two nodes: Node1, Node2 and an index named chapter1 with two shards: S0, S1 with one replica:
Assuming the chapter1 index has 100 documents, S1 would have 50 documents, and S0 would have 50 documents. And you want to query for all the documents that contain the word Elasticsearch. The query is executed on S0 and S1 in parallel. The results are gathered back from both the shards and sent back to the client. Imagine, you have to query across million of documents, using Elasticsearch the search can be distributed. For the application I’m currently working on, a query on more than 100 million documents comes back within 50 milliseconds; which is simply not possible if the search is not distributed.
Failure handling in Elasticsearch
Elasticsearch handles failures automatically. This section describes how the failures are handled internally. Let’s say we have an index with two shards and one replica. In the following diagram, the shards represented in solid line are primary shards, and the shards in the dotted line are replicas:
As shown in preceding diagram, we initially have a cluster with two nodes. Since the index has two shards and one replica, shards are distributed across the two nodes. To ensure availability, primary and replica shards never exist in the same node. If the node containing both primary and replica shards goes down, the data cannot be recovered. In the preceding diagram, you can see that the primary shard S0 belongs to Node 1 and the replica shard S0 to the Node 2.
Next, just like we discussed in the Relation between Node, Index and Shard section, we will add two new nodes to the existing cluster, as shown here:
The cluster now contains four nodes, and the shards are automatically allocated to the new nodes. Each node in the cluster will now contain either a primary or replica shard. Now, let’s say Node2, which contains the primary shard S1, goes down as shown here:
Since the node that holds the primary shard went down, the replica of S1, which lives in Node3, is promoted to primary. To ensure the replication factor of 1, a copy of the shard S1 is made on Node1. This process is known as rebalancing of the cluster.
Depending on the application, the number of shards can be configured while creating the index. The process of rebalancing the shards to other nodes is entirely transparent to the user and handled automatically by Elasticsearch.
We discussed inverted indexes, relation between nodes, index and shard, distributed search and how failures are handled automatically in Elasticsearch.
Check out this book, ‘Learning Elasticsearch‘ to know about handling document relationships, working with geospatial data, and much more.