Elasticsearch is a RESTful search engine built to search data flexibly within the Elastic Stack. Inside the stack, data is organized into indices and further broken down into shards. Shard size and index organization are crucial to search speed, yet it isn’t always easy to determine the optimal shard size, and organizing data into indices can be just as challenging. General guidelines exist for both tasks, but both are often application dependent.
Shards are the basic units that Elasticsearch uses to distribute data around a cluster. Both the number of shards and the indices used to group them can vary within a cluster. Most applications aim for a uniform shard size, which simplifies organization, but this is not always practical, because processing each shard carries its own overhead. Up to a certain point, spreading a query across many small shards makes it faster, since each small shard is quick to search. With large data sets, however, the overhead of processing many small shards can outweigh any performance gains their small size affords. The point at which this happens varies between applications, and identifying it is crucial to avoiding a sudden drop in search performance: users expect searches to be fast regardless of how much data they query.
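One way to watch for this tipping point is to check how large each shard actually is. The following is a minimal sketch assuming the official elasticsearch-py client and a cluster at a hypothetical local URL:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# List every shard with its document count and on-disk size, largest
# first, to spot shards that have drifted far from the target size.
print(es.cat.shards(v=True, h="index,shard,prirep,docs,store", s="store:desc"))
```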
Elasticsearch provides two APIs that can be used to manage shard size. The rollover index API switches an alias over to a freshly created index once the current index meets conditions such as a maximum number of documents, a maximum age, or a maximum size, which caps how much data is ever written into a single index. When documents are indexed, they are written into immutable segments on disk and only become searchable once those segments have been written and refreshed. The shrink index API reorganizes an existing index into fewer shards. No shard data is lost when this reorganization happens; removing data is handled by a separate mechanism. Together, the two APIs let data storage stay dynamic so that search overhead is kept to a minimum.
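As a concrete illustration, the sketch below drives both APIs through the elasticsearch-py client. The alias and index names, the rollover conditions, and the connection URL are all assumptions for illustration, not settings the article prescribes:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Roll the "logs-write" alias over to a new index once the current
# index meets any of these conditions.
es.indices.rollover(
    alias="logs-write",
    conditions={
        "max_docs": 50_000_000,
        "max_age": "7d",
        "max_primary_shard_size": "50gb",
    },
)

# Shrink an older index down to a single shard. The source index must
# be write-blocked (and, in practice, have a copy of every shard on
# one node) before it can be shrunk.
es.indices.add_block(index="logs-000001", block="write")
es.indices.shrink(
    index="logs-000001",
    target="logs-000001-shrunk",
    settings={"index.number_of_shards": 1},
)
```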
Every shard, regardless of size, carries overhead that must be held in memory: each shard consumes heap space on the node that hosts it. Resource handling is therefore essential when managing shards. Elasticsearch can perform many types of search, so shards must be organized in a way that lets any search succeed without excessive overhead. Some overhead is unavoidable, since the system must touch shards and indices to answer a query, but proper organization can mitigate most problems.
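That heap pressure can be observed directly. A small sketch, again assuming the elasticsearch-py client and a local cluster:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Print per-node JVM heap usage; a node carrying too many shards will
# show persistently high heap consumption.
stats = es.nodes.stats(metric="jvm")
for node in stats["nodes"].values():
    mem = node["jvm"]["mem"]
    print(node["name"], f"{mem['heap_used_percent']}% heap used")
```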
While it is not possible to plan in advance for a perfect shard size or a perfect organizational scheme, Elasticsearch does provide mechanisms to help maintain efficiency. A complete index can be deleted directly from the file system, which removes the need to delete all of its shards and their component records individually. Elasticsearch is also horizontally scalable and runs the same way on a single node as in a multinode cluster; it automatically manages how indices and queries are distributed, which can significantly reduce slowdowns.
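Deleting at the index level looks like this in the Python client; the index name is hypothetical, and in practice a retention policy would decide which indices to drop:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Removing a whole index drops its files from disk in one operation,
# which is far cheaper than deleting its documents one by one.
es.indices.delete(index="logs-000001")
```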
Since shards are generic data structures, Elasticsearch can be integrated into a variety of other data management programs, and Elasticsearch clients exist for many programming languages. This capacity for integration allows deeper analysis of data, including aggregations that surface trends and patterns. With properly defined shards, it’s possible to see not only a granular view of the data but the larger picture as well.
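For instance, a terms aggregation can summarize an entire index in one request. The sketch below assumes a hypothetical index with a keyword field named status:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# size=0 skips individual hits and returns only the aggregated view:
# a document count per distinct value of the "status" field.
resp = es.search(
    index="logs-000001",
    size=0,
    aggs={"by_status": {"terms": {"field": "status"}}},
)
for bucket in resp["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```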
There is no single right way to define shard size; the optimal size depends on the nature of the application. Some applications will require larger shards, while others will require smaller ones. When determining shard size, it’s best to test the system as much as possible. Running test searches and recording how long they take to complete is an effective way to monitor performance; if the system slows down, a reevaluation of shard size may be necessary.
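One simple way to benchmark is to repeat a representative query and record the took time, in milliseconds, that Elasticsearch reports with every response. The query and index name below are hypothetical:

```python
import statistics

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run the same search repeatedly and collect the server-side "took"
# timings; a rising median over time suggests shards need revisiting.
timings = []
for _ in range(20):
    resp = es.search(index="logs-000001", query={"match": {"message": "error"}})
    timings.append(resp["took"])

print(f"median: {statistics.median(timings)} ms, max: {max(timings)} ms")
```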
Managing shard size is not about achieving perfect organization but about making data management and search as efficient as possible. Users will often abandon a slow search, and slow searches can be a serious problem when dealing with time-sensitive information. Slow searches and mismanaged data can mean the difference between an application functioning well for a long time and being scrapped in favor of something better.
These Elasticsearch data management guidelines are a good starting point for data analysts and administrators. However, there may be times when using the flexibility Elasticsearch provides to address an application’s specific needs outweighs potential speed optimizations. Since Elasticsearch is highly customizable, users should be able to configure it to meet their specific needs while keeping these basic concepts in mind. Optimization in Elasticsearch is a matter of balancing search speed with data organization, an undertaking that must be as flexible as the searches it handles.
Weblink Technologies, a leader in Elasticsearch products, provides solutions based solely on Elasticsearch. As an Elastic partner and reseller, we have worked with many customers across the globe to provide expert consulting and implementation for Elasticsearch, Logstash, Kibana (ELK), and Beats. Whether you are using Elasticsearch for a web-facing application, your corporate intranet, or a search-powered big data analytics platform, our Elasticsearch experts deliver end-to-end services that support your search and analytics infrastructure, enabling you to maximize ROI.
Contact us at [email protected] to learn more about how we can help you leverage Elastic products for high-performing, easy-to-maintain, and scalable search and analytics solutions.