Benchmarking an Elasticsearch cluster is not for the faint of heart. Sure, benchmarking any distributed system is no joke. But when it comes to Elasticsearch, it is so easy to overlook details that could render your experiments uninformative at best and misleading at worst. This is a problem because empirical testing is the best way to validate your cluster architecture and index/query designs, and the only way to do meaningful capacity planning.
Over the years, I have gained some experience, read interesting articles, learned some lessons, and discussed ideas on how to benchmark Elasticsearch. I recently had the perfect opportunity to sort out all these bits and pieces, and write down some best practices for what I consider the building blocks of any Elasticsearch benchmarking experiment. You guessed it, that’s what I’d like to share with you in this post. But before we begin, let’s get one definition out of the way: by Elasticsearch benchmarking, I’m referring to the process of measuring the performance of an Elasticsearch cluster under controlled workloads and conditions.
If you’re anything like me, you don’t benchmark distributed systems for the fun of it. When you benchmark an Elasticsearch cluster, it is critical to define your objectives and acceptance criteria. A typical benchmarking objective is to test alternative settings, such as different cluster architectures, hardware profiles, index designs, or operations. Other common objectives are stress testing and capacity planning to determine the operational limits of your cluster and know when to scale up or out. I recommend pairing each objective with one or more acceptance criteria to quickly rule out any configurations that don’t meet your minimum quality expectations from the cluster. A useful acceptance criterion is that no search or index task is rejected from the Elasticsearch thread pools, or that a query response time must not worsen its average value in the live system.
This is the dedicated Elasticsearch environment for running benchmarking experiments. The environment configuration depends on your objectives. If you aim to test your production cluster’s limits in preparation for the next Black Friday, then your benchmarking environment should be as close as possible to production. But suppose your objective is to explore alternative index designs, data aggregations or ingestion strategies. In that case, a more modest benchmarking environment will give you valuable insights while saving you time and money.
Please, please, automate the provisioning of the benchmarking environment. You will reduce execution time, infrastructure costs, and accidental misconfigurations.
A workload model describes a benchmarking scenario, including the benchmark data to use, the list of Elasticsearch operations to reproduce, the load volume and behaviour. An important workload model is the baseline, usually representing the current Elasticsearch workload, and which you can use as a reference for all your experiments.
A few words about the benchmark data. A good practice is to use real data whenever this is possible (privacy and data protection compliance come first!); for example, by creating ad-hoc snapshots that would be restored in the benchmarking environment. I recommend using the same benchmark data while testing experiments with the same objective because this will make results more comparable. When your benchmark data includes time-series indices (e.g., metrics, logs), make sure to adjust the time range of your operations to the time window of the dataset. Nothing that you can’t easily script.
While most properties of the benchmarking environment and workload model are fixed, a few others, such as the number of data nodes, the heap size, the index mappings, or the number of shards can change across experiments. These variables allow you to test alternative solutions to meet your benchmarking objectives. Empirically.
At this point, I cannot stress enough the importance of changing only one configuration variable at a time between experiments. By doing otherwise, you won’t know what property impacted the cluster performance, and you’ll lose the ability to distinguish causality from correlation from random chance. I get it. Changing one variable at a time makes the benchmarking process slow. Your priority is to make the process accurate, though, and you can always speed it up elsewhere (have I mentioned automation already?).
To launch and control your workload model, you need a load generator. There are plenty out there. You can use HTTP load testing tools such as Locust, Apache JMeter, or Gatling, which can all communicate with the Elasticsearch REST API. Or you can give it a try with the same benchmarking tool developed and used by the Elastic folks: Rally.
Some recommendations on load generators:
- Deploy and launch them on machines with the same specs for all the experiments sharing the same benchmarking objectives.
- Warm up the benchmarking environment before collecting metrics to better reproduce a running cluster’s conditions and reduce cold-start artefacts.
- Keep an eye on the load generator’s performance (CPU, memory, network) and make sure it’s delivering the expected workload without errors.
Monitoring and Reporting
I like to start a benchmarking experiment by creating dedicated dashboards on Kibana, Grafana or any other monitoring tool available to me. The Elasticsearch metrics you should track depend on your objectives, but they typically include search/index throughput, latency, service time, number of rejected operations, CPU/memory/heap usage, and garbage collection stats. Error logs are also critical to understand the benchmarking results. Note that any good load generator can also export metrics and logs as tabular reports and charts.
A blueprint for benchmarking Elasticsearch
Let me summarise my talking points so far in the form of a general blueprint for benchmarking an Elasticsearch cluster.
- Define benchmarking objectives and acceptance criteria.
- Define the benchmarking environment.
- Define the Elasticsearch workload model.
a. If necessary, create the benchmark data.
- Specify the variables of the experiment configuration.
- Set up the load generator.
a. Set up the machine.
b. Specify the workload model in a format that the load generator can execute.
- Define the experiment metrics and set up monitoring and logging.
- Deploy the benchmarking environment with the experiment configuration to test.
a. Check that the cluster is up and running with the expected configuration.
b. Check that the benchmark data is present (unless done by the load generator).
- Start the load generator and run the experiment.
a. When the experiment is over, document all the results, logs and metrics, along with charts’ screenshots or links to dashboard archives.
b. Repeat the same experiment multiple times for statistical significance. You can clear the index cache between runs by using the clear cache API.
- Change one experiment parameter to benchmark a different configuration and go back to step 7 (if you need to redeploy the environment) or 8 (otherwise).
- Analyse the results of the experiments with respect to the benchmarking objectives.
The 10 steps above are painfully high-level. You can use them as a foundation for writing your own benchmarking runbook, specific to your actual environment (cloud? self-hosted?), tooling, and team setup.
Benchmarking Elasticsearch is a fascinating topic to me, and I’m planning to write more on it - I hope that’s good news for you. Next time, I’d like to get more hands-on and benchmark an Amazon Elasticsearch cluster using Rally. Until then, if you have follow-up questions, related problems, or comments, you are welcome to reach out to us at firstname.lastname@example.org.