Level Up your Amazon OpenSearch Service and Keep Costs Down

How we helped a MarTech company to save costs and improve operational efficiency with Amazon OpenSearch Service
08.12.2021

The Project

The success of a marketing program depends on continuous analysis of performance metrics to make informed, data-driven decisions about partnerships and campaigns. When you are one of the largest global players in Digital Marketing, you need performant and scalable systems to analyze the vast amount of data generated every day. kreuzwerker was commissioned to assess one of the analytics systems based on Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service), identify areas for cost optimization, and advise on best practices for operation and maintenance.

The Problem

Our client offers a broad range of tools to monitor performance indicators of marketing campaigns, such as clicks, impressions and conversion rates. The project targeted the system responsible for tracking, analyzing, and building customer-facing reports of click and impression data. The data is streamed into the system through Apache Kafka, enriched by Spring Boot microservices, indexed in an Elasticsearch cluster running on Amazon OpenSearch Service, and regularly aggregated by Spark jobs. Using Elasticsearch as the analytics engine led to good performance and stability, especially during times with high load such as Black Friday. Furthermore, choosing the managed Amazon OpenSearch Service greatly simplified cluster maintenance in the early stages, despite the team having limited experience with it.

The focus of this project was to build on top of this foundation and take the system to the next level, addressing the following topics:

  • Review the current cluster configuration and identify possible inefficiencies and opportunities to save on costs.
  • Improve maintenance automation, especially for creating staging environments for testing and QA, instead of using the AWS console for all the operations.
  • Assess the indexing and query setup to reduce the impact of the aggregation jobs on query latency, which occasionally affected the user experience.
  • Advise on how to run benchmarking experiments in Elasticsearch in a timely and reproducible manner, to evaluate alternative queries and settings.

The Solution

Cluster Configuration

We reviewed the Amazon OpenSearch Service settings, monitored the cluster statistics and usage over time, and validated capacity planning. The outcome was a list of cost-saving improvements worth up to 10,000 euros a month. Finding the Elasticsearch instance types overprovisioned in CPU and memory, we proposed cluster configurations better matched to actual needs, along with guidelines for evaluating them iteratively without affecting the production system. Changes to the subscription model and the storage setup (EBS downsizing and use of UltraWarm) could achieve further savings. Moreover, the recommended enhancements to cluster security and resilience could reduce the hidden costs of operational interruptions.

Maintenance

Infrastructure as Code is a key element of DevOps and a kreuzwerker best practice. We provided our client’s engineers with a Terraform template to provision, update, and tear down an Amazon OpenSearch cluster in a given environment. The template included a Lambda function to bootstrap data from snapshots stored in Amazon S3, automating the creation of ephemeral clusters for performance tests and QA.
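To illustrate the bootstrap step, here is a minimal sketch of the request such a Lambda function might prepare, assuming the standard snapshot restore endpoint (`POST /_snapshot/<repository>/<snapshot>/_restore`). The repository, snapshot, and index names are hypothetical, the handler shape is an assumption, and signing and sending the request are deliberately left out. It is not the template delivered in the project.

```python
import json


def build_restore_request(repository, snapshot, index_pattern):
    """Return (method, path, body) for restoring indices from a snapshot.

    All three arguments are hypothetical names chosen by the caller.
    """
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Restore only the requested indices, not cluster-wide state.
        "indices": index_pattern,
        "include_global_state": False,
    }
    return "POST", path, json.dumps(body)


def handler(event, context):
    """Lambda-style entry point (assumed shape): prepares the restore call.

    A real function would sign the request with the Lambda role's
    credentials and send it to the domain endpoint.
    """
    method, path, body = build_restore_request(
        event["repository"], event["snapshot"], event["indices"]
    )
    return {"method": method, "path": path, "body": body}


# Example with made-up names.
request = handler(
    {"repository": "s3-snapshots", "snapshot": "nightly-001", "indices": "clicks-*"},
    None,
)
print(request["method"], request["path"])
```

Restoring an explicit index pattern rather than everything keeps system indices on the ephemeral cluster untouched.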

We also advised on data retention management, recommending that the home-grown script be retired in favour of Index State Management, a built-in Amazon OpenSearch Service feature. It supports more sophisticated rollover and deletion policies and leaves one less script to maintain.
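As an illustration of what such a policy can look like, the sketch below builds an Index State Management policy document with a rollover state followed by a delete state. The index pattern, ages, and sizes are placeholder assumptions, not the client's values. Rollover additionally requires a rollover alias to be configured on the managed indices, which is not shown here.

```python
import json


def build_retention_policy(index_pattern, rollover_age, rollover_size, delete_after):
    """Build an ISM policy: roll over the write index, delete old indices."""
    return {
        "policy": {
            "description": "Roll over the write index, then delete old indices",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [
                        {
                            "rollover": {
                                "min_index_age": rollover_age,
                                "min_size": rollover_size,
                            }
                        }
                    ],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": delete_after},
                        }
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


# Placeholder values for illustration only.
policy = build_retention_policy("clicks-*", "1d", "50gb", "90d")
print(json.dumps(policy, indent=2))
```

The resulting JSON would be sent to the domain's ISM policy endpoint, which replaces the scheduling and deletion logic a custom script would otherwise carry.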

Indexing and query

We tackled the query latency problem by reviewing the index setup first. We discussed a different sharding strategy with the team for their indexing-heavy use case, recommending bigger and more evenly distributed shards to handle the load caused by the Spark jobs. A series of small but impactful changes to the index mapping and settings could lead to further improvements. We then evaluated options to reduce the impact of the jobs on cluster performance, ranging from identifying better time windows for their execution to rethinking the aggregation process using Elasticsearch features such as rollups or transforms.
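The reasoning behind "bigger and more evenly distributed shards" can be sketched as simple arithmetic: pick a target shard size, derive the number of primary shards from the index size, and round up to a multiple of the data node count so that every node holds the same number of primaries. The figures below are hypothetical and the target shard size is an input to tune, not a value from the project.

```python
import math


def primary_shard_count(index_size_gib, target_shard_gib, data_nodes):
    """Number of primary shards for an index, rounded up to a multiple
    of the data node count so shards spread evenly across nodes."""
    if index_size_gib <= 0 or target_shard_gib <= 0 or data_nodes <= 0:
        raise ValueError("all arguments must be positive")
    by_size = math.ceil(index_size_gib / target_shard_gib)
    return math.ceil(by_size / data_nodes) * data_nodes


# Hypothetical example: a 400 GiB index, 40 GiB target shards, 6 data nodes.
shards = primary_shard_count(400, 40, 6)
print(shards, "primary shards of about", round(400 / shards, 1), "GiB each")
```

Rounding up to a multiple of the node count trades slightly smaller shards for an even distribution, which avoids a single node becoming the hot spot during heavy jobs.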

Benchmarking

We created a runbook to semi-automate benchmarking experiments in Amazon OpenSearch and increase trust in the results. Based on our experience and best practices, we tailored the runbook to the click and impression analytics system and reviewed it with the team to underline the repeatability and reproducibility of the methodology.
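To give a flavour of what a repeatable measurement step can look like, here is a minimal sketch of a latency harness: warm-up runs are discarded, then a fixed number of timed runs is summarized by median and 95th percentile. The function names and the stand-in query are assumptions for illustration; the actual runbook is not reproduced here.

```python
import math
import statistics
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(run_query, warmup=5, iterations=50):
    """Time run_query() repeatedly and summarize latencies in milliseconds."""
    for _ in range(warmup):
        run_query()  # warm caches; results are discarded
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "iterations": iterations,
        "median_ms": statistics.median(latencies),
        "p95_ms": percentile(latencies, 95),
        "max_ms": latencies[-1],
    }


def fake_query():
    """Stand-in for a real search request against the cluster."""
    sum(range(1000))


result = benchmark(fake_query, warmup=2, iterations=20)
print(result)
```

Fixing the warm-up count, iteration count, and summary statistics up front is what makes two runs comparable when evaluating alternative queries or settings.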

The Upshot

Without question, our client made the right choice in putting Elasticsearch at the core of their analytics system for Digital Marketing data. Using Amazon OpenSearch Service did speed up time-to-market, but left room for optimization. As a certified AWS and Elasticsearch partner, kreuzwerker worked together with the client’s engineers to increase both the operational efficiency and the cost efficiency of the system. The result was an actionable list of configuration and process improvements expected to nearly halve AWS costs, along with tools to enhance automation and confidence in maintenance operations.