It was a sunny August day when I claimed I would publish the conclusion to this blog series of Elasticsearch exercises “in a few weeks”. Time flies, doesn’t it? Sorry to keep you waiting.
Here is a quick recap: we started by operating and configuring a cluster, then we indexed some data into it, and, finally, we played with mappings and text analysis.
Today, we will practice with searches and aggregations, which are the last two Exam Objectives of the Elastic Certified Engineer exam to be covered.
DISCLAIMER: All exercises in this series are based on the Elasticsearch version currently used for the exam, that is, v7.2. Please always refer to the Elastic certification FAQs to check the latest exam version.
“You Know, for Search”
As the name suggests, searches are the heart of Elasticsearch… Don’t get distracted by the sheer number of features that have been added in the past 10 years; “search” is still its primary function. Follow me in this 2-minute crash introduction.
Elasticsearch exposes its search capabilities via its rich Query DSL. A query is defined as a JSON object and can contain leaf or compound query clauses. You can use leaf clauses to look into a particular field for some desired values (text, numbers, dates), and use compound clauses to combine multiple queries with logical operators or to alter their behaviour. Each of these clauses can run either in a filter or a query context. You use the filter context when all you want to know is whether a document matches a clause. That’s a yes or a no. But if you also want to know how well that document matched the clause, then you need the query context, which calculates the relevance score (a positive float). The higher the score, the more relevant the document is to the search.
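To make the two contexts concrete, here is a minimal sketch in Python that builds a request body as a plain dictionary and prints it as JSON, ready to be pasted under a `GET <index>/_search` line in Kibana Dev Tools. The field names (`message`, `bytes`) are assumptions based on the sample web logs, not something the exam prescribes.

```python
import json

# Hypothetical request body: one clause per context inside a bool query.
body = {
    "query": {
        "bool": {
            # Query context: this clause contributes to the relevance score.
            "must": [{"match": {"message": "firefox"}}],
            # Filter context: a plain yes/no check, no scoring involved.
            "filter": [{"range": {"bytes": {"gte": 1000}}}],
        }
    }
}

print(json.dumps(body, indent=2))
```

Only the `must` clause influences the score of a hit; the `filter` clause just decides whether a document is allowed into the result set.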
The exercises in this section are designed to assess your proficiency in writing searches for Elasticsearch. But before we start, let’s set up your training environment:
- Spin up your cluster (my GitHub repo might come in handy for this task).
- Open Kibana and add the three sample data sets that come out of the box.
- Go to the Kibana Dev Tools page.
And off we go!
Exercise 1
We start by searching into analysed text fields, such as the message of an application log or the address field of a user profile. The way you do it in Elasticsearch is by writing full-text queries.
⚠ : You don’t remember what an analysed text field is and want to practice more with it? Check out Exercise 3 of the previous post of this series.
# ** EXAM OBJECTIVE: QUERIES **
# GOAL: Create search queries for analyzed text, highlight, pagination, and sort
# REQUIRED SETUP:
# (i) a running Elasticsearch cluster with at least one node and a Kibana instance,
# (ii) add the "Sample web logs" and "Sample eCommerce orders" to Kibana
### Run the next queries on the `kibana_sample_data_logs` index
Let’s search the `kibana_sample_data_logs` index for logs that contain the string “Firefox” in their message. Because `message` is an analysed text field, the query to use is the match query, which is the standard query for full-text searches.
# Search for documents with the `message` field containing the string "Firefox"
What do you think would happen if you searched for “firefox” with a lowercase “f”? Nothing would change, because the standard analyzer applied to the message field lowercases all tokens anyway, both at index time and at search time.
By default, a query response can include up to ten results. But what if you wanted to show up to 50 results? And then what if you wanted to fetch the next 50?
# Search for documents with the `message` field containing the string "Firefox" and return (up to) 50 results.
# As above, but return up to 50 results with an offset of 50 from the first
⚠ : Deep pagination is highly inefficient when realised using the `from` and `size` parameters, as memory consumption and response time grow with the value of the parameters. The best practice is to use the `search_after` parameter instead.
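As a rough sketch of how `search_after` pagination fits together, the Python helpers below build the request body for a first page and then for the following page, starting from the sort values of the last hit received. The `message` and `timestamp` field names come from the sample web logs; using `_id` as a tiebreaker and the shape of the fake hit are my assumptions for illustration.

```python
import json


def first_page(size):
    # search_after needs a deterministic sort; `_id` as the tiebreaker
    # is an assumption made for this sketch.
    return {
        "size": size,
        "query": {"match": {"message": "firefox"}},
        "sort": [{"timestamp": "asc"}, {"_id": "asc"}],
    }


def next_page(previous_body, last_hit):
    # Same query and sort as before, resuming right after the
    # sort values of the last hit on the previous page.
    body = dict(previous_body)
    body["search_after"] = last_hit["sort"]
    return body


# Hypothetical last hit of the first page, shaped like an Elasticsearch hit.
last_hit = {"_id": "abc123", "sort": [1532239602742, "abc123"]}
second = next_page(first_page(50), last_hit)
print(json.dumps(second, indent=2))
```

Note that no `from` offset is involved: each page is located by the sort values of the previous one, so the cost does not grow with the page number.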
# Search for documents with the `message` field containing the strings "Firefox" or "Kibana"
Did you write a compound query to fulfil the instruction above? It wasn’t necessary. If the match query defines multiple terms (e.g., “firefox kibana”), then any document that matches at least one term is considered a valid response.
And if I want you to search for documents that match all of the query terms? Or two-thirds of them? Good news: you still don’t need compound queries, but only match query parameters to configure.
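The parameters in question live in the long form of the match query, where the text moves under a `query` key. A small sketch, assuming the `message` field of the sample web logs and a hypothetical helper name:

```python
import json


def match_body(text, **options):
    # Long form of the match query: the searched text goes under "query",
    # next to optional parameters such as "operator".
    return {"query": {"match": {"message": {"query": text, **options}}}}


any_term = match_body("firefox kibana")  # default operator is "or"
all_terms = match_body("firefox kibana", operator="and")
two_of_three = match_body("firefox kibana linux", minimum_should_match=2)

print(json.dumps(two_of_three, indent=2))
```

`operator` switches between "at least one term" and "all terms", while `minimum_should_match` covers everything in between.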
# Search for documents with the `message` field containing both the strings "Firefox" and "Kibana"
# Search for documents with the `message` field containing at least two of the following strings: "Firefox", "Kibana", "159.64.35.129"
When you are searching for a short word in a long text, it won’t be easy to spot the exact location of the match in your search results. Fortunately, Elasticsearch offers highlighters to make matches stand out more clearly. With the highlighting feature enabled, every search hit includes a `highlight` element where every match is wrapped in configurable tags. For example, if I search for “Firefox” and “Kibana”, this is what a valid response might look like:
...
"_source" : {
  "message" : "1.2.3.4 - - [2018-07-22T06:06:42.742Z] GET /kibana.tar.gz HTTP/1.1 200 8489 - Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Firefox/6.0a1"
},
"highlight" : {
  "message" : [
    "1.2.3.4 - - [2018-07-22T06:06:42.742Z] GET /<em>kibana</em>.tar.gz HTTP/1.1",
    "200 8489 - Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) <em>Firefox</em>/6.0a1"
  ]
},
...
Ok then, let’s highlight something.
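If you are unsure where the highlighting options go, this sketch shows one possible request body. The `<mark>` tags are an arbitrary choice of mine; without `pre_tags` and `post_tags`, Elasticsearch falls back to the `<em>` tags seen in the response above.

```python
import json

body = {
    "query": {"match": {"message": "firefox kibana"}},
    "highlight": {
        # Custom wrapping tags (arbitrary choice for this sketch).
        "pre_tags": ["<mark>"],
        "post_tags": ["</mark>"],
        # Fields to highlight; an empty object means default settings.
        "fields": {"message": {}},
    },
}

print(json.dumps(body, indent=2))
```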
# Search for documents with the `message` field containing the strings "Firefox" or "Kibana"
# As above, but also return the highlights for the `message` field
# As above, but also wrap the highlights in "{{" and "}}"
A phrase is an ordered sequence of terms. If you need to search for documents containing that phrase, a match query won’t be enough. But there are other full-text queries out there.
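One of those other queries is `match_phrase`, which requires the terms to appear in the given order. A minimal sketch, with a phrase I picked arbitrarily from the sample log line shown earlier:

```python
import json

# All terms must appear, adjacent and in this order.
exact_phrase = {"query": {"match_phrase": {"message": "Mozilla/5.0 (X11; Linux"}}}

# Long form with "slop": tolerate up to one position of distance
# between the terms of the phrase.
loose_phrase = {
    "query": {
        "match_phrase": {
            "message": {"query": "Mozilla Linux", "slop": 1}
        }
    }
}

print(json.dumps(exact_phrase, indent=2))
print(json.dumps(loose_phrase, indent=2))
```

The phrase is analysed like any other full-text input, so punctuation and letter case follow the rules of the analyzer applied to the field.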
# Search for documents with the `message` field containing the phrase "HTTP/1.1 200 51"
To conclude, I want you to sort results based on different sort orders and sort modes.
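As a reminder of the syntax, here is a sketch of the two notions. The sort order says which direction to sort in, and the sort mode says which value to pick when the field holds an array. The field names are assumed from the sample data sets, and `avg` is just one of the available modes.

```python
import json

# Two sort criteria, applied in order: first by bytes, then by timestamp.
by_bytes_then_time = {
    "query": {"match": {"message": "firefox"}},
    "sort": [
        {"bytes": {"order": "desc"}},
        {"timestamp": {"order": "asc"}},
    ],
}

# Multi-valued field: "mode" picks the value used for sorting
# (min, max, sum, avg or median).
by_average_price = {
    "query": {"match_all": {}},
    "sort": [{"products.base_price": {"order": "asc", "mode": "avg"}}],
}

print(json.dumps(by_bytes_then_time, indent=2))
print(json.dumps(by_average_price, indent=2))
```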
# Search for documents with the `message` field containing the phrase "HTTP/1.1 200 51", and sort the results by the `machine.os` field in descending order
# As above, but also sort the results by the `timestamp` field in ascending order
### Run the next queries on the `kibana_sample_data_ecommerce` index
# Search for documents with the `day_of_week` field containing the string "Monday"
# As above, but sort the results by the `products.base_price` field in descending order, picking the lowest value of the array
Exercise 2
We will now practice with term-level queries, searching for precise values in non-analysed fields. Also, we will combine multiple query clauses in more complex compound queries.
# ** EXAM OBJECTIVE: QUERIES **
# GOAL: Create search queries for terms, numbers, dates, fuzzy, and compound queries
# REQUIRED SETUP:
# (i) a running Elasticsearch cluster with at least one node and a Kibana instance,
# (ii) add the "Sample web logs" and "Sample flight data" to Kibana
### Run the next queries on the `kibana_sample_data_logs` index
The log documents in `kibana_sample_data_logs` don’t contain only analysed text fields, but also structured data such as numbers (size in bytes, response codes), dates (timestamp), IP addresses and keywords. Usually, queries on structured data don’t contribute to the score of a response, but rather, they say whether a document should be in the results set or not.
⚠ : A best practice is to run term-level queries in the filter context. This will speed up search performance because Elasticsearch doesn’t need to calculate scores and can automatically cache frequently used filters. To run a query clause in the filter context, you need to wrap it in the `filter` clause of a boolean (`bool`) query.
As a start, I want you to search for log documents with a 4xx Client Error response code. The response code is a number. Range queries, anyone?
# Filter documents with the `response` field greater than or equal to 400 and less than 500
# As above, but add a second filter for documents with the `referer` field matching the string "http://twitter.com/success/guion-bluford"
Prefix queries allow you to search for text that begins with the given sequence of characters. You can use them, for instance, to search for URLs within a particular domain or to find all companies with a name starting with “kreuzwe”.
If you run a prefix query on an analysed field, then the prefix will be checked against every single token, and not on the string as a whole. As an example, assume that your analysed text field contains “elasticsearch is pure fun”. If you run the prefix query “fu” on that field, then you will get a match because of the “fun” token.
To achieve the same behaviour of a prefix query on analysed text, you must use another query.
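The difference is easier to see side by side. In this sketch, `title` is a hypothetical analysed text field containing “elasticsearch is pure fun”, and `title.keyword` is an assumed keyword sub-field holding the same string unanalysed.

```python
import json

# Prefix on the analysed field: checked token by token,
# so "fu" matches thanks to the "fun" token.
token_prefix = {"query": {"prefix": {"title": "fu"}}}

# Prefix on the keyword sub-field: checked against the whole string,
# so the value must start with exactly these characters.
string_prefix = {"query": {"prefix": {"title.keyword": "elasticsearch is"}}}

# Full-text alternative: the last term is treated as a prefix,
# the preceding terms as a phrase.
phrase_prefix = {"query": {"match_phrase_prefix": {"title": "elasticsearch is pu"}}}

for body in (token_prefix, string_prefix, phrase_prefix):
    print(json.dumps(body))
```

Keep in mind that the prefix query does not analyse its input, so on a keyword field it is case-sensitive.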
# Filter documents with the `referer` field that starts with "http://twitter.com/success"
# Filter documents with the `request` field that starts with "/people"
Have you noticed that some documents in the index have the `memory` field set to null? I want you to write a query to fetch all the documents where this field has a value, and then another query that does the exact opposite. Documentation for both cases exists!
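In case the pun was too subtle, the query to look at is `exists`. A sketch with two hypothetical helpers that work for any field name:

```python
import json


def has_value(field):
    # Filter context: keep documents with at least one indexed value.
    return {"query": {"bool": {"filter": [{"exists": {"field": field}}]}}}


def has_no_value(field):
    # Negation: there is no "missing" query, so wrap exists in must_not.
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


print(json.dumps(has_value("memory"), indent=2))
print(json.dumps(has_no_value("memory"), indent=2))
```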
# Filter documents with the `memory` field containing any indexed value
# (opposite of above) Filter documents with the `memory` field not containing any indexed value
If you have made it to here, you have used boolean queries for writing filters and negations. Let’s have a look at the other two boolean types: the logical “or” and “and”. Remember that you can nest and combine as many boolean queries as you like, but also remember that this comes with resource and performance costs.
# Search for documents with the `agent` field containing the string "Windows" and the `url` field containing the string "name:john"
# As above, but also filter documents with the `phpmemory` field containing any indexed value
# Search for documents that have either the `response` field greater than or equal to 400 or the `tags` field having the string "error"
# Search for documents with the `tags` field that does not contain any of the following strings: "warning", "error", "info"
Time to dust off range queries again, but now with some date math.
# Filter documents with the `timestamp` field containing a date between today and one week ago
Typos happn, and fuzzy matching is a human-friendly solution to deal with them. Elasticsearch supports fuzziness for full-text and term-level queries. Let’s use it on a keyword field.
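To get a feeling for what an edit distance of 1 or 2 means, the sketch below computes the plain Levenshtein distance between two strings and builds a term-level `fuzzy` clause. The helper names are mine; also note that Elasticsearch, by default, counts a swap of two adjacent characters as a single edit, which this simplified function does not.

```python
import json


def levenshtein(a, b):
    # Classic dynamic programming, one row at a time:
    # insertions, deletions and substitutions all cost 1.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_clause(field, value, max_edits):
    # Term-level fuzzy query; fuzziness is the maximum edit distance.
    return {"fuzzy": {field: {"value": value, "fuzziness": max_edits}}}


print(levenshtein("Sydney", "Sidney"), levenshtein("Sydney", "Sidnei"))
print(json.dumps(fuzzy_clause("OriginCityName", "Sidnei", 2)))
```

Since keyword fields are not analysed, letter case counts as a difference too.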
### Run the next queries on the `kibana_sample_data_flights` index
# Filter documents with either the `OriginCityName` or the `DestCityName` fields matching the string "Sydney"
# As above, but allow inexact fuzzy matching, with a maximum allowed “Levenshtein Edit Distance” set to 2. Test that the query strings "Sydney", "Sidney" and "Sidnei" always return the same number of results
Exercise 3
With this last exercise, we cover what remains of the “Queries” Exam Objective.
# ** EXAM OBJECTIVE: QUERIES **
# GOAL: Use scroll API, search templates, script queries
# REQUIRED SETUP:
# (i) a running Elasticsearch cluster with at least one node and a Kibana instance,
# (ii) add the "Sample web logs" and "Sample flight data" to Kibana
If you need to fetch a huge number of documents with a single search, increasing the `size` parameter may not be sufficient (by default, `from` + `size` cannot exceed 10,000, as set by the `index.max_result_window` index setting) and is certainly not resource-efficient (the deep pagination problem). The recommended solution is to rely on the Elasticsearch scroll API.
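The scroll API is a two-step conversation: the first request opens a search context, and every following request sends back the latest scroll id. A sketch of the two requests as (method, path, body) tuples, where the helper names and the placeholder scroll id are hypothetical:

```python
import json


def start_scroll(index, size, keep_alive):
    # First request: a normal search plus the scroll query parameter,
    # which tells Elasticsearch how long to keep the context alive.
    path = f"/{index}/_search?scroll={keep_alive}"
    return ("POST", path, {"size": size, "query": {"match_all": {}}})


def continue_scroll(scroll_id, keep_alive):
    # Follow-up requests: no index and no query, only the scroll id
    # returned by the previous response.
    return ("POST", "/_search/scroll", {"scroll": keep_alive, "scroll_id": scroll_id})


first = start_scroll("kibana_sample_data_logs", 500, "1m")
following = continue_scroll("<scroll id from the previous response>", "1m")

for method, path, body in (first, following):
    print(method, path)
    print(json.dumps(body, indent=2))
```

Each follow-up request renews the keep-alive, so the value only needs to cover the time between two consecutive batches.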
# Search for all documents in all indices
# As above, but use the scroll API to return the first 100 results while keeping the search context alive for 2 minutes
# Use the scroll id included in the response to the previous query and retrieve the next batch of results
To introduce the next topic, let’s start with something we already did in Exercise 2.
### Run the next queries on the `kibana_sample_data_logs` index
# Filter documents with the `response` field greater than or equal to 400
If you want to filter only 5xx Server errors, you will need to replace the “400” with a “500”, right? Right. And if you want to filter only 4xx Client errors, you will need to add an upper bound to the range query, right? Right. And if you want to add a second filter to select only documents tagged with “security”, you will need to change your query again, right? Right.
Now, imagine that you are developing an application that should support all the cases listed above, possibly more. One way to proceed is to build an ad-hoc query with hard-coded values for each case. Another, much better way is to use search templates and parameterize query values if not entire query clauses.
# Create a search template for the above query, so that the template (i) is named "with_response_and_tag", (ii) has a parameter "with_min_response" to represent the lower bound of the `response` field, (iii) has a parameter "with_max_response" to represent the upper bound of the `response` field, (iv) has a parameter "with_tag" to represent a possible value of the `tags` field
# Test the "with_response_and_tag" search template by setting the parameters as follows: (i) "with_min_response": 400, (ii) "with_max_response": 500 (iii) "with_tag": "security"
The search template notation supports some more advanced logic such as concatenation of array values, definition of defaults and conditional clauses. I find the last one particularly useful, and I’ll show you why.
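Here is a sketch of a conditional clause on a different field, so as not to spoil the exercise. The template source is a string, because with Mustache sections in it the text is no longer valid JSON. The `bytes` field, the template id `bytes_between`, and the parameter names are my assumptions; the `render` function is a crude stand-in for the Mustache engine, there only to check locally that both variants produce valid JSON.

```python
import json
import re

# Template source: the upper bound exists only if max_bytes is set.
SOURCE = (
    '{"query": {"bool": {"filter": [{"range": {"bytes": {'
    '"gte": {{min_bytes}}'
    '{{#max_bytes}}, "lt": {{max_bytes}}{{/max_bytes}}'
    ' } } }] } } }'
)

# Requests to store and to run the template, as (method, path, body).
store_request = ("POST", "/_scripts/bytes_between",
                 {"script": {"lang": "mustache", "source": SOURCE}})
run_request = ("GET", "/kibana_sample_data_logs/_search/template",
               {"id": "bytes_between", "params": {"min_bytes": 1000}})


def render(source, params):
    # Simplified local rendering: keep a {{#name}}...{{/name}} section
    # only if the parameter is set, then substitute {{name}} variables.
    def section(match):
        return match.group(2) if params.get(match.group(1)) else ""

    text = re.sub(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", section, source, flags=re.S)
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), text)


open_ended = json.loads(render(SOURCE, {"min_bytes": 1000}))
bounded = json.loads(render(SOURCE, {"min_bytes": 1000, "max_bytes": 5000}))
print(json.dumps(open_ended))
print(json.dumps(bounded))
```

The spaces between the closing braces at the end of the source are deliberate: they keep JSON braces from sitting right next to a Mustache tag.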
# Update the "with_response_and_tag" search template, so that (i) if the "with_max_response" parameter is not set, then don't set an upper bound to the `response` value, and (ii) if the "with_tag" parameter is not set, then do not apply that second filter at all
# Test the "with_response_and_tag" search template by setting only the "with_min_response" parameter to 500
# Test the "with_response_and_tag" search template by setting the parameters as follows: (i) "with_min_response": 500, (ii) "with_tag": "security"
Aggregations
Not only does Elasticsearch have powerful search capabilities, but it is also extremely efficient in aggregating data to build analytic information. The Elasticsearch documentation groups the supported aggregations into four families:
- Metric, to compute metrics over a set of documents.
- Bucket, to build buckets of documents that meet the same criteria.
- Pipeline, to aggregate the output of other aggregations.
- Matrix, to produce matrix results.
In the only long exercise of this section, you will work with the first three aggregation families, which are the ones required for the Elastic Certified Engineer exam.
Exercise 4
Straight to the intro!
# ** EXAM OBJECTIVE: AGGREGATIONS **
# GOAL: Create metrics, bucket, and pipeline aggregations
# REQUIRED SETUP:
# (i) a running Elasticsearch cluster with at least one node and a Kibana instance,
# (ii) add the "Sample flight data" to Kibana
### Run the next queries on the `kibana_sample_data_flights` index
Metrics aggregations can extract values from documents and combine them into metrics. For the sake of this exercise, we will mainly focus on numeric metrics aggregations, but there is also more than that.
# Create an aggregation named "max_distance" that calculates the maximum value of the `DistanceKilometers` field
# Create an aggregation named "stats_flight_time" that computes stats over the value of the `FlightTimeMin` field
# Create two aggregations, named "cardinality_origin_cities" and "cardinality_dest_cities", that count the distinct values of the `OriginCityName` and `DestCityName` fields, respectively
⚠ : Keep in mind that cardinality aggregations return approximate counts, especially over large datasets.
Bucket aggregations group different documents into buckets based on desired and configurable criteria. Although Elasticsearch v7.2 offers more than 20 bucket aggregations, this exercise covers only a few, starting from the terms and histogram ones.
# Create an aggregation named "popular_origin_cities" that calculates the number of flights grouped by the `OriginCityName` field
# As above, but return only 5 buckets and in descending order
# Create an aggregation named "avg_price_histogram" that groups the documents based on their `AvgTicketPrice` field by intervals of 250
You can write multiple aggregations within the same request, and you can also nest them. If you are not sure how to do it, this code snippet from the official documentation will help you out.
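As a quick reminder of the structure, this sketch puts two sibling aggregations in one request and nests a third one inside the first. The aggregation names are arbitrary, and the fields are assumed from the sample flight data.

```python
import json

body = {
    "size": 0,  # skip the search hits, return only the aggregations
    "aggs": {
        # First top-level aggregation: one bucket per carrier...
        "flights_per_carrier": {
            "terms": {"field": "Carrier"},
            # ...with a nested metric computed inside each bucket.
            "aggs": {
                "avg_price": {"avg": {"field": "AvgTicketPrice"}}
            },
        },
        # Second top-level aggregation, independent from the first.
        "longest_delay": {"max": {"field": "FlightDelayMin"}},
    },
}

print(json.dumps(body, indent=2))
```

The nested `aggs` key sits next to the aggregation type (`terms`), not inside it.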
# Create an aggregation named "popular_carriers" that calculates the number of flights grouped by the `Carrier` field
# Add a sub-aggregation to "popular_carriers", named "carrier_stats_delay", that computes stats over the value of the `FlightDelayMin` field for the related bucket of carriers
# Add a second sub-aggregation to "popular_carriers", named "carrier_max_delay", that shows the flight having the maximum value of the `FlightDelayMin` field for the related bucket of carriers
A date histogram aggregation is nothing more than a histogram aggregation specialised for dates. Very handy when you want to observe how certain metrics vary over time.
⚠ : Since Elasticsearch v7.2, the `interval` field of the date histogram aggregation has been deprecated in favour of the more explicit `calendar_interval` and `fixed_interval`.
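The practical difference is that calendar intervals accept only a single calendar unit (one day, one month, and so on), while fixed intervals accept multiples of fixed-length units. The hypothetical helper below sketches that rule; treating an ambiguous value such as "1d" as a calendar interval is my own choice.

```python
import json

# Single calendar units accepted by calendar_interval.
CALENDAR_UNITS = {
    "1m", "1h", "1d", "1w", "1M", "1q", "1y",
    "minute", "hour", "day", "week", "month", "quarter", "year",
}


def date_histogram(field, interval):
    # Single calendar units go to calendar_interval; anything else
    # (e.g. "12h", "90m") is treated as a fixed-length interval.
    key = "calendar_interval" if interval in CALENDAR_UNITS else "fixed_interval"
    return {"date_histogram": {"field": field, key: interval}}


monthly = date_histogram("timestamp", "1M")
half_daily = date_histogram("timestamp", "12h")
print(json.dumps(monthly))
print(json.dumps(half_daily))
```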
# Use the `timestamp` field to create an aggregation named "flights_every_10_days" that groups the number of flights by an interval of 10 days
# Use the `timestamp` field to create an aggregation named "flights_by_day" that groups the number of flights by day
# Add a sub-aggregation to "flights_by_day", named "destinations_by_day", that groups the day buckets by the value of the `DestCityName` field
Normally, the response of a bucket aggregation includes only the number of documents in each bucket. If you want to include the most relevant documents in a bucket, you need to add a top hits sub-aggregation.
# Add a sub-aggregation to the sub-aggregation "destinations_by_day", named "popular_destinations_by_day", that returns the 3 most popular documents for each bucket (i.e., ordered by their score)
# Update "popular_destinations_by_day" to display only the `DestCityName` field for each top hit object
Enter the pipeline aggregations: a powerful feature that allows for further computation on the results of other aggregations. The only difficult part of writing a pipeline aggregation is to understand how to specify the input of the pipeline into the `buckets_path` parameter. I found this summary in the Elastic documentation, along with the follow-up example, particularly useful.
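In short, a `buckets_path` walks from one aggregation name to the next using `>` as separator, ending at a metric. The sketch below finds the day with the highest average ticket price; the aggregation names are arbitrary and the fields are assumed from the sample flight data.

```python
import json

body = {
    "size": 0,
    "aggs": {
        "flights_by_day": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "1d"},
            "aggs": {
                "avg_price": {"avg": {"field": "AvgTicketPrice"}}
            },
        },
        # Sibling pipeline: declared next to "flights_by_day" and
        # pointing at a metric inside its buckets.
        "day_with_highest_avg_price": {
            "max_bucket": {"buckets_path": "flights_by_day>avg_price"}
        },
    },
}

print(json.dumps(body, indent=2))
```

To work on the number of documents in each bucket rather than on a metric, the documentation describes a special `_count` path that takes the place of the metric name.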
Let’s write our pipeline aggregations, and you are free to go!
# Remove the "popular_destinations_by_day" sub-sub-aggregation from "flights_by_day"
# Add a pipeline aggregation to "flights_by_day", named "most_popular_destination_of_the_day", that identifies the "destinations_by_day" bucket with the most documents for each day
# Add a pipeline aggregation named "day_with_most_flights" that identifies the "flights_by_day" bucket with the most documents
# Add a pipeline aggregation named "day_with_the_most_popular_destination_over_all_days" that identifies the "flights_by_day" bucket with the largest "most_popular_destination_of_the_day" value
Conclusions
This blog post concludes my series of exercises to prepare for the Elastic Certified Engineer exam. In particular, we covered the two Exam Objectives “Queries” and “Aggregations”. The instruction-only version of all exercises is available in this GitHub repo, which I’ll try to keep up-to-date. By the way, you are more than welcome to contribute to the repo, for example, with pull requests of new exercises.
Time to say goodbye. I received very positive feedback on this blog series, and I know for a fact that it helped many people to improve their knowledge and get their certification. I wish you all the same!
Until next time, secure your cluster, and stay Elastic.
Credits for cover image go to: Unsplash