Exercises for the Elastic Certified Engineer Exam: Store Data into Elasticsearch

16.07.2019

This is the second installment of a four-part series of exercises on how to prepare for the Elastic Certified Engineer exam. In the previous blog post, we practiced our ability to deploy, configure, and operate an Elasticsearch cluster. Today we’ll start working with documents, covering the “Indexing Data” Exam Objective of the certification.

Before we begin, let’s get a few definitions out of the way. In Elasticsearch, the basic unit of information to persist data is a (JSON) document. Documents with shared purpose and characteristics can be collected into an index. For example, one might have an index for application logs, one for system events, and another for movie data. The mapping of an index defines the schema of its documents and the way they should be searched. Finally, the process of storing data into Elasticsearch as documents and making it searchable is called indexing.

So much for the theory. Let’s get our hands dirty, shall we?

DISCLAIMER: All exercises in this series are based on the Elasticsearch version currently used for the exam, that is, v6.5. Please always refer to the Elastic certification FAQs to check the latest exam version.

Index data

The exercises that follow will test your ability to create, update, and delete indices and documents in Elasticsearch. But before that, we need a running cluster. You can quickly spin one up using the resources in the GitHub repo created for this blog series.

Exercise 1

The Indices APIs and scripts are your best allies for manipulating indices in Elasticsearch. You can practice with both simply by following the exercise instructions in the code blocks below. (Pro tip: copy and paste the instructions into your Kibana Dev Tools console and work directly from there.)

# ** EXAM OBJECTIVE: INDEXING DATA **
# GOAL: Create, update and delete indices while satisfying a given set of requirements
# REQUIRED SETUP: 
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting with `hamlet`

We start by creating a new index, taking into account its scalability, resiliency, and performance. More precisely, I want you to configure the index with one primary shard, in order to avoid oversharding, and three replicas, to increase failover capacity and read performance.

⚠ : Since Elasticsearch v7.x, the number of shards per index is set to one by default (see “Elasticsearch 7.0.0 released”).

# Create the index `hamlet-raw` with one primary shard and three replicas
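
One possible way to solve this task (a sketch, not the only valid answer) is the Create Index API:

PUT hamlet-raw
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 3
  }
}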

Now, index your first document into `hamlet-raw`, of course within its default type `_doc`.

⚠ : Indices in Elasticsearch used to support multiple subpartitions called types, which have been deprecated in v7.x and will be completely removed in v8.x. (see “Removal of mapping types”). In Elasticsearch v6.x, an index can have only one type, preferably named _doc.

# Add a document to `hamlet-raw`, so that the document (i) has id "1", (ii) has default type, (iii) has one field named `line` with value "To be, or not to be: that is the question"
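
A minimal sketch of one way to do it, indexing the document with an explicit id under the `_doc` type:

PUT hamlet-raw/_doc/1
{
  "line": "To be, or not to be: that is the question"
}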

To update the document, you can choose between two APIs, depending on whether you want to update one document or multiple documents at a time.

# Update the document with id "1" by adding a field named `line_number` with value "3.1.64"
# Add a new document to `hamlet-raw`, so that the document (i) has the id automatically assigned by Elasticsearch, (ii) has default type, (iii) has a field named `text_entry` with value "Whether tis nobler in the mind to suffer", (iv) has a field named `line_number` with value "3.1.66"
# Update the last document by setting the value of `line_number` to "3.1.65"
# In one request, update all documents in `hamlet-raw` by adding a new field named `speaker` with value "Hamlet"
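
For instance, the first and the last of the tasks above could be solved along these lines (one possible sketch; remember that in v6.x the Update API still includes the type in the path):

# Partial update of document "1", adding the `line_number` field
POST hamlet-raw/_doc/1/_update
{
  "doc": { "line_number": "3.1.64" }
}

# Update all documents of the index in one request
POST hamlet-raw/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.speaker = 'Hamlet'"
  }
}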

The update APIs are more than enough to add new fields to a document or to change the value of existing fields. However, for more complex manipulations you will need more expressive power, which comes from Painless scripts (pun intended). Painless is a simple, secure scripting language with a Java-like syntax, developed and optimised specifically for Elasticsearch. The script examples in the Painless documentation can give you some inspiration for solving the next exercise task.

# Update the document with id "1" by renaming the field `line` into `text_entry`
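
A common Painless idiom for renaming a field is to remove the old key from `ctx._source` and assign the returned value to the new key; a sketch of how that could look:

POST hamlet-raw/_doc/1/_update
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.text_entry = ctx._source.remove('line')"
  }
}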

We need more data. We are going to use lines from Hamlet by William Shakespeare, structured in documents that specify the line number in the play, the line itself, and the speaker. Let’s use the _bulk API to create the index and load the documents in a single request.

# Create the index `hamlet` and add some documents by running the following _bulk command
PUT hamlet/_doc/_bulk
{"index":{"_index":"hamlet","_id":0}}
{"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
{"index":{"_index":"hamlet","_id":1}}
{"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
{"index":{"_index":"hamlet","_id":2}}
{"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
{"index":{"_index":"hamlet","_id":3}}
{"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet our dear brothers death"}
{"index":{"_index":"hamlet","_id":4}}
{"line_number":"1.2.2","speaker":"KING CLAUDIUS","text_entry":"The memory be green, and that it us befitted"}
{"index":{"_index":"hamlet","_id":5}}
{"line_number":"1.3.1","speaker":"LAERTES","text_entry":"My necessaries are embarkd: farewell:"}
{"index":{"_index":"hamlet","_id":6}}
{"line_number":"1.3.4","speaker":"LAERTES","text_entry":"But let me hear from you."}
{"index":{"_index":"hamlet","_id":7}}
{"line_number":"1.3.5","speaker":"OPHELIA","text_entry":"Do you doubt that?"}
{"index":{"_index":"hamlet","_id":8}}
{"line_number":"1.4.1","speaker":"HAMLET","text_entry":"The air bites shrewdly; it is very cold."}
{"index":{"_index":"hamlet","_id":9}}
{"line_number":"1.4.2","speaker":"HORATIO","text_entry":"It is a nipping and an eager air."}
{"index":{"_index":"hamlet","_id":10}}
{"line_number":"1.4.3","speaker":"HAMLET","text_entry":"What hour now?"}
{"index":{"_index":"hamlet","_id":11}}
{"line_number":"1.5.2","speaker":"Ghost","text_entry":"Mark me."}
{"index":{"_index":"hamlet","_id":12}}
{"line_number":"1.5.3","speaker":"HAMLET","text_entry":"I will."}

Now for the challenging part of the exercise: you’re going to create a more complex script, which modifies documents in a different way depending on the value of a certain field. Also, you are going to store this script in the cluster state, so that you can always reuse it later for updating all documents. If you need some inspiration, have a look at the Painless documentation or drop me a comment.

⚠ : During the exam, always test your scripts before applying them to the original index: if something goes wrong, you won’t have an “undo” option to restore it. As a further precaution, you might also consider creating a backup of the cluster at the beginning of the exam.

# Create a script named `set_is_hamlet` and save it into the cluster state. The script (i) adds a field named `is_hamlet` to each document, (ii) sets the field to "true" if the document has `speaker` equal to "HAMLET", (iii) sets the field to "false" otherwise
# Update all documents in `hamlet` by running the `set_is_hamlet` script
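
One way to approach these two tasks (a sketch, not the only valid solution):

# Store the script in the cluster state
PUT _scripts/set_is_hamlet
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.speaker == 'HAMLET') { ctx._source.is_hamlet = true } else { ctx._source.is_hamlet = false }"
  }
}

# Apply it to every document of the index
POST hamlet/_update_by_query
{
  "script": { "id": "set_is_hamlet" }
}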

Pretty convenient, the _update_by_query API, don’t you think? Do you also know how to use its counterpart for deletion?

# Remove from `hamlet` the documents that have either "KING CLAUDIUS" or "LAERTES" as the value of `speaker`
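
A possible sketch with the _delete_by_query API; it queries the `speaker.keyword` sub-field, assuming the default dynamic mapping created it:

POST hamlet/_delete_by_query
{
  "query": {
    "terms": {
      "speaker.keyword": ["KING CLAUDIUS", "LAERTES"]
    }
  }
}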

Exercise 2

Every index in Elasticsearch has its own settings and schema, including such things as the number of shards and replicas, the refresh period, and the type of data you are going to put into it. Of course, you can set this information when you create the index. But if you already know that multiple indices will have some characteristics in common, the best practice is to define an index template that applies to them automatically. A typical use case is time series data such as application logs, where new indices of the same kind are created at regular intervals.

# ** EXAM OBJECTIVE: INDEXING DATA ** 
# GOAL: Create index templates that satisfy a given set of requirements
# REQUIRED SETUP:
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting with `hamlet`

As you may have guessed, this exercise is all about index templates. Let’s create and test one, while keeping an eye on the documentation.

# Create the index template `hamlet_template`, so that the template (i) matches any index that starts with "hamlet_" or "hamlet-", (ii) allocates one primary shard and no replicas for each matching index
# Create the indices `hamlet2` and `hamlet_test`
# Verify that only `hamlet_test` applies the settings defined in `hamlet_template`
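
A sketch of how the template and the verification could look (template syntax as in v6.x):

PUT _template/hamlet_template
{
  "index_patterns": ["hamlet_*", "hamlet-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

PUT hamlet2
PUT hamlet_test

# Only `hamlet_test` should show the template settings
GET hamlet2/_settings
GET hamlet_test/_settings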

Index templates are not set in stone but should evolve along with your data. For example, why don’t you add a mapping to the template and specify the type of each field? (spoiler alert: we’re going to practice with mappings in the next blog post of this series).

# Update `hamlet_template` by defining a mapping for the type "_doc", so that (i) the type has three fields, named `speaker`, `line_number`, and `text_entry`, (ii) `text_entry` uses an "english" analyzer
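
One possible update of the template; the data types of `speaker` and `line_number` are my own choice here, as the task only constrains `text_entry`:

PUT _template/hamlet_template
{
  "index_patterns": ["hamlet_*", "hamlet-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "speaker":     { "type": "text" },
        "line_number": { "type": "text" },
        "text_entry":  { "type": "text", "analyzer": "english" }
      }
    }
  }
}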

Updates to an index template are not automatically reflected on the matching indices that already exist. This is because index templates are only applied once at index creation time.

# Verify that the updates in `hamlet_template` did not apply to the existing indices 
# In one request, delete both `hamlet2` and `hamlet_test` 
# Create the index `hamlet-1` and add some documents by running the following _bulk command
PUT hamlet-1/_doc/_bulk
{"index":{"_index":"hamlet-1","_id":0}}
{"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
{"index":{"_index":"hamlet-1","_id":1}}
{"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
{"index":{"_index":"hamlet-1","_id":2}}
{"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
{"index":{"_index":"hamlet-1","_id":3}}
{"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet our dear brothers death"}
# Verify that the mapping of `hamlet-1` is consistent with what is defined in `hamlet_template`

Finally, let’s talk about dynamic mapping and dynamic templates. Dynamic mapping is the capability of Elasticsearch to index a document without knowing its schema beforehand: Elasticsearch makes an educated guess about the data type of each field, so as to let you work with your data as soon as possible. Although dynamic mapping is useful during development, you should disable (or restrict) it in production for several reasons. For example, it can lead to mapping conflicts, to out-of-memory errors caused by mapping explosions, and, more generally, to a suboptimal use of your data and resources. Luckily, the alternative to dynamic mapping is not having to define the mapping of every field of every index by hand: dynamic templates come to the rescue!
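
As a quick illustration of the concept, here is a minimal sketch of a dynamic template that maps every new string field to an unanalysed `keyword`; the index name `my_index` and the template name `strings_as_keywords` are just placeholders:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": { "type": "keyword" }
          }
        }
      ]
    }
  }
}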

# Update `hamlet_template` so as to reject any document having a field that is not defined in the mapping 
# Verify that you cannot index the following document in `hamlet-1` 
PUT hamlet-1/_doc 
{ 
  "author": "Shakespeare" 
} 
# Update `hamlet_template` so as to enable dynamic mapping again
# Update `hamlet_template` so as to (i) dynamically map to an integer any field that starts by "number_", (ii) dynamically map to unanalysed text any string field
# Create the index `hamlet-2` and add a document by running the following command
POST hamlet-2/_doc/4
{
  "text_entry": "With turbulent and dangerous lunacy?",
  "line_number": "3.1.4",
  "number_act": "3",
  "speaker": "KING CLAUDIUS"
}
# Verify that the mapping of `hamlet-2` is consistent with what is defined in `hamlet_template`

Exercise 3

This last exercise is not for the faint of heart. You will practice with aliases, reindexing, and data pipelines. Roll up your sleeves and have fun!

# ** EXAM OBJECTIVE: INDEXING DATA **
# GOAL: Create an alias, reindex indices, and create data pipelines
# REQUIRED SETUP:
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting with `hamlet`

As usual, let’s begin by indexing some data.

# Create the indices `hamlet-1` and `hamlet-2`, each with two primary shards and no replicas
# Add some documents to `hamlet-1` by running the following _bulk command
PUT hamlet-1/_doc/_bulk
{"index":{"_index":"hamlet-1","_id":0}}
{"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
{"index":{"_index":"hamlet-1","_id":1}}
{"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
{"index":{"_index":"hamlet-1","_id":2}}
{"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
{"index":{"_index":"hamlet-1","_id":3}}
{"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet our dear brothers death"}
# Add some documents to `hamlet-2` by running the following _bulk command
PUT hamlet-2/_doc/_bulk
{"index":{"_index":"hamlet-2","_id":4}}
{"line_number":"2.1.1","speaker":"LORD POLONIUS","text_entry":"Give him this money and these notes, Reynaldo."}
{"index":{"_index":"hamlet-2","_id":5}}
{"line_number":"2.1.2","speaker":"REYNALDO","text_entry":"I will, my lord."}
{"index":{"_index":"hamlet-2","_id":6}}
{"line_number":"2.1.3","speaker":"LORD POLONIUS","text_entry":"You shall do marvellous wisely, good Reynaldo,"}
{"index":{"_index":"hamlet-2","_id":7}}
{"line_number":"2.1.4","speaker":"LORD POLONIUS","text_entry":"Before you visit him, to make inquire"}

An alias is a secondary name that you can assign to one or more indices. Simple as that. Why should you bother with aliases? A full answer goes beyond the scope of this article, but let me drop two teasers here: “reindexing with zero downtime” and “enhancing the usability of time-based data”.

# Create the alias `hamlet` that maps both `hamlet-1` and `hamlet-2` 
# Verify that the number of documents grouped by `hamlet` is 8
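
A sketch of one possible solution, using the _aliases API and then counting the documents through the alias:

POST _aliases
{
  "actions": [
    { "add": { "index": "hamlet-1", "alias": "hamlet" } },
    { "add": { "index": "hamlet-2", "alias": "hamlet" } }
  ]
}

GET hamlet/_count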

By default, if your alias includes more than one index, you cannot index documents using the alias name. But defaults can be overridden, if you know how.

# Configure `hamlet-1` to be the write index of the `hamlet` alias
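
One way to do it is the `is_write_index` flag (available since v6.4), sketched below:

POST _aliases
{
  "actions": [
    { "add": { "index": "hamlet-1", "alias": "hamlet", "is_write_index": true } }
  ]
}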

Honestly, how much did you enjoy writing a script in the first exercise? That was a rhetorical question: you are going to write another one anyhow.

# Add a document to `hamlet`, so that the document (i) has id "8", (ii) has "_doc" type, (iii) has a field `text_entry` with value "With turbulent and dangerous lunacy?", (iv) has a field `line_number` with value "3.1.4", (v) has a field `speaker` with value "KING CLAUDIUS"
# Create a script named `control_reindex_batch` and save it into the cluster state. The script checks whether a document has the field `reindexBatch`, and (i) in the affirmative case, it increments the field value by a script parameter named `increment`, (ii) otherwise, the script adds the field to the document setting its value to "1"
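
A possible sketch of the stored script, using `containsKey` to test for the field's presence:

PUT _scripts/control_reindex_batch
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.containsKey('reindexBatch')) { ctx._source.reindexBatch += params.increment } else { ctx._source.reindexBatch = 1 }"
  }
}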

In the first exercise, we could apply a script to all documents of an index by using the _update_by_query API. That was possible only because we were adding new fields and updating their values, but it won’t work if you are trying to remove fields or change their type. Why? Because Elasticsearch would no longer be certain about how to process the existing data, and your searches would no longer work as expected. How can you apply such changes, then? If you are not screaming “REINDEX!” yet, you should start now.

For a more efficient reindexing, a best practice is to temporarily configure the destination index to have no replicas and to disable refreshes by setting `index.refresh_interval` to -1, reverting both settings once the reindex has completed.
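
For example, once the destination index exists, the temporary settings could be applied like this (and reverted afterwards):

PUT hamlet-new/_settings
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}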

# Create the index `hamlet-new` with two primary shards and no replicas
# Reindex `hamlet` into `hamlet-new`, while satisfying the following criteria: (i) apply the `control_reindex_batch` script with the `increment` parameter set to "1", (ii) reindex using two parallel slices
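
A sketch of a possible reindex request; it references the stored script by its id and passes `increment` as a number:

POST _reindex?slices=2
{
  "source": { "index": "hamlet" },
  "dest": { "index": "hamlet-new" },
  "script": {
    "id": "control_reindex_batch",
    "params": { "increment": 1 }
  }
}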

Oh, one more task with index aliases.

# In one request, add `hamlet-new` to the alias `hamlet` and delete the `hamlet-1` and `hamlet-2` indices
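
One possible way to do everything in a single request is the _aliases API with `remove_index` actions, sketched below:

POST _aliases
{
  "actions": [
    { "add": { "index": "hamlet-new", "alias": "hamlet" } },
    { "remove_index": { "index": "hamlet-1" } },
    { "remove_index": { "index": "hamlet-2" } }
  ]
}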

The last topic of the “Indexing Data” Exam Objective is to “define and use an ingest pipeline that satisfies a given set of requirements, including the use of Painless to modify documents”. An ingest pipeline is a series of processors that enrich and transform data before indexing it into Elasticsearch. For a more narrative introduction to the topic, I recommend this article on the Elastic blog.

# Create a pipeline named `split_act_scene_line`. The pipeline splits the value of `line_number` using the dots as a separator, and stores the split values into three new fields named `number_act`, `number_scene`, and `number_line`, respectively
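
A sketch of one possible pipeline; the helper field `tmp_parts` is just a name I made up, and a single Painless script processor would work as well:

PUT _ingest/pipeline/split_act_scene_line
{
  "description": "Split line_number into act, scene, and line",
  "processors": [
    { "set":    { "field": "tmp_parts", "value": "{{line_number}}" } },
    { "split":  { "field": "tmp_parts", "separator": "\\." } },
    { "script": {
        "lang": "painless",
        "source": "ctx.number_act = ctx.tmp_parts[0]; ctx.number_scene = ctx.tmp_parts[1]; ctx.number_line = ctx.tmp_parts[2];"
      }
    },
    { "remove": { "field": "tmp_parts" } }
  ]
}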

To verify that an ingest pipeline works as expected, you can rely on the _simulate pipeline API.

# Test the pipeline on the following document
{
  "_source": {
    "line_number": "1.2.3"
  }
}
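
Wrapped into a complete _simulate request, the test looks like this:

POST _ingest/pipeline/split_act_scene_line/_simulate
{
  "docs": [
    { "_source": { "line_number": "1.2.3" } }
  ]
}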

Satisfied with the outcome? Go update your documents, then!

# Update all documents in `hamlet-new` by using the `split_act_scene_line` pipeline
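
A sketch of how to run it, relying on the `pipeline` parameter of the _update_by_query API:

POST hamlet-new/_update_by_query?pipeline=split_act_scene_line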

Conclusions

This blog post offered you three exercises to practice the “Indexing Data” Exam Objective of the Elastic Certified Engineer exam. As for all the exercises in this series, you can also find the instructions-only version in this GitHub repo.

Next time we’ll focus on mappings and text analyzers. Until then, have a great time!


Credits for cover image go to: Unsplash