Exercises for the Elastic Certified Engineer Exam: Model Data into Elasticsearch

With the exercises in this blog post you should have a clear overview of the topics covered in the “Mappings and Text Analysis” Exam Objective of the Elastic certification exam.

In the previous posts (#1 and #2) of this series, I proposed several exercises in preparation for the Elastic Certified Engineer exam. So far, we have practised operating and indexing data into an Elasticsearch cluster. Today we will work on data mapping and text analysis.

DISCLAIMER: All exercises in this series are based on the Elasticsearch version currently used for the exam, that is, v6.5. Please always refer to the Elastic certification FAQs to check the latest exam version.

Mappings and Text Analysis

The mapping of an index describes the schema of its documents and how to search them. In practice, a mapping tells Elasticsearch what the data types of a document’s fields are, which fields are searchable, how to analyse a certain text field, and so forth. This section will help you build confidence with all these settings and features.

Three practical pointers before we start. First, the exercise instructions are framed in code blocks: I recommend copying and pasting them into your Kibana Dev Tools and working directly from there. Second, all exercises require a running cluster to work on. Check out my elastic-training-repo on GitHub to start off. Third, our training dataset consists of lines from Hamlet by William Shakespeare; each document has the following structure:

{
  "line_number": "String",
  "speaker": "String",
  "text_entry": "String",
}

All set, ready, go!

Exercise 1

In this exercise, you will specify a mapping for the training dataset. Nothing too fancy, it’s going to be easy.

# ** EXAM OBJECTIVE: MAPPINGS AND TEXT ANALYSIS **
# GOAL: Create a mapping that satisfies a given set of requirements
# REQUIRED SETUP:
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting by `hamlet`

After creating an index, define the data types of its fields in the mapping. You can find some examples in the Elastic documentation.

⚠ : Indices in Elasticsearch used to support multiple subpartitions called types, which have been deprecated in v7.x and will be completely removed in v8.x (see “Removal of mapping types”). In Elasticsearch v6.x, an index can have only one type, preferably named _doc.

# Create the index `hamlet_1` with one primary shard and no replicas
# Define a mapping for the default type "_doc" of `hamlet_1`, so that (i) the type has 
  three fields, named `speaker`, `line_number`, and `text_entry`, (ii) `speaker` and 
  `line_number` are unanalysed strings
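If you want to check your work, here is a possible solution - just a sketch, assuming the keyword datatype for the unanalysed strings and text for `text_entry`:

# create the index and its mapping in a single request
PUT hamlet_1
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "speaker":     { "type": "keyword" },
        "line_number": { "type": "keyword" },
        "text_entry":  { "type": "text" }
      }
    }
  }
}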

Furthermore, I want you to disable aggregations on the “line_number” field, as I couldn’t think of any valuable statistic that we could get out of unique, progressive line numbers. Note that by disabling aggregations, you are going to save some resources.

# Update the mapping of `hamlet_1` by disabling aggregations on `line_number`
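Aggregations (and sorting) on a keyword field rely on doc values, so disabling them means turning doc values off. Here is a sketch of the request; be aware that Elasticsearch normally rejects changes to `doc_values` on a field that is already mapped, in which case you can delete the still-empty `hamlet_1` and recreate it with this parameter in place:

PUT hamlet_1/_mapping/_doc
{
  "properties": {
    "line_number": {
      "type": "keyword",
      "doc_values": false
    }
  }
}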

Let’s store some data in the index.

# Add some documents to `hamlet_1` by running the following _bulk command
PUT hamlet_1/_doc/_bulk
{"index":{"_index":"hamlet_1","_id":0}}
{"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
{"index":{"_index":"hamlet_1","_id":1}}
{"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
{"index":{"_index":"hamlet_1","_id":2}}
{"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
{"index":{"_index":"hamlet_1","_id":3}}
{"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet     our dear brothers death"}
{"index":{"_index":"hamlet_1","_id":4}}
{"line_number":"1.2.2","speaker":"KING CLAUDIUS","text_entry":"The memory be green,    and that it us befitted"}

Do you remember the last time you changed a datatype in an existing index? You don’t, right? That’s because it is forbidden in most cases, so as to prevent you from applying inconsistent settings to existing data. The few exceptions include adding new fields to a document and adding a multi-field to an existing field. In all other cases, the way to update your mapping is the Reindex API - if you came here from my previous blog post, you have already practised with it. Speaking of multi-fields, this is a pretty neat feature of Elasticsearch, which allows you to index the same field in different ways for different purposes. Elastic published a good webinar about it.

Enough talking - let’s put what we have just discussed into practice.

# Create the index `hamlet_2` with one primary shard and no replicas 
# Copy the mapping of `hamlet_1` into `hamlet_2`, but also define a multi-field for
  `speaker`. The name of such multi-field is `tokens` and its data type is the
  (default) analysed string
# Reindex `hamlet_1` to `hamlet_2`
# Verify that full-text queries on "speaker.tokens" are enabled on `hamlet_2` by
  running the following command: 
GET hamlet_2/_search 
{
  "query": {
    "match": { "speaker.tokens": "hamlet" }
}}
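One possible solution - a sketch in which the mapping mirrors `hamlet_1` and the `tokens` multi-field is a text field with the default (standard) analyzer:

PUT hamlet_2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "speaker": {
          "type": "keyword",
          "fields": {
            "tokens": { "type": "text" }
          }
        },
        "line_number": { "type": "keyword", "doc_values": false },
        "text_entry":  { "type": "text" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "hamlet_1" },
  "dest":   { "index": "hamlet_2" }
}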

Exercise 2

In the Elasticsearch world, you are highly encouraged to avoid relational data as much as possible, mainly for performance reasons. Nested objects in your documents slow down your queries significantly, and parent/child relationships are even worse: besides the additional indexing complexity, they can lead to queries up to ten times slower than their nested-object equivalent. That said, there are use cases in which such relationships are preferable (see the flowchart below), especially when index-time performance is more important than query-time performance.

How to Model Relational Data in Elasticsearch (Image Source).

The next exercise will make sure that you can model relational data in Elasticsearch.

# ** EXAM OBJECTIVE: MAPPINGS AND TEXT ANALYSIS **
# GOAL: Model relational data
# REQUIRED SETUP:
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting by `hamlet`

We are going to index a different type of document from the ones we’ve seen so far. These documents contain information about Hamlet’s characters, notably their name and their relationships with other characters in the play. The document structure is as follows:

{
  "name": "String",
  "relationship": [{
    "name": "String",
    "type": "String" 
  }]
}

Let’s index some data. In particular, let’s specify that Hamlet has a friend named Horatio and is the son of Gertrude, and that King Claudius is the uncle of Hamlet.

# Create the index `hamlet_1` with one primary shard and no replicas
# Add some documents to `hamlet_1` by running the following _bulk command
PUT hamlet_1/_doc/_bulk
{"index":{"_index":"hamlet_1","_id":"C0"}}
{"name":"HAMLET","relationship":[{"name":"HORATIO","type":"friend"},{"name":"GERTRUDE","type":"mother"}]}
{"index":{"_index":"hamlet_1","_id":"C1"}}
{"name":"KING CLAUDIUS","relationship":[{"name":"HAMLET","type":"nephew"}]}

How many friends of Gertrude have we defined? None, right? Well, not quite.

# Verify that the items of the `relationship` array cannot be searched independently - 
  e.g., searching for a friend named Gertrude will return 1 hit
GET hamlet_1/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "relationship.name": "gertrude" } },
        { "match": { "relationship.type": "friend" } }
      ]
}}}

Why is that? Short answer: because of the way Elasticsearch flattens arrays of objects when storing them into Lucene. A longer explanation - and a very informative read - is in the Elastic documentation. To achieve correct cross-object matching, you need to change the datatype of the “relationship” field to “nested”.

⚠ : You cannot change the datatype of a field that has already been mapped. You will need to create a new index with the updated mapping, and reindex the old index into it.

# Create the index `hamlet_2` with one primary shard and no replicas
# Define a mapping for the default type "_doc" of `hamlet_2`, so that the inner objects 
  of the `relationship` field (i) can be searched independently, (ii) have only 
  unanalysed fields
# Reindex `hamlet_1` to `hamlet_2`
# Verify that the items of the `relationship` array can now be searched independently - e.g., searching for a friend named Gertrude will return no hits 
GET hamlet_2/_search 
{
  "query": {
    "nested": {
      "path": "relationship",
      "query": {
        "bool": {
          "must": [
            { "match": { "relationship.name": "gertrude" }},
            { "match": { "relationship.type":  "friend" }} 
          ]
}}}}}
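A sketch of one possible solution, assuming the keyword datatype for the unanalysed inner fields (mapping the top-level `name` as keyword too is my own choice):

PUT hamlet_2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": { "type": "keyword" },
        "relationship": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword" },
            "type": { "type": "keyword" }
          }
        }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "hamlet_1" },
  "dest":   { "index": "hamlet_2" }
}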

So far, “hamlet_2” contains only documents related to Hamlet’s characters. Let’s add more documents representing lines of the play.

# Add more documents to `hamlet_2` by running the following _bulk command
PUT hamlet_2/_doc/_bulk
{"index":{"_index":"hamlet_2","_id":L0}}
{"line_number":"1.4.1","speaker":"HAMLET","text_entry":"The air bites shrewdly; it is very cold."}
{"index":{"_index":"hamlet_2","_id":L1}}
{"line_number":"1.4.2","speaker":"HORATIO","text_entry":"It is a nipping and an eager air."}
{"index":{"_index":"hamlet_2","_id":L2}}
{"line_number":"1.4.3","speaker":"HAMLET","text_entry":"What hour now?"}

A character can have many lines. You can think of it as a one-to-many relationship between character documents and line documents. In Elasticsearch, you can model it as a parent/child relation by using the join datatype, where the character documents play the parent role. I’m going to ask you to model such a relation into the mapping of a new index.

# Create the index `hamlet_3` with only one primary shard and no replicas
# Copy the mapping of `hamlet_2` into `hamlet_3`, but also add a join field to define 
  a relation between a `character` (the parent) and a `line` (the child). The name of 
  such field is "character_or_line" 
# Reindex `hamlet_2` to `hamlet_3`
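A sketch of the join field definition. The mapping carries over the fields of `hamlet_2`, including those added dynamically when we bulk-indexed the line documents; the datatypes I chose for them are an assumption:

PUT hamlet_3
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": { "type": "keyword" },
        "relationship": {
          "type": "nested",
          "properties": {
            "name": { "type": "keyword" },
            "type": { "type": "keyword" }
          }
        },
        "speaker":     { "type": "keyword" },
        "line_number": { "type": "keyword" },
        "text_entry":  { "type": "text" },
        "character_or_line": {
          "type": "join",
          "relations": { "character": "line" }
        }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "hamlet_2" },
  "dest":   { "index": "hamlet_3" }
}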

At the moment, the “character_or_line” relation exists only in the mapping, not in the data. We are going to create a script to fix that. The script will update all the line documents associated with a given character, which is specified as a script parameter. The update consists of setting the “character_or_line” join field in such a way as to bind the line documents to the right character document. For example, consider the following line document.

{
  "line_number": "1.2.1",
  "speaker":"KING CLAUDIUS",
  "text_entry":"Though yet of Hamlet our dear brothers death"
}

Given that the character document of King Claudius has been indexed with id “C1”, after the script execution the document will have the new field below:

"character_or_line": { "name": "line", "parent": "C1" }

All right, let’s create and apply the script.

# Create a script named `init_lines` and save it into the cluster state. The script 
  (i)   has a parameter named `characterId`,  
  (ii)  adds the field `character_or_line` to the document,  
  (iii) sets the value of `character_or_line.name` to "line" ,  
  (iv)  sets the value of `character_or_line.parent` to the value of the `characterId` 
        parameter
# Update the document with id `C0` (i.e., the character document of Hamlet) by adding 
  the field `character_or_line` and setting its `character_or_line.name` value to 
  "character" 
# Update the documents in `hamlet_3` that have "HAMLET" as a `speaker`, by running the 
  `init_lines` script with `characterId` set to "C0"
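A possible implementation of the three steps - the Painless map literal sets both subfields of the join field at once:

# store the script in the cluster state
PUT _scripts/init_lines
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.character_or_line = ['name': 'line', 'parent': params.characterId]"
  }
}

# turn the Hamlet character document into a parent
POST hamlet_3/_doc/C0/_update
{
  "doc": { "character_or_line": { "name": "character" } }
}

# bind all of Hamlet's lines to their parent document
POST hamlet_3/_update_by_query
{
  "query": { "match": { "speaker": "HAMLET" } },
  "script": { "id": "init_lines", "params": { "characterId": "C0" } }
}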

⚠ : A join datatype works only if the parent document and all its children are indexed on the same shard (see “Parent-Join Restrictions”). For this reason, when indexing a child document, you typically have to set its routing value to the id of its parent. However, this wasn’t necessary in our exercise, because we created an index with only one shard.

We can use the has_parent query to check whether our script was successful.

# Verify that the last operation was successful by running the query below
GET hamlet_3/_search
{
  "query": {
    "has_parent": {
      "parent_type": "character",
      "query": {
        "match": { "name": "HAMLET" }
      }
}}}

Exercise 3

Elasticsearch supports several strategies to analyse text fields and unleash the power of full-text searches. Furthermore, it provides a simple, modular mechanism to build analyzers that best fit your needs. In this last exercise, you’ll practice with both built-in and custom analyzers.

# ** EXAM OBJECTIVE: MAPPINGS AND TEXT ANALYSIS **
# GOAL: Add built-in text analyzers and specify a custom one
# REQUIRED SETUP:
  (i)   a running Elasticsearch cluster with at least one node and a Kibana instance,
  (ii)  the cluster has no index with name `hamlet`, 
  (iii) the cluster has no template that applies to indices starting by `hamlet`

Elasticsearch ships with a lot of built-in analyzers to process your text data. Among these, language analyzers take into account the characteristics - e.g., stopwords, stemming - of a specific language. The languages supported by Elasticsearch are many, and - surprise, surprise - English is among them.

# Create the index `hamlet_1` with one primary shard and no replicas
# Define a mapping for the default type "_doc" of `hamlet_1`, so that (i) the type has 
  three fields, named `speaker`, `line_number`, and `text_entry`, (ii) `text_entry` is 
  associated with the language "english" analyzer
# Add some documents to `hamlet_1` by running the following _bulk command
PUT hamlet_1/_doc/_bulk
{"index":{"_index":"hamlet_1","_id":0}}
{"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
{"index":{"_index":"hamlet_1","_id":1}}
{"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
{"index":{"_index":"hamlet_1","_id":2}}
{"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
{"index":{"_index":"hamlet_1","_id":3}}
{"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet our dear brothers death"}

If no built-in analyzer fulfills your needs, you can always create a custom one. As an Elastic Certified Engineer, you are expected to know how to do it.

⚠ : During the exam, I highly encourage you to test your custom analyzer on synthetic data by using the _analyze API.
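For instance, you can feed a sample string either to a named analyzer or to an ad-hoc chain of char filters, tokenizer, and token filters, and inspect the resulting tokens:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "Long live the king!"
}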

# Create the index `hamlet_2` with one primary shard and no replicas
# Add to `hamlet_2` a custom analyzer named `shy_hamlet_analyzer`, consisting of 
 (i)   a char filter to replace the characters "Hamlet" with "[CENSORED]",  
 (ii)  a tokenizer to split tokens on whitespaces and colons,  
 (iii) a token filter to ignore any token with less than 5 characters 
# Define a mapping for the default type "_doc" of `hamlet_2`, so that (i) the type has 
  one field named `text_entry`, (ii) `text_entry` is associated with the 
  `shy_hamlet_analyzer` created in the previous step
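Here is one way to put the pieces together: a mapping char filter, a pattern tokenizer, and a length token filter (the component names `censor_hamlet`, `split_on_ws_and_colons`, and `longer_than_4` are made up for illustration). Defining the analysis settings at index creation time also spares you from closing and reopening the index, which updating analyzers otherwise requires:

PUT hamlet_2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "char_filter": {
        "censor_hamlet": {
          "type": "mapping",
          "mappings": [ "Hamlet => [CENSORED]" ]
        }
      },
      "tokenizer": {
        "split_on_ws_and_colons": {
          "type": "pattern",
          "pattern": "[\\s:]+"
        }
      },
      "filter": {
        "longer_than_4": {
          "type": "length",
          "min": 5
        }
      },
      "analyzer": {
        "shy_hamlet_analyzer": {
          "type": "custom",
          "char_filter": [ "censor_hamlet" ],
          "tokenizer": "split_on_ws_and_colons",
          "filter": [ "longer_than_4" ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "text_entry": {
          "type": "text",
          "analyzer": "shy_hamlet_analyzer"
        }
      }
    }
  }
}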

You’re almost done! The last task is to apply the custom analyzer to real data.

# Reindex the `text_entry` field of `hamlet_1` into `hamlet_2`
# Verify that documents have been reindexed to `hamlet_2` as expected - e.g., by 
  searching for "[CENSORED]" in the `text_entry` field
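A sketch of these final steps, using source filtering in the Reindex API. Note that a match query analyses the query string with the same analyzer, so searching for "Hamlet" would be censored too and still match:

POST _reindex
{
  "source": {
    "index": "hamlet_1",
    "_source": [ "text_entry" ]
  },
  "dest": { "index": "hamlet_2" }
}

GET hamlet_2/_search
{
  "query": {
    "match": { "text_entry": "[CENSORED]" }
  }
}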

Conclusions

With the exercises in this blog post you should have a clear overview of the topics covered in the “Mappings and Text Analysis” Exam Objective of the Elastic certification exam. As always, the instructions-only versions of the exercises are available in this GitHub repo.

The next blog post of this series will also be the last one, and we’ll focus on queries and aggregations. See you in a few weeks!


Credits for cover image go to: Unsplash