How monitoring can improve the security and availability of systems

Customers are affected by availability and security issues in their applications and infrastructure. This case study describes how we at kreuzwerker use monitoring solutions to improve availability and overall security.

06.12.2023

Partner

Lead Contact

Fabian Duft

blog@kreuzwerker.de

For customers, problems with the availability and security of their applications and infrastructure are a nuisance.

In this case study, we show how we at kreuzwerker use monitoring solutions such as Datadog to improve availability and data security for our customers, leading to greater customer satisfaction and a strong relationship between kreuzwerker and our customers.

Availability

Problem

One of our customer’s major vulnerabilities is potential downtime caused by problems with EC2 instances or application availability, such as those stemming from DoS attacks.

A specific incident highlighting this vulnerability occurred in early October, at 2:23 am on a Saturday: a Confluence outage prompted the on-call kreuzwerker operator to take action.

Solution

To maintain a robust system, we have established an on-call schedule for our customer, providing round-the-clock support in the event of any incidents. We monitor our infrastructure continuously (24/7), focusing on EC2 instance metrics such as CPU usage, disk usage, and available RAM. This monitoring is carried out with our dedicated monitoring tools, Datadog and New Relic.
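The kind of threshold check that such monitoring performs can be illustrated with a minimal Python sketch. It averages the most recent samples of each metric and reports the metrics whose average crosses a limit. The metric names, thresholds, and window size are illustrative assumptions, not the values configured in Datadog or New Relic for this customer.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rule:
    """One alerting rule. All values used below are hypothetical."""
    metric: str
    threshold: float
    alert_when_above: bool = True  # False means: alert when the value drops below
    window: int = 5                # number of most recent samples to average


# Assumed rules for illustration only.
RULES = [
    Rule("cpu_percent", 90.0),
    Rule("disk_percent", 85.0),
    Rule("available_ram_mb", 512.0, alert_when_above=False),
]


def is_breached(rule, samples):
    """Return True if the average of the last `rule.window` samples crosses the threshold."""
    recent = samples[-rule.window:]
    if not recent:
        return False  # no data, no alert in this simplified sketch
    average = mean(recent)
    if rule.alert_when_above:
        return average > rule.threshold
    return average < rule.threshold


def breached_metrics(samples_by_metric):
    """Return the names of all metrics whose rule is currently breached, in rule order."""
    return [
        rule.metric
        for rule in RULES
        if is_breached(rule, samples_by_metric.get(rule.metric, []))
    ]
```

Averaging over a short window rather than reacting to a single sample is a common way to avoid alerts on brief spikes; a real monitoring tool offers further options such as recovery thresholds and no-data handling.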

Furthermore, we closely monitor specific metrics related to the customer’s Atlassian Java applications, including swapping, garbage collection, and heap allocations. If an issue occurs, our Datadog monitors trigger alerts, which are integrated with our messaging platform Slack and our operations tool OpsGenie. This integration ensures that the on-call employee promptly receives the necessary notifications, as can be seen below:

[Figure: alert notifications]
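As a rough illustration of how a monitor can be tied to notification channels, the following Python sketch builds a dictionary shaped like a Datadog metric monitor definition. It makes no API calls. The metric name, scope tag, threshold, and the `@slack-...` and `@opsgenie-...` handles are hypothetical placeholders, and the exact fields should be checked against the Datadog documentation before use.

```python
def build_monitor(name, metric, scope, comparator, critical, notify):
    """Build a dict resembling a Datadog metric monitor definition.

    `notify` is a list of notification handles that are appended to the
    message, which is how alerts are routed to integrations in this sketch.
    """
    # Query layout: aggregate over the last 5 minutes and compare to the threshold.
    query = f"avg(last_5m):avg:{metric}{{{scope}}} {comparator} {critical}"
    handles = " ".join(notify)
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        # {{host.name}} is left as a template variable for the monitoring tool to fill in.
        "message": f"{name} on {{{{host.name}}}} {handles}",
        "options": {"thresholds": {"critical": critical}},
    }


# Hypothetical example: alert on high JVM heap usage for a Confluence service.
heap_monitor = build_monitor(
    name="Confluence heap usage high",
    metric="jvm.heap_memory",            # assumed metric name
    scope="service:confluence",          # assumed tag
    comparator=">",
    critical=3500000000,                 # assumed threshold in bytes
    notify=["@slack-ops-alerts", "@opsgenie-on-call"],  # assumed handles
)
```

Keeping the routing handles in the monitor message means that the same definition decides both when an alert fires and who is notified.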

In the incident described above, a swift restart provided a temporary resolution, bringing Confluence back to normal operation after 8 minutes. Ongoing monitoring revealed a continuous increase in CPU usage and load on the Confluence instance, which had ultimately led to the outage. A detailed investigation of the logs showed that numerous requests from a user were triggering high-load tasks, notably PDF exports, through the proxy to Confluence:

[Figure: log analysis of the user's requests]
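A log investigation of this kind can be approximated with a short Python sketch that counts PDF export requests per user in proxy access logs. The log line format (a combined-log-style layout with the user name in the third field) and the `pdfpageexport.action` path marker are assumptions made for illustration; the actual proxy format and the tooling used in the investigation are not described here.

```python
import re
from collections import Counter

# Assumed line layout: ip ident user [timestamp] "METHOD path PROTOCOL" status size
LINE = re.compile(
    r'^\S+ \S+ (?P<user>\S+) \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+'
)

# Assumed marker identifying a PDF export request in the path.
PDF_EXPORT_MARKER = "pdfpageexport.action"


def pdf_exports_per_user(lines):
    """Count PDF export requests per user; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and PDF_EXPORT_MARKER in match.group("path"):
            counts[match.group("user")] += 1
    return counts
```

Calling `pdf_exports_per_user(log_lines).most_common(5)` on an iterable of log lines would list the five users with the most export requests, which is usually enough to spot a single automated client.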

Although we notified the user and the customer via email and a Service Desk ticket, the requests continued for several hours after the incident.

To prevent any additional outages, the user was subsequently blocked, which effectively resolved the performance issues.

Result

The customer subsequently endorsed our process, clarifying that the user’s requests were unintentional and automated.

The round-the-clock monitoring of the customer’s infrastructure and applications therefore contributed to a consistently stable system, as is evident from the subsequent uptime report:

[Figure: uptime report]