Ship often! Ship fast! Ship safely! - That was the goal of the cooperation between ZTech, the digital department of the Ziegert Group, an established real estate company with more than 35 years of experience, and kreuzwerker. ZTech had successfully built a series of digital platforms for managing real estate assets over the course of the last several years and was ready to take it to the next level: Transforming the existing single-tenant systems into multi-tenant SaaS platforms.
kreuzwerker provides outstanding service and partnership to support our growth. By introducing good practices, infrastructure improvements and sharing their expertise, we were capable to scale up and achieve excellence. Bruno Jensen, Engineering Manager
The existing setup already had perfect preconditions: Modern cloud-native, event-driven serverless architectures, several cross-functional teams working together in an agile way and a clear product vision. However, the transformation from a single-tenant to a multi-tenant solution is not easy. Thus, ZTech asked kreuzwerker to help them prepare for the future: Create the technical foundation to allow for increased speed, throughput and quality in the delivery, and empower the existing development teams to deploy with confidence and be ready for further growth.
Teams from ZTech and kreuzwerker worked closely together to implement a range of measures spanning tools, processes and people, which made shipping faster, more frequent and safer. Software could be delivered faster because automation was increased with Infrastructure as Code, the CI/CD tooling was migrated to a more effective solution and cognitive load was reduced. Establishing the necessary preconditions and subsequently migrating from the traditional GitFlow process to Continuous Deployment allowed for more frequent delivery. Lastly, shifting left on security and empowering teams to take complete ownership and adopt a DevOps mindset led to safer delivery.
Remove blockers by implementing automation with Infrastructure as Code
One major factor causing a high lead time, i.e. the time it takes from code being committed to code running in production, was that changes often required manual modifications of the AWS infrastructure components. Full automation is often not a main concern when getting started with a new product, and might even turn out to be a misguided effort when the scope changes very frequently. However, at a certain point the system and its interdependencies become so complicated that executing manual changes correctly is very time-consuming. This is especially true for a serverless architecture, where infrastructure components such as queues, streams and databases make up a central part of the application logic itself and changes to the application code often need to be in sync with the infrastructure code and vice versa.
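To make the lead time metric concrete, the following minimal sketch computes the median lead time from pairs of commit and deployment timestamps. The `Change` type, the function names and the sample data are hypothetical and only illustrate the definition used above; they are not taken from ZTech's tooling.

```typescript
// Hypothetical record of one change: when it was committed and when it reached production.
interface Change {
  committedAt: Date;
  deployedAt: Date;
}

// Lead time of a single change in hours.
function leadTimeHours(change: Change): number {
  const ms = change.deployedAt.getTime() - change.committedAt.getTime();
  return ms / (1000 * 60 * 60);
}

// Median lead time over a set of changes; the median is less sensitive to outliers than the mean.
function medianLeadTimeHours(changes: Change[]): number {
  if (changes.length === 0) {
    throw new Error("at least one change is required");
  }
  const sorted = changes.map(leadTimeHours).sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative sample data (made up): lead times of 2, 24 and 4 hours.
const sampleChanges: Change[] = [
  { committedAt: new Date("2023-01-02T09:00:00Z"), deployedAt: new Date("2023-01-02T11:00:00Z") },
  { committedAt: new Date("2023-01-03T09:00:00Z"), deployedAt: new Date("2023-01-04T09:00:00Z") },
  { committedAt: new Date("2023-01-05T09:00:00Z"), deployedAt: new Date("2023-01-05T13:00:00Z") },
];

console.log(medianLeadTimeHours(sampleChanges)); // prints 4
```

Tracking a number like this before and after an automation effort is one way to check whether the lead time actually goes down.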
To achieve the desired level of automation, Infrastructure as Code with AWS Cloud Development Kit (CDK) and TypeScript was introduced. After using AWS Resource Manager to create an inventory of all the existing resources, they were either imported into existing stacks or created from scratch and put into production with appropriate migration and cutover processes. The infrastructure code was included in the application’s code repositories, and management of the resources’ complete lifecycle was integrated into the application’s build and deployment pipelines. Afterwards, all changes could be made in sync and rolled out in a repeatable, secure and reliable way, allowing for frequent and small changes by anyone.
Why AWS CDK as Infrastructure as Code tool?
When it comes to Infrastructure as Code tools, several viable alternatives exist, e.g. Terraform, Pulumi, AWS CDK or AWS CloudFormation. Identifying AWS CDK as the best tool for the job turned out to be quite straightforward in the case of ZTech due to the following reasons:
- Ease adoption: Both the React frontends and the Node.js backends were implemented with TypeScript, which happens to be a first-class citizen in AWS CDK. Thus, all developers were able to quickly get up to speed and feel comfortable from the start.
- Reduce overall complexity: One of the main frameworks already in place was the Serverless Framework, which uses AWS CloudFormation under the hood. Since AWS CDK is a native AWS tool that also relies on AWS CloudFormation, the total number of tools to manage and understand could be kept at a minimum.
- Support secure and stable infrastructure: The teams worked in an autonomous and cross-functional way, covering the complete application lifecycle without relying on a dedicated platform or DevOps team. Since AWS CDK provides a lot of abstractions with sensible defaults, it allows developers to easily and confidently create complex AWS infrastructure components adhering to best practices.
- Ensure long-term maintainability and stability: With ever increasing complexity in infrastructure setups, it becomes more and more important to treat them like any other piece of software. This is possible with AWS CDK since it allows infrastructure to be described programmatically, and comes with built-in test automation capabilities.
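The idea of abstractions with sensible defaults can be illustrated without the AWS CDK library itself. The sketch below is plain TypeScript, not the CDK API: the `secureQueue` helper, its options and the `Orders` queue are hypothetical. It simply emits CloudFormation-style `AWS::SQS::Queue` resources with encryption and a dead-letter queue enabled unless the caller overrides the defaults. A real CDK construct encapsulates the same kind of decisions.

```typescript
// Shape of a single CloudFormation resource, reduced to what this sketch needs.
interface CfnResource {
  Type: string;
  Properties: Record<string, unknown>;
}

// Hypothetical options; everything is optional so callers get safe defaults.
interface SecureQueueOptions {
  visibilityTimeoutSeconds?: number;
  maxReceiveCount?: number;
}

// Returns the queue plus its dead-letter queue, keyed by logical ID.
function secureQueue(id: string, options: SecureQueueOptions = {}): Record<string, CfnResource> {
  const { visibilityTimeoutSeconds = 30, maxReceiveCount = 3 } = options;
  const dlqId = `${id}DeadLetterQueue`;
  return {
    [dlqId]: {
      Type: "AWS::SQS::Queue",
      Properties: {
        SqsManagedSseEnabled: true,
        MessageRetentionPeriod: 1209600, // 14 days in seconds, the SQS maximum
      },
    },
    [id]: {
      Type: "AWS::SQS::Queue",
      Properties: {
        SqsManagedSseEnabled: true,
        VisibilityTimeout: visibilityTimeoutSeconds,
        // Messages that fail maxReceiveCount times are moved to the dead-letter queue.
        RedrivePolicy: {
          deadLetterTargetArn: { "Fn::GetAtt": [dlqId, "Arn"] },
          maxReceiveCount,
        },
      },
    },
  };
}

// Only one default is overridden; encryption and the dead-letter queue come for free.
const resources = secureQueue("Orders", { maxReceiveCount: 5 });
console.log(JSON.stringify(resources, null, 2));
```

Because the helper is ordinary code, it can be covered by unit tests like any other function, which is the property the last bullet point refers to.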
Increase developer productivity by replacing CI/CD tooling
Another unnecessary delay was caused by the tooling used for the build and deployment pipelines itself - AWS CodeBuild and AWS CodePipeline. While these two built-in AWS services are a great choice for small teams getting started on AWS, they are not the best fit for larger teams with more advanced processes.
Thus, kreuzwerker suggested switching to the native CI/CD offering of the version control system already being used - in this case GitHub and GitHub Actions. Not only did it offer a more advanced feature set, it also integrated smoothly into the teams’ regular development workflow, reducing context switches and thus increasing developer productivity. As the first step in the migration effort, kreuzwerker consultants moved one service’s pipeline as a proof-of-concept. After presenting it to the teams, discussing questions and concerns and resolving potential blockers, the teams were convinced by the advantages and in a shared effort, all pipelines were successfully moved to GitHub Actions, resulting in more flexible and faster pipelines.
Reduce cognitive load by reducing code complexity
The last major step to ship faster was to reduce overall complexity where possible, reducing teams’ cognitive load and making it easier for future developers to apply changes in a fast and confident way. Both the overall architecture approach as such, and the individual services’ implementations were analyzed. A thorough review by kreuzwerker’s Serverless experts came to the conclusion that the chosen architecture approach was indeed a good fit for the specific business problems that needed to be solved. Thus, it was not reconsidered or fundamentally changed. However, the analysis revealed several low-level parts that could be simplified.
Why is serverless a good fit?
- Serverless as an architecture approach is well suited to asynchronous, event-driven use cases. In the case of ZTech, it was a good match since critical application workflows connected different, partly third-party, systems, and the asynchronous, event-driven implementation with Amazon DynamoDB Streams, Amazon S3 Event Notifications, Amazon SQS and Amazon SNS allowed for a resilient implementation.
- Serverless as in Function-as-a-Service (FaaS), e.g. AWS Lambda, comes with a lot of benefits in terms of costs, scaling and performance. A pay-per-use model, automatic scaling based on demand and built-in high availability all spoke in favor of this approach after common traffic and usage patterns had been thoroughly analyzed and understood.
- Serverless as in “transparent servers” allows developers to focus on delivering business value instead of spending time on non-value adding tasks such as maintaining infrastructure, fixing preventable security issues or performing backup processes. This can be achieved with managed services such as AWS Fargate or Amazon Aurora Serverless, and is often a sensible default choice.
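As an illustration of the resilient, asynchronous style described above, the following sketch shows a Lambda-style handler for an Amazon SQS batch that reports only the failed messages back, so that successfully processed messages are not retried. The event and response shapes follow the SQS partial batch response format, which requires `ReportBatchItemFailures` to be enabled on the event source mapping. The types are declared locally to keep the example self-contained, and `processMessage` with its `assetId` payload is a hypothetical stand-in for the business logic.

```typescript
// Minimal local versions of the SQS event and partial batch response shapes.
interface SqsRecord {
  messageId: string;
  body: string;
}
interface SqsEvent {
  Records: SqsRecord[];
}
interface SqsBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Hypothetical business logic: parses the message and rejects incomplete payloads.
function processMessage(body: string): void {
  const payload = JSON.parse(body) as { assetId?: string };
  if (!payload.assetId) {
    throw new Error("assetId is missing");
  }
}

// Processes every record independently and collects the IDs of the failed ones.
function handleBatch(event: SqsEvent): SqsBatchResponse {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      processMessage(record.body);
    } catch (error) {
      console.error(`Message ${record.messageId} failed`, error);
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}

// Lambda entry point; async so that it can later await downstream calls.
const handler = async (event: SqsEvent): Promise<SqsBatchResponse> => handleBatch(event);
```

Only the messages listed in `batchItemFailures` become visible in the queue again, which keeps one malformed message from forcing a retry of the whole batch.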
kreuzwerker consultants used their preexisting knowledge of the given frameworks and their longtime AWS and Serverless experience to support the teams in replacing custom implementations with out-of-the-box solutions, identifying and removing obsolete parts of the code base and making use of more native features, which resulted in more maintainable code.
Resolve issues quickly by improving observability setup
After one of the major preconditions - automation - had been established, the only thing left to do before drastically increasing the deployment frequency was to tighten the observability setup. A good observability setup is crucial to maintaining a low mean time to restore (MTTR), especially when releasing every commit directly to production. All applications already had logs, metrics and tracing in place. But one main issue remained: Production issues were often detected by diligent manual checks after a release, which would no longer be feasible with more frequent deployments. Additionally, the frequency of production issues was expected to increase with more frequent deployments. To lay the foundation for speedy incident resolution, root cause analysis of production issues needed to become easier as well.
To detect failures and alert on them automatically, kreuzwerker collaborated with ZTech to define and implement Amazon CloudWatch alarms based on technical and business KPIs. The alarms were integrated into Microsoft Teams so that developers on call were immediately notified in case something went wrong. To ease root cause analysis, dashboards were created based on default and custom metrics, log collection was standardized and noise-generating false positives were significantly reduced by carefully reviewing and adapting the handling of error scenarios. With all this in place, everyone involved felt ready to move to more frequent deployments.
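How a threshold-based alarm decides to fire can be sketched in a few lines. The following is a simplified model of an "M out of N datapoints" alarm, loosely following the CloudWatch alarm parameters `Threshold`, `EvaluationPeriods` and `DatapointsToAlarm`. It is not CloudWatch's actual implementation (for example, it ignores the treatment of missing data), and the metric values are made up.

```typescript
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

interface AlarmConfig {
  threshold: number; // a datapoint breaches when it is greater than this value
  evaluationPeriods: number; // N: how many of the most recent datapoints are considered
  datapointsToAlarm: number; // M: how many of those must breach to trigger the alarm
}

// Evaluates the most recent datapoints (ordered oldest first) against the configuration.
function evaluateAlarm(datapoints: number[], config: AlarmConfig): AlarmState {
  if (datapoints.length < config.evaluationPeriods) {
    return "INSUFFICIENT_DATA";
  }
  const recent = datapoints.slice(-config.evaluationPeriods);
  const breaching = recent.filter((value) => value > config.threshold).length;
  return breaching >= config.datapointsToAlarm ? "ALARM" : "OK";
}

// Hypothetical KPI: failed requests per minute; alarm if 2 of the last 3 minutes exceed 5.
const errorAlarm: AlarmConfig = { threshold: 5, evaluationPeriods: 3, datapointsToAlarm: 2 };

console.log(evaluateAlarm([0, 1, 7, 0, 9], errorAlarm)); // prints "ALARM"
console.log(evaluateAlarm([0, 1, 7, 0, 0], errorAlarm)); // prints "OK"
```

Requiring several breaching datapoints instead of a single one is one common way to reduce the noise from short-lived spikes.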
Go all the way by introducing continuous deployment
With all the previous measures successfully completed, the applications were moved from the previous GitFlow process to a Continuous Deployment process. On a technical level, this simply boiled down to deleting the develop branch, making the main branch the default branch and adapting the deployment pipeline to deploy straight to production after the deployment to staging was successful.
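The resulting pipeline logic, deploying to production only after the staging deployment succeeded, can be sketched as follows. This is an illustration in TypeScript rather than an actual GitHub Actions workflow: the `Stage` type and the `runPipeline` function are hypothetical, and deployments are modeled as synchronous functions for brevity.

```typescript
// A stage has a name and a deploy step that reports success (true) or failure (false).
interface Stage {
  name: string;
  deploy: () => boolean;
}

// Runs the stages in order and stops at the first failure,
// so later stages (e.g. production) are never reached after a failed one.
// Returns the names of the stages that were deployed successfully.
function runPipeline(stages: Stage[]): string[] {
  const deployed: string[] = [];
  for (const stage of stages) {
    if (!stage.deploy()) {
      break;
    }
    deployed.push(stage.name);
  }
  return deployed;
}

// Every commit on the main branch would trigger this sequence.
const pipeline: Stage[] = [
  { name: "staging", deploy: () => true },
  { name: "production", deploy: () => true },
];

console.log(runPipeline(pipeline)); // prints [ 'staging', 'production' ]
```

The staging deployment acts as the only gate: there is no release branch and no manual approval between a merged commit and production.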
Keep production failures low by shifting left on security
With the now significantly raised tempo, the stability of the system might be in danger. Even though automation, observability and simplification already acted as risk mitigators, additional actions were taken to be on the safe side. One of them was the adoption of a shift left approach in terms of security. To keep the change fail percentage low, the aim is to catch bugs as early as possible and preferably not in production. This is especially critical for security-related issues. To support the shift left approach, several actions were taken: making it close to impossible to perform insecure operations by putting guardrails in place; making it as hard as possible to go against best practices by providing AWS CDK blueprints with sensible defaults and reference implementations for IAM policies adhering to the principle of least privilege; and making it easy to always stay ahead by adding automatic security alerts and updates with GitHub's Dependabot feature.
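A guardrail of this kind can be as simple as an automated check that rejects overly broad IAM policies before they are deployed. The sketch below is a hypothetical, heavily simplified check: it flags `Allow` statements whose `Action` is a bare or service-wide wildcard (such as `s3:*`) or whose `Resource` is `*`. The policy document structure follows the IAM JSON policy format; the function names, the account ID and the sample policy are made up for illustration.

```typescript
interface Statement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: Statement[];
}

// Normalizes the "string or list of strings" form used by IAM policies.
function toArray(value: string | string[]): string[] {
  return Array.isArray(value) ? value : [value];
}

// Returns one human-readable finding per violation; an empty list means the check passed.
function findWildcardViolations(policy: PolicyDocument): string[] {
  const findings: string[] = [];
  policy.Statement.forEach((statement, index) => {
    if (statement.Effect !== "Allow") {
      return; // Deny statements with wildcards are not a least-privilege concern here
    }
    if (toArray(statement.Action).some((action) => action === "*" || /:\*$/.test(action))) {
      findings.push(`Statement ${index}: wildcard action`);
    }
    if (toArray(statement.Resource).some((resource) => resource === "*")) {
      findings.push(`Statement ${index}: wildcard resource`);
    }
  });
  return findings;
}

// Made-up policy: the first statement is narrowly scoped, the second is too broad.
const samplePolicy: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["dynamodb:GetItem", "dynamodb:PutItem"],
      Resource: "arn:aws:dynamodb:eu-central-1:123456789012:table/example",
    },
    { Effect: "Allow", Action: "s3:*", Resource: "*" },
  ],
};

console.log(findWildcardViolations(samplePolicy));
// prints [ 'Statement 1: wildcard action', 'Statement 1: wildcard resource' ]
```

Run as part of the build pipeline, a check like this fails fast on the pull request instead of surfacing as a finding in production.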
Ensure longevity by increasing confidence and trust
Last but not least, shipping safely is also a matter of feeling confident and of trusting in both the team’s and the system’s capabilities. This might turn out to be more challenging than expected if team members are faced with an unfamiliar technology stack and have little insight into the reasoning behind initial decisions. To increase the level of confidence across all teams and to set the foundation for future growth, kreuzwerker invested heavily in knowledge sharing and enablement.
Several formal activities such as bi-weekly interactive brown bag sessions and a DevOps seat rotation were established to provide theoretical knowledge combined with hands-on experience. However, the main lever was the continuous collaboration: kreuzwerker consultants were embedded in the ZTech development teams, initiated and participated in mob and pair programming sessions, and were always available for on-demand mentoring sessions and individual coaching. Over time, the teams became more confident and kreuzwerker gradually stepped back, being replaced by ZTech developers who were now able to act as multipliers on topics such as the DevOps mindset, specific technical details of the services in use and AWS best practices.
With the tooling we have in place now, largely thanks to kreuzwerker, it will become possible for us to improve speed, throughput, and quality. Jonathan Hansen, Head of Engineering & Agile
The combination of all the steps - among them increased automation, increased developer productivity and increased confidence - set ZTech up for the future: delivering high-quality software at a high pace for the ambitious product roadmap and the technical challenges ahead. The close collaboration between ZTech and kreuzwerker left the teams empowered to own and expand their existing Serverless stack on AWS, the introduced DevOps tools and the newly established processes and approaches. Teams at ZTech are prepared to ship fast, often and safely for years to come.