Have you ever seriously tested all the sophisticated resilience features of your cloud setup? Our DevOps teams invest a lot to achieve things like auto-scaling, failover and zero-downtime deployments. Infrastructure is coded with, for example, CloudFormation templates, Chef cookbooks or Dockerfiles, and it is monitored with complex metrics that feed notification and action chains so that we can meet the SLAs for our customers. But all this is worthless if you never test it and prove that the concepts and setups actually work. It turns out that testing cloud infrastructure isn't as easy as testing applications, because you have to manipulate technical infrastructure components that are normally out of your scope and abstracted away. Luckily, Netflix provides a toolchain to tackle some of these challenges.
### The Simian Army
As described on their wiki, the Simian Army consists of services (Monkeys) in the cloud for generating various kinds of failures, detecting abnormal conditions, and testing the ability to survive them. Of the different monkeys, the Chaos Monkey is the most useful for us at the moment. It supports AWS and vSphere as IaaS backends and brings some very helpful types of chaos into the infrastructure. By the way, even reading the source is great fun thanks to the creative naming. One of my favorites is the boolean property isBurnMoneyEnabled, which controls how far your monkey is allowed to disturb the peace.
Please read the well-written Quick-Start-Guide to get the monkey up and running. Some of the strategies can cause real trouble for a lot of existing off-the-shelf applications. You can mix your own cocktail of gifts in the chaos.properties file.
    #Strategies
    simianarmy.chaos.shutdowninstance.enabled = true
    simianarmy.chaos.blockallnetworktraffic.enabled = true
    simianarmy.chaos.burncpu.enabled = false
    simianarmy.chaos.killprocesses.enabled = false
    simianarmy.chaos.nullroute.enabled = false
    simianarmy.chaos.failapi.enabled = false
    simianarmy.chaos.faildns.enabled = true
    simianarmy.chaos.faildynamodb.enabled = false
    simianarmy.chaos.fails3.enabled = false
    simianarmy.chaos.networkcorruption.enabled = true
    simianarmy.chaos.networklatency.enabled = true
    simianarmy.chaos.networkloss.enabled = true
As you can see, we like to use all the network problems that can occur. We use this mixture for one of our customers, who needs applications that sync data via unreliable satellite connections. In fact, their cruise liners' data centers are moving targets rolling around the globe.
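To see at a glance which strategies a given mixture actually switches on, a few lines of Python are enough. The following is only a minimal sketch and not part of the Simian Army: the function name `enabled_strategies` is hypothetical, and it assumes nothing beyond the `simianarmy.chaos.<strategy>.enabled` key pattern used in the snippet above. Any other key that happens to match this pattern would be reported as well.

```python
# Minimal sketch: list the chaos strategies that a properties text enables.
# Assumption: keys follow the pattern simianarmy.chaos.<strategy>.enabled.

PREFIX = "simianarmy.chaos."
SUFFIX = ".enabled"


def enabled_strategies(properties_text):
    """Return the strategy names whose '.enabled' flag is set to true."""
    enabled = []
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        has_name = len(key) > len(PREFIX) + len(SUFFIX)
        if has_name and key.startswith(PREFIX) and key.endswith(SUFFIX):
            if value.lower() == "true":
                enabled.append(key[len(PREFIX):-len(SUFFIX)])
    return enabled


# The mixture shown above, copied as sample input.
SAMPLE = """\
#Strategies
simianarmy.chaos.shutdowninstance.enabled = true
simianarmy.chaos.blockallnetworktraffic.enabled = true
simianarmy.chaos.burncpu.enabled = false
simianarmy.chaos.killprocesses.enabled = false
simianarmy.chaos.nullroute.enabled = false
simianarmy.chaos.failapi.enabled = false
simianarmy.chaos.faildns.enabled = true
simianarmy.chaos.faildynamodb.enabled = false
simianarmy.chaos.fails3.enabled = false
simianarmy.chaos.networkcorruption.enabled = true
simianarmy.chaos.networklatency.enabled = true
simianarmy.chaos.networkloss.enabled = true
"""

if __name__ == "__main__":
    for name in enabled_strategies(SAMPLE):
        print(name)
```

A check like this is handy in a review step before the monkey is let loose, so that nobody enables a strategy by accident.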
The Chaos Monkey communicates with the IaaS via the provided APIs and, if needed, connects via SSH to your servers to carry out its actions. Of course, these actions are logged so that you can correlate the chaos the monkey creates with your infrastructure logs and metrics.
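How such a correlation could look is sketched below in Python. This is an illustration under loud assumptions: the `Event` record, the `correlate` function, the five-minute window and all sample entries are hypothetical and do not reflect the monkey's real log format. The idea is simply to pair each chaos action with the infrastructure events that follow it within a time window.

```python
# Minimal sketch: pair chaos actions with infrastructure events that occur
# shortly afterwards. All records and names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str
    message: str


def correlate(chaos_events, infra_events, window=timedelta(minutes=5)):
    """Return (chaos, infra) pairs where infra follows chaos within window."""
    pairs = []
    for chaos in chaos_events:
        for infra in infra_events:
            delay = infra.timestamp - chaos.timestamp
            if timedelta(0) <= delay <= window:
                pairs.append((chaos, infra))
    return pairs


# Hypothetical sample data, already parsed from two different log sources.
CHAOS = [
    Event(datetime(2015, 1, 1, 12, 0), "chaos-monkey", "networklatency started"),
]
INFRA = [
    Event(datetime(2015, 1, 1, 11, 59), "monitoring", "nightly report sent"),
    Event(datetime(2015, 1, 1, 12, 2), "monitoring", "sync latency alarm"),
    Event(datetime(2015, 1, 1, 12, 30), "monitoring", "disk usage warning"),
]

if __name__ == "__main__":
    for chaos, infra in correlate(CHAOS, INFRA):
        print(chaos.message, "->", infra.message)
```

A fixed window is a deliberately naive choice. In a real setup you would parse both log sources into such records first and probably tune the window for each strategy.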
Once you have reached the recommended state of a working cloud infrastructure with resilience features enabled, the Simian Army is a very good starting point to test your setup and to get an idea of how complex the possible failure scenarios are. Our recommendation is to take your first steps with these tools in a sandboxed playground, because you probably will not survive the first shot. Good luck!