
Monkeying with Infrastructure: How Netflix Dealt with EC2 Reboots
As we’ve seen recently in the news, the Shellshock vulnerability, also known as the Bash bug, has become a new issue for servers around the world. Separately, there was also a Xen Security Announcement (XSA) that affected approximately 10% of Amazon EC2 instances worldwide.
Amazon was quick to note that these are two unrelated issues, and that the Bash bug was not responsible for the reboots. This was comforting to customers, but the Xen fix still required reboots across numerous regions. As you may already know, many cloud-based services, including Netflix, are backed by the popular EC2 environment.
Preparing for Chaos, Netflix Style
If you haven’t already heard of it, Netflix uses a system that they call the Chaos Monkey. This interesting little application runs within their cloud application environment and deliberately brings down instances to verify that the overall application still stays up.
Where most developers would run these tests in their UAT or QA environment, the interesting tactic adopted by Netflix is that they run this against their production cloud. Yes, you read that correctly. It is running against the very same set of instances (many, many, many instances) which are responsible for delivering their customer content.
By creating a resilient, scale-out application environment, Netflix is able to destroy various parts of the active pool of resources without taking down the service. When the notice came from Amazon that EC2 was going to get rolling reboots, this was really no different than what Netflix experiences on a day-to-day basis at their own hands thanks to the Chaos Monkey application.
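To make the idea concrete, here is a minimal toy sketch of the principle, not Netflix's actual Chaos Monkey code. Everything in it (the `InstancePool` class, the `chaos_round` function, the instance names, and the "minimum healthy instances" rule) is a hypothetical stand-in invented for illustration. It only simulates a pool in memory and never touches a real cloud API.

```python
import random


class InstancePool:
    """Toy, in-memory stand-in for a scale-out pool of interchangeable instances."""

    def __init__(self, size, min_healthy):
        # Hypothetical instance names mapped to an "is running" flag.
        self.instances = {f"i-{n:04d}": True for n in range(size)}
        # Assumption: the service counts as "up" while this many instances run.
        self.min_healthy = min_healthy

    def healthy(self):
        return [name for name, up in self.instances.items() if up]

    def service_is_up(self):
        return len(self.healthy()) >= self.min_healthy

    def terminate(self, name):
        self.instances[name] = False

    def replace_failed(self):
        # Mimics automatic replacement of lost capacity.
        for name, up in self.instances.items():
            if not up:
                self.instances[name] = True


def chaos_round(pool, rng):
    """Terminate one random healthy instance and report whether the service survived."""
    victim = rng.choice(pool.healthy())
    pool.terminate(victim)
    survived = pool.service_is_up()
    pool.replace_failed()
    return victim, survived


rng = random.Random(42)
pool = InstancePool(size=10, min_healthy=8)
for _ in range(5):
    victim, survived = chaos_round(pool, rng)
    print(victim, "terminated; service still up:", survived)
```

With spare capacity in the pool, losing any single instance leaves the simulated service up. Set `min_healthy` equal to `size` and every round reports an outage, which is the kind of weakness this style of testing is meant to surface before a provider-driven reboot does.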
The one thing that they weren’t sure of was how many of their instances would be affected. Luckily they were prepared, and this great blog post helps to describe what the Netflix engineers did in order to get through the big weekend.
It’s a great read, and a nod towards the methodologies put in place at Netflix. We don’t all have to run our own Chaos Monkey, but if we architect towards it, it certainly wouldn’t be a bad thing.