Recently, one of our small EC2 instances failed. We had Nagios monitoring it, but Nagios only alerts when a service fails or the host goes down. In this case, the failure was on Amazon’s side: the hardware where our instance resided was failing.
We received the following email:
We have noticed that one or more of your instances are running on a host degraded due to hardware failure.
The host needs to undergo maintenance and will be taken down at 12:00 GMT on YYYY-MM-DD. Your instances will be terminated at this point.
The risk of your instances failing is increased at this point. We cannot determine the health of any applications running on the instances. We recommend that you launch replacement instances and start migrating to them.
Feel free to terminate the instances with the ec2-terminate-instance API when you are done with them.
Let us know if you have any questions.
Gee, thanks. Essentially, our only option at this point was to bring up a new instance, customize it, and use it to replace the failing one. We already had a (mostly) customized image for this.
The immediate problem was to capture the data and configs from the failing instance so that we could apply them to the new one. Since we did not have this scripted, we had to log in and gather the information by hand. Unfortunately, access was touch and go. We were able to discern the disk structure, partition layout, fstab, LVM attributes, existing users, etc., and quickly grabbed all of that info for later use. We were fortunate that our earlier design decision to place all data on EBS volumes meant we could detach them and reattach them to a new instance. What follows is the approximate method we used to replace the failing instance:
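For the curious, the gathering step can be scripted so it is not a scramble next time. This is a minimal sketch for a Linux instance; the output directory and file names are our choices, not anything from our actual setup, and the partition/LVM commands need root.

```shell
#!/bin/sh
# Sketch of the info-gathering step on a failing Linux instance.
# Output location is illustrative; run as root for full detail.
OUT=${OUT:-/tmp/instance-info}
mkdir -p "$OUT"

df -h          > "$OUT/df.txt"         2>/dev/null || true  # disk usage
mount          > "$OUT/mount.txt"      2>/dev/null || true  # current mounts
cat /etc/fstab > "$OUT/fstab.txt"      2>/dev/null || true  # mount configuration
getent passwd  > "$OUT/passwd.txt"     2>/dev/null || true  # existing users
fdisk -l       > "$OUT/partitions.txt" 2>/dev/null || true  # partition layout (needs root)

# LVM attributes, if the LVM tools are installed (these also need root)
for cmd in pvdisplay vgdisplay lvdisplay; do
    command -v "$cmd" >/dev/null 2>&1 && "$cmd" > "$OUT/$cmd.txt" 2>/dev/null || true
done
```

Copy the resulting directory off the instance as soon as it is written, since access to a degraded host can disappear at any moment.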
- Log on to the instance and gather info.
- Snapshot the EBS stores, just in case.
- Unmount the EBS stores containing most of the data. One store unmounted cleanly; the other had to be forced. Not ideal, with some risk of corrupted data, but better than losing everything.
- Detach the EBS volumes from the failing instance. Here, too, one volume had to be force-detached.
- Bring up a new instance.
- Reconstruct the server based on the info gathered in the first step.
- Attach the EBS volumes to the new instance.
- Mount the volumes and test the applications.
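With the classic ec2-api-tools, the EBS part of the sequence looks roughly like the following. All resource IDs, the keypair, and the mount point are hypothetical placeholders; the `run` helper (our addition) defaults to printing each command so the sketch can be previewed without touching a live account.

```shell
#!/bin/sh
# Sketch of the snapshot/detach/attach sequence with ec2-api-tools.
# All IDs below are hypothetical. DRY_RUN=1 (the default here) prints
# each command instead of executing it; set DRY_RUN=0 to run for real.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

DATA_VOL=vol-11111111      # EBS volume holding the data (hypothetical)
AMI=ami-xxxxxxxx           # our (mostly) customized image (hypothetical)

run ec2-create-snapshot "$DATA_VOL"                      # snapshot, just in case
run ec2-detach-volume "$DATA_VOL" --force                # we had to force-detach
run ec2-run-instances "$AMI" -k our-keypair -t m1.small  # bring up replacement

# ...reconstruct the server config on the new instance, note its ID, then:
NEW_INSTANCE=i-bbbbbbbb    # hypothetical replacement instance ID
run ec2-attach-volume "$DATA_VOL" -i "$NEW_INSTANCE" -d /dev/sdf
```

After the attach, mount `/dev/sdf` on the new instance and test the applications before pointing anything at it.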
What did we learn?
- Keeping data on EBS stores is good.
- Having a standard image is good.
- Getting into a failing instance early is very helpful.
- Standard backups of configs and homedirs to another server are helpful.
- Ensure that only transient data is stored on the ephemeral stores (/dev/sdb and /dev/sdc). Back up anything there that is needed.
- Improve scripting for launching a new instance.
- Improve documentation of instance configuration and layout.
- Explore some of Amazon’s offerings to see if we can replace a failing instance automatically (e.g. Amazon CloudWatch in conjunction with Auto Scaling).
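On the backups point: the config and homedir backup can be as simple as a nightly tar shipped to another host. A minimal sketch, where the backup directory, remote host, and paths are illustrative rather than our actual setup:

```shell
#!/bin/sh
# Minimal sketch of a nightly config/homedir backup.
# BACKUP_DIR and the remote destination are illustrative placeholders.
STAMP=$(date +%Y%m%d)
BACKUP_DIR=${BACKUP_DIR:-/tmp/backups}
mkdir -p "$BACKUP_DIR"

# Archive configs and home directories (unreadable files are skipped).
tar czf "$BACKUP_DIR/etc-$STAMP.tgz"  -C / etc  2>/dev/null || true
tar czf "$BACKUP_DIR/home-$STAMP.tgz" -C / home 2>/dev/null || true

# Ship the archives to another server (site-specific, so commented out):
# scp "$BACKUP_DIR"/*-"$STAMP".tgz backup@backuphost:/srv/backups/
```

Run from cron, this means a replacement instance can be reconstructed even if the failing host becomes completely unreachable.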
Overall this was fairly painless, but there is room for improvement.
Comments? How are other folks handling disaster recovery in the cloud? Let us know!