Amazon Web Services: My EBS is stuck!

Many of us were affected by the Amazon EBS issues at the end of October 2012. If you had EC2 instances in us-east-1, you were likely among them: EBS volumes appeared “stuck,” snapshots would not complete, and so on.
While the issues have been resolved (although we required some Amazon Support intervention for a few volumes), we have recently noticed what appear to be some vestigial issues related to the EBS outage.
The symptoms are simple: EC2 instances appear to be extremely slow, and I/O is almost nonexistent. Luckily, the fix is simple too: perform a stop/start on the instance (not a reboot, which leaves the instance on the same host). Your instance will be provisioned to new hardware, and you’ll have to ensure you account for a different IP address, but other than that, you’ll be back in business.
Of course, for next time, make sure that your instances are spread across multiple Availability Zones and Regions.
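The stop/start can also be scripted. The following is only a sketch, assuming the boto3 library and an EBS-backed instance; the function name `stop_start` and the instance ID in the usage comment are made up for illustration. It reports the public IP address before and after so you can see what needs updating.

```python
def stop_start(ec2, instance_id):
    """Stop, then start, one EBS-backed instance.

    `ec2` is assumed to be a boto3 EC2 client (or anything with the same
    methods). Returns (old_ip, new_ip); either may be None if the instance
    has no public address.
    """

    def public_ip():
        reply = ec2.describe_instances(InstanceIds=[instance_id])
        instance = reply["Reservations"][0]["Instances"][0]
        return instance.get("PublicIpAddress")

    old_ip = public_ip()

    # A stop followed by a start is what moves the instance to other
    # hardware; a reboot would keep it on the same host.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    return old_ip, public_ip()


# Hypothetical usage (needs boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   old_ip, new_ip = stop_start(ec2, "i-0123456789abcdef0")
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real account.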
Until next time….
-Michael Klatsky, VP of Systems Administration

New Systems and DevOps Blog

Our VP of Systems Administration, Michael Klatsky, has started a blog specifically discussing systems. Fresh from the AWS Summit 2012 in NYC, Michael has lots of new approaches to discuss in terms of systems, cloud computing, DevOps, system architecture, and how developers and systems staff need to communicate well and work together for the best results in web development. The blog is his own, but we feel it’s a great technical resource for our colleagues in systems and web development. You can take a look at his blog here. Michael welcomes commentary and discussion, and hopes to provide some shortcuts for fellow system administrators.

Cloud Platforms: The Promise vs. The Reality

Recently our VP of Search, Michael McIntosh, sat down and talked to me about his thoughts on cloud computing and what businesses should be aware of when investing in the cloud.


Karen: So, how does enterprise search and cloud computing fit together?  What’s good about it for companies?

Michael: The advent of cloud computing makes it a lot easier for companies to get into search without investing a huge sum of money up front. Some of the pay-as-you-go computing approaches make it possible to do things that in the past wouldn’t have been financially viable, such as natural language processing on content. Something that could have taken days, weeks, or even months can now take much less time by throwing more hardware at a problem for a shorter time span.

For example, you could throw 20 machines at a problem for 12 hours, do a bunch of computations in a massively parallel way, and then stop them as soon as it’s done, versus the old model where you have to buy all the hardware, or rent it, and make sure it’s not underutilized so you make your investment back.

But if you need a lot of processing power for a short amount of time, it’s really quite amazing what we can do now with an approach like this.

Karen: Is this a new technology for TNR?

Michael: TNR has been using cloud computing platforms for several years now, 3 or 4 years. Cloud computing in itself is sort of a buzzword, because distributed processing and hosting have been around for a while, but the pay-as-you-go computing model is relatively new. So we have a great deal of experience with the reality of cloud computing platforms vs. the promise of cloud computing platforms.

Karen: So, what is the difference between the “promise” and the “reality” of cloud computing platforms?

Michael: Well, a lot of people think of cloud computing as this magical thing; all their problems will be solved and it will be super dependable because there are very large businesses like Amazon running the underlying infrastructure and you don’t have to worry about it.

But as the physical infrastructure becomes easier to deploy, other critical factors come into play. You won’t have to worry about the physical logistics of getting hardware in place, but you will have to manage multiple instances, and when you provision temporary processing resources, you have to remember to retire them when they’re no longer needed. Otherwise you’ll be paying more than you need to. Since virtualization uses physical hardware you do not control or maintain, there are fewer warning signs of a potential systemic failure. Now Amazon, which is the one we use the most, does a good job of backing up instances and making things available to you even when there are failures. But we’ve had problems where we’ve lost entire zones. Even when we’ve had multiple machines configured with fault tolerance, Amazon has experienced outages that have taken entire regions offline despite every conservative effort to ensure continuous uptime. So we’ve had our entire service clusters go down because of problems Amazon was having.

It becomes critically important for companies to develop and maintain a disaster recovery plan. Companies need to make sure things that are critical are backed up in multiple locations. Now historically, this has been hard to do because companies typically buy enough equipment for production needs, but not enough equipment for development and staging environments.
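Michael’s point about remembering to retire temporary resources lends itself to a small automated check. This is a minimal sketch, not a description of how TNR does it: the inventory format (dicts with `id`, `temporary`, and `launched` keys) and the 12-hour default are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def overdue_instances(instances, now, max_age=timedelta(hours=12)):
    """Return the IDs of temporary instances that have run longer than max_age.

    `instances` is a hypothetical inventory: a list of dicts with the keys
    "id" (str), "temporary" (bool), and "launched" (timezone-aware datetime).
    """
    overdue = []
    for inst in instances:
        if inst.get("temporary") and now - inst["launched"] > max_age:
            overdue.append(inst["id"])
    return overdue


# Example inventory: two batch workers and one long-lived web server.
now = datetime(2012, 11, 1, 12, 0, tzinfo=timezone.utc)
inventory = [
    {"id": "batch-1", "temporary": True, "launched": now - timedelta(hours=20)},
    {"id": "batch-2", "temporary": True, "launched": now - timedelta(hours=3)},
    {"id": "web-1", "temporary": False, "launched": now - timedelta(days=90)},
]
print(overdue_instances(inventory, now))  # ['batch-1']
```

Run on a schedule, a report like this flags forgotten batch machines before they show up on the bill.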

Karen: That sounds like a costly mistake.

Michael: It can be very costly because people often develop disaster recovery plans without ongoing testing to confirm the approach continues to work. If the approach is flawed, when you do suffer an outage, you can be offline for hours, days or weeks. Even worse, you may not be able to recover your data at all.

Karen: That sounds extremely costly.

Michael: Yes, it’s no fun at all.

There are upsides, though. One plus is that cloud computing forces you to be more formal about how you manage your technical infrastructure. For example, for training purposes, we can just give a new developer a copy of a production system and have them go to town on it, making whatever modifications they like, without risking the actual production servers. And if they make a mistake, which is human (you have to factor in human error), you can reprovision a brand new one and retire the one that is fouled up, instead of spending hours and hours trying to fix the problem on the machine they were working on.

Karen: This sounds like it’s a lot more flexible and time efficient, with a layer of safety built in.

Michael: Yes. Cloud computing also comes in handy if you ever have a security breach. If a hacker gets in and the system is compromised, system administrators can go in and try to correct the problem, but hackers can often install backdoors to get in and out. With a cloud platform and a good disaster contingency and backup, system administrators can bring the whole instance down and apply the patch on a whole new machine, free of the breach and anything the intruder left in place. This is pretty easy to do with a cloud platform.

Karen: So TNR can help their clients do all these things?

Michael: Yes, we’ve worked with large customers over many years and we’ve seen a wide variety of things that can possibly go wrong, and we’ve been through several physical service outages both with Amazon Web Services and with Rackspace.

Cloud computing in itself is no panacea, but if you have the technical and organizational proficiency to effectively leverage the platform, it can be a powerful tool for accelerating your company’s rate of innovation.

If you are assessing the cloud as a solution in your business, contact us.  There are a variety of options for hosting that can save your company money and minimize outages. Let us show you the option that is the best fit for your organization.

Postfix SMTP AUTH w/TLS

The sysadmins at TnR Global, LLC enable email to be successfully delivered from EC2 instances, instead of being caught by Spamhaus and others.

One of the widely discussed issues with Amazon EC2 instances is the inability to reliably send email from the instances. In all too many cases, email from EC2 instances is automatically categorized as spam by the various relay databases, and by many ISPs and carriers. There are several solutions, the most common being a smarthost setup that uses either an external smarthost SMTP service, such as http://authsmtp.com, or an existing SMTP server within our infrastructure.

Cloud Enabled Personalized Genomics

A focus on expertise, resources available on demand, and the agility to experiment with big ideas will continue to draw some personal genomics researchers to public cloud computing.

Personalized medicine is a goal of the Department of Health and Human Services. It is a driver of genomic research. It is one version of the future of medicine, using our unique genetic code toward the prevention of disease and the use of more effective or safer tailored drug therapies. Cloud computing enables on-demand access to the computational resources required for the data analysis that will lay the groundwork for a revolution in health care.

Will Amazon AWS save me money?

One of the first questions we asked when deciding to use Amazon’s AWS services was: Will we save money?

At first, your EC2 servers may be simply architected, perhaps as small instances. Once you commit more seriously to a more robust architecture at Amazon, additional costs are introduced. For example, take a small instance, add a 2-disk 100GB EBS volume with RAID0 plus monitoring and a static IP, and the cost goes from the current ~$68 monthly per system to ~$88 monthly per system. Bump the instances to large instances (likely needed for any server running MySQL or any other equally intensive application) and that cost goes to ~$265 per instance. Add in the costs of additional services like bandwidth, static IPs, CloudWatch, etc., and the costs can quickly escalate. Of course, upfront payments for Reserved Instances can drastically reduce these costs.
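To see how these per-system figures add up across a fleet, here is a small sketch using the approximate prices quoted above. It assumes the ~$265 figure is also monthly, ignores bandwidth and Reserved Instance discounts, and uses a made-up fleet; the tier names are labels invented for this example.

```python
# Approximate monthly cost per system in USD, taken from the figures above.
MONTHLY_COST_USD = {
    "small_basic": 68,           # plain small instance
    "small_ebs_raid0": 88,       # small + RAID0 EBS volume, monitoring, static IP
    "large": 265,                # large instance (assumed to be a monthly figure)
}


def fleet_monthly_cost(fleet):
    """Sum the monthly cost of a fleet given as {tier_name: instance_count}."""
    return sum(MONTHLY_COST_USD[tier] * count for tier, count in fleet.items())


# Hypothetical fleet: four beefed-up small instances and two large ones.
fleet = {"small_ebs_raid0": 4, "large": 2}
monthly = fleet_monthly_cost(fleet)
print(monthly)       # 882
print(monthly * 12)  # 10584
```

Even this modest fleet lands near $900 a month, which is why the Reserved Instance discount mentioned above is worth pricing out early.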

However, I think the savings in development and deployment costs far outweigh the narrower gap in hard costs between physical and AWS servers, and the real MRC (monthly recurring cost) on the AWS servers will likely be lower for a given amount of computing resources.

So, can you save money? Yes. In some cases, it will be a direct apples-to-apples savings of hard dollars. In other cases, the agility gained will provide the greatest savings. In most cases, a combination of both will drive your cost savings.

To learn more about how operating in the cloud can save your company money, contact us for a free consultation.

AMIs for Bioinformatics on AWS

Bio-Linux and other bioinformatics tools available for EC2, Amazon’s Elastic Compute Cloud, were recently highlighted on the Amazon Web Services (AWS) blog. Customized Amazon Machine Images (AMIs) allow for the packaging and rapid, web-based deployment of the data sets and tools needed for these specialized tasks. Because AMIs can be saved, reproducing past results is simplified, and because they can also be shared, the computation environment of a particular analysis can be easily replicated both within and outside your organization.

Amazon EC2 system restore

Recently, one of our small EC2 instances failed. While we had Nagios monitoring it, Nagios only provides alerts when services fail or when the host goes down. In this case, the failure was on Amazon’s side: the hardware where our instance resided was failing.

Amazon EC2 ‘steals’ from you

As we implement more systems in the EC2 architecture, we are noticing a not insignificant number of CPU cycles being ‘stolen’. What is ‘steal’ time? It is CPU time that the Xen hypervisor takes for something other than your processes; from what I have read, that means other people’s processes. What we need to understand is how this affects performance. Does it truly matter? We have one virtual system that consistently shows steal time of 6-12%. That would mean that 6-12% of the CPU time we pay for is being used for instances other than our own. We will have to research this more to see what the true impact is on our systems, and whether there is a way to mitigate it.
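One way to put a number on steal time is to sample the aggregate `cpu` line of `/proc/stat` twice and compare. The sketch below assumes a Linux guest whose kernel reports the steal counter (the eighth value on that line); the function names are ours, and the sample counters are invented to land inside the 6-12% range described above.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of CPU time between two samples that was stolen."""
    if len(before) < 8 or len(after) < 8:
        raise ValueError("kernel does not report a steal counter")
    deltas = [b - a for a, b in zip(before, after)]
    # Order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest counters (if present) are already included in user and nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def measure_steal(interval_seconds=5.0, path="/proc/stat"):
    """Sample the live system twice and return the steal percentage."""
    with open(path) as handle:
        before = parse_cpu_line(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = parse_cpu_line(handle.read())
    return steal_percent(before, after)


# Worked example with invented counters (1000 ticks elapsed, 90 stolen):
first = parse_cpu_line("cpu 1000 0 200 5000 40 0 10 100 0 0")
second = parse_cpu_line("cpu 1300 0 300 5500 50 0 10 190 0 0")
print(steal_percent(first, second))  # 9.0
```

Tools such as `top` report the same figure as `st`; sampling it yourself makes it easy to log over time and compare against application latency.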