New Systems and DevOps Blog

Our VP of Systems Administration, Michael Klatsky, has started a blog specifically about systems. Fresh from the AWS Summit 2012 in NYC, Michael has plenty of new approaches to discuss: systems, cloud computing, DevOps, system architecture, and how developers and systems staff need to communicate and work together for the best results in web development. The blog is his own, but we feel it’s a great technical resource for our colleagues in systems and web development. You can take a look at his blog here. Michael welcomes commentary and discussion, and hopes to provide some shortcuts for fellow system administrators.

Cloud Platforms: The Promise vs. The Reality

Recently our VP of Search, Michael McIntosh, sat down and talked to me about his thoughts on cloud computing and what businesses should be aware of when investing in the cloud.

Karen: So, how do enterprise search and cloud computing fit together? What’s good about it for companies?

Michael: The advent of cloud computing makes it a lot easier for companies to get into search without investing a huge sum of money up front. Some of the pay-as-you-go computing approaches make it possible to do things that in the past wouldn’t have been financially viable, such as natural language processing on content. Something that could have taken days, weeks, or even months can now take much less time by throwing more hardware at the problem for a shorter time span.

For example, you could throw 20 machines at a problem for 12 hours, do a bunch of computations in a massively parallel way, and then stop as soon as it’s done, versus the old model where you have to buy or rent all the hardware and make sure it’s not underutilized so you make your investment back.

But if you need a lot of processing power for a short amount of time, it’s really quite amazing what we can do now with an approach like this.
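
To put rough numbers on the 20-machines-for-12-hours example, here is a minimal sketch of the arithmetic. The hourly rate and purchase price are made-up placeholders, not actual provider or hardware prices; only the 20 machines and 12 hours come from the interview.

```python
def burst_cost(machines, hours, hourly_rate):
    """Cost of renting `machines` instances for `hours` at `hourly_rate` each."""
    return machines * hours * hourly_rate


def break_even_hours(purchase_price, hourly_rate):
    """Hours of use per machine at which renting costs as much as buying."""
    return purchase_price / hourly_rate


# Hypothetical figures, for illustration only.
HOURLY_RATE = 0.50       # assumed price per machine-hour
PURCHASE_PRICE = 3000.0  # assumed price of one comparable server

MACHINES = 20  # from the example above
HOURS = 12     # from the example above

job_cost = burst_cost(MACHINES, HOURS, HOURLY_RATE)
owned_cost = MACHINES * PURCHASE_PRICE
per_machine_break_even = break_even_hours(PURCHASE_PRICE, HOURLY_RATE)

print(f"machine-hours used:     {MACHINES * HOURS}")
print(f"pay-as-you-go cost:     ${job_cost:,.2f}")
print(f"cost to buy 20 servers: ${owned_cost:,.2f}")
print(f"break-even per machine: {per_machine_break_even:,.0f} hours of use")
```

With any realistic pair of prices the shape of the result is the same: a short burst costs a small fraction of owning the hardware, and buying only pays off once each machine is kept busy for a long time.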

Karen: Is this a new technology for TNR?

Michael: TNR has been using cloud computing platforms for several years now, three or four. Cloud computing in itself is sort of a buzzword, because distributed processing and hosting have been around for a while, but the pay-as-you-go computing model is relatively new. So we have a great deal of experience with the reality of cloud computing platforms vs. the promise of cloud computing platforms.

Karen: So, what is the difference between the “promise” and the “reality” of cloud computing platforms?

Michael: Well, a lot of people think of cloud computing as this magical thing: all their problems will be solved, and it will be super dependable because there are very large businesses like Amazon running the underlying infrastructure, so you don’t have to worry about it.

But as the physical infrastructure becomes easier to deploy, other critical factors come into play. You won’t have to worry about the physical logistics of getting hardware in place, but you will have to manage multiple instances, and when you provision temporary processing resources, you have to remember to retire them when they are no longer needed. Otherwise you’ll be paying more than you need to. And since virtualization uses physical hardware you do not control or maintain, there are fewer warning signs of a potential systemic failure.

Now Amazon, which is the one we use the most, does a good job of backing up instances and making things available to you even when there are failures. But we’ve had problems where we’ve lost entire zones. Even when we’ve had multiple machines configured with fault tolerance, Amazon has experienced outages that have taken entire regions offline despite every conservative effort to ensure continuous uptime. So we’ve had entire service clusters go down because of problems Amazon was having.

It becomes critically important for companies to develop and maintain a disaster recovery plan. Companies need to make sure that critical things are backed up in multiple locations. Historically, this has been hard to do because companies typically buy enough equipment for production needs, but not enough for development and staging environments.
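
One way to keep temporary resources from lingering is a periodic sweep that flags anything marked temporary and older than an agreed limit. The sketch below is illustrative only: the inventory records, the "temporary" flag, and the 12-hour limit are assumptions, and in practice the list would come from your provider's API rather than being written by hand.

```python
from datetime import datetime, timedelta, timezone


def find_stale_instances(instances, max_age, now=None):
    """Return IDs of temporary instances that have run longer than max_age."""
    now = now or datetime.now(timezone.utc)
    return [
        inst["id"]
        for inst in instances
        if inst.get("temporary") and now - inst["launched"] > max_age
    ]


# Hypothetical inventory; a fixed "now" keeps the example reproducible.
now = datetime(2012, 5, 1, 12, 0, tzinfo=timezone.utc)
inventory = [
    {"id": "batch-01", "temporary": True, "launched": now - timedelta(hours=30)},
    {"id": "batch-02", "temporary": True, "launched": now - timedelta(hours=2)},
    {"id": "web-01", "temporary": False, "launched": now - timedelta(days=90)},
]

stale = find_stale_instances(inventory, max_age=timedelta(hours=12), now=now)
print("candidates for retirement:", stale)
```

Here only batch-01 is flagged: batch-02 is still within its limit, and web-01 is long-running but not marked temporary, so the sweep leaves it alone.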

Karen: That sounds like a costly mistake.

Michael: It can be very costly because people often develop disaster recovery plans without ongoing testing to confirm the approach continues to work. If the approach is flawed, when you do suffer an outage, you can be offline for hours, days or weeks. Even worse, you may not be able to recover your data at all.

Karen: That sounds extremely costly.

Michael: Yes, it’s no fun at all.

There are upsides, though. One plus is that cloud computing forces you to be more formal about how you manage your technical infrastructure. For example, when training a new developer, we can just give them a copy of a production system and have them go to town on it, making whatever modifications they like, without risking the actual production servers. And if they make a mistake, which is human (you have to factor in human error), you can reprovision a brand new one and retire the one that is fouled up, instead of spending hours and hours trying to fix the problem on the machine they were working on.

Karen: This sounds like it’s a lot more flexible and time efficient, with a layer of safety built in.

Michael: Yes. Cloud computing also comes in handy if you ever have a security breach. If a hacker gets in and the system is compromised, system administrators can go in and try to correct the problem, but hackers can often install backdoors to get in and out. A cloud platform with a good disaster contingency and backup lets system administrators bring the whole compromised instance down and apply the patch on a brand new machine that carries none of the intruder’s changes. This is pretty easy to do with a cloud platform.

Karen: So TNR can help their clients do all these things?

Michael: Yes, we’ve worked with large customers over many years and we’ve seen a wide variety of things that can possibly go wrong, and we’ve been through several physical service outages both with Amazon Web Services and with Rackspace.

Cloud computing in itself is no panacea, but if you have the technical and organizational proficiency to leverage the platform effectively, it can be a powerful tool for accelerating your company’s rate of innovation.

If you are assessing the cloud as a solution in your business, contact us.  There are a variety of options for hosting that can save your company money and minimize outages. Let us show you the option that is the best fit for your organization.

Leveraging your assets: Repurposing a physical server as an OpenVZ virtual server

As an active system administrator, part of my job is determining which systems require decommissioning due to the age of the OS or for other reasons. When readying a server for retirement, we’ll take the opportunity to move and upgrade the services running on that server. Often, we are then left with a perfectly good piece of hardware that has already been paid for and is still a valuable asset. A great way to leverage this equipment is virtualization, specifically OpenVZ.

Oftentimes, servers are underutilized, especially when it comes to development work or when running lower-impact applications. Rather than deal with multiple users on one system, Apache virtual hosts for multiple websites, worries about secure file access, or one user or customer hogging a huge amount of resources, we have found that creating multiple virtual servers using OpenVZ is an ideal solution. I won’t delve into OpenVZ deployment other than to briefly note that on our CentOS and Red Hat servers, installation is as simple as adding the correct repository and installing via yum (see here for more info). Once installed, a quick reboot into the new kernel and you are ready to roll. We are running 45-50 virtual servers (VEs, or containers) on one of our servers with two quad-core CPUs and 8GB of RAM, with plenty of room to spare. I recommend running ‘vzsplit’ to generate a good baseline configuration for your VEs.

Once we have installed and configured OpenVZ on our new server, we are able to deploy a large number of VEs for individual users or customers. Each VE gives the user root access and the ability to update and install their own software, deploy their own applications, and so on. To the user, they are on their own complete system. Should their application misbehave, it won’t affect the others on the system.

Additionally, many resources can be adjusted on the fly. Running out of disk space? Increase it on the fly. Need more memory? Increase on the fly. Live resource management such as this is a very powerful way to leverage your hardware.
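
As a rough sketch of what such a live adjustment looks like, the snippet below builds the 'vzctl set' command that raises a container's disk quota. The container ID 101 and the sizes are hypothetical. The '--diskspace' option takes a soft:hard limit pair, and '--save' writes the change to the container's configuration file so it survives a restart. Memory is adjusted with the same 'vzctl set' pattern, but the parameter names depend on the kernel and vzctl version in use, so they are left out here.

```python
import shlex


def set_diskspace_cmd(ctid, soft, hard):
    """Build the vzctl command that changes a container's disk quota.

    ctid is the container ID; soft and hard are the quota barrier and
    limit, written with a size suffix such as "10G".
    """
    return ["vzctl", "set", str(ctid), "--diskspace", f"{soft}:{hard}", "--save"]


# Hypothetical container ID and sizes, for illustration only.
cmd = set_diskspace_cmd(101, "10G", "11G")

# This only prints the command. On an OpenVZ host it would be run as root,
# for example with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, which avoids shell quoting problems if the values ever come from user input.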

We are currently using OpenVZ for CMS development, custom programming development, building custom RPMs, running websites, and various other testing where we need easily deployed servers that may or may not be needed for extended periods of time.

Virtualizing our own equipment in this way makes great economic sense for several reasons. We are using a server which we already own, which also serves our “green” sensibilities by keeping the system out of the landfill. We eliminate the need for more servers for development work. We can even host paying customers, thus deriving income from the hardware. Using our own equipment also helps us keep costs lower by lessening the need to move data and applications offsite to providers such as Slicehost. Slicehost has its place, and in fact we use them for certain applications, but they do not provide the versatility necessary for much of our development work.

In summary, by leveraging existing, underutilized, or potentially retired hardware, you can save money on additional hardware, derive income, and help the environment. The agility in development and deployment that we gain simply adds another layer to those economic advantages. That sounds like a good plan to me!

Amazon EC2 system restore

Recently, one of our small EC2 instances failed. While we had Nagios monitoring it, Nagios only provides alerts when services fail or when the host goes down. In this case, the failure was on Amazon’s side: the hardware where our instance resided was failing.

Migrating an OpenVZ Virtual Machine

One of the great features of OpenVZ is the ability to easily migrate a virtual machine (VM) to another server. While identifying the best way to do this recently, I read about two tools for the job: vzdump and vzmigrate.