Archive for the ‘System Administration’ Category

Building for Enterprise Search: A Systems View, Part 2

by Karen Lynn

When we left off, Michael Klatsky, VP of Systems Administration was telling me how important communication between the systems side and search side of is to developing an enterprise search solution. The process of building, testing, monitoring, adjusting, more testing, and more monitoring ensures systems function that way they are intended to function. Let’s resume our conversation where Michael discusses the tools he uses to ensure the system he’s building works the way the client wants it to. This is the second portion of a two part blog post.
*********************************************************************************************************************
Tools for BDD: Part 2

Karen: It’s sounding like the Search Team and Sys Admin Team need to have a good relationship and communicate often to ensure the system will accommodate the work the search team does.

Michael: Yes, search sometimes has to construct their scripts to conforms to systems. Testing is run on both sides, but small changes can affect others down the line, so it’s important to incorporate expected behaviors into modeling and monitoring on both applications and systems sides and how they interact with one another.

Karen: How do you make sure that happens?

Michael: We’re exploring some tools to help us make sure the machine will act just as we expect it to, like cucumber and cucumber nagios We’re using certain tools to facilitate the systems behaves in the way that we expect it to. We’re exploring cucumber for basic modeling and for testing. Cucumber is cool for testing because it returns values to you in colors. Red, meaning it failed, yellow meaning there’s problem, and green meaning its good. According to their docs, they instruct you to “keep running it until it’s a cucumber.”

Karen: Ah, I get it.

Michael: Right. And what cucumber nagios does is it takes cucumber and allows you to create a nagios monitoring check script. So if you pass, great, if you god red, nagios will throw an alert to the systems administrator so we have an opportunity to fix it before more is built.

Karen: Sounds like it’s an attentive way to build a system.

Michael: The only way to scale is to have machines do things for themselves. That’s the way to do it.

Karen: To automate.

Michael: Yes. Automation. Not to just set things up to automatically do configuration management beforehand, but to test afterwards to determine that your machine is behaving just as you (and your client) envisioned it.

For more information on how you can plan your enterprise search in cooperation with your systems administration team, contact us for a free consultation.

Building for Enterprise Search: A Systems View, Part 1

by Karen Lynn

I sat down with our VP of Systems Administration, Michael Klatsky to discuss some of his thoughts on how Systems Administration needs to work in concert with the Search Team to implement search technologies for clients. This is the first portion of a two part blog post.

**********************************************************************************************************************
Karen: You wanted to discuss how your approaching the systems side of search, and using a Behavior Driven Development (BDD) approach. Tell me about that.

Michael: Well, one of the problems we run into when systems brings up machines for enterprise search clusters is the search software (FAST ESP for example) is very particular about it’s environment- more so than many of the more common applications such as the Apache webserver. Properly configured DNS, specific environment variables, specific library versions have to be present. There are ownership and permissions that need to be in place, and performance metrics that must adhere to a given baseline. There can be slow disks can affect performance. There has to be the right amount of memory, and different classifications of systems roles. Currently, we have homegrown scripts that bring up systems, then we have other scripts we run to detect issues. These scripts will tell us if the system is ready for what we need it to do. We also monitor the systems for standard items such as diskspace, memory usage, as well as basic search functionality. For example we’ll run a quick search on say paper clips, and if comes back with results we know it’s running.

That’s what we’ve done historically. But now, we need to bring up larger numbers of machines,and have confidence that they will perform exactly as we expect. Additionally, we have a set of functional tasks that must be available without fail As we bring up clusters of larger numbers of machines, and as we need to be more nimble, how can we ensure that it will respond the way we expect it to?

Karen: This is where Behavior Driven Development comes in, right?

Michael: Right. There is a lot of discussion out there on Behavior Driven Development which would include behavior driven modelling, behavior driven monitoring, behavior driven architecture and infrastructure. So not only does a machine come up and is listening on these ports, but I can bring a machine up, I can go to that machine and I’m able to log in, install certain software, and peform tasks. I can go to another machine and perform a task. So, the question is, how do you model that? How do we ensure the system will behaves as it should?

Karen: So you’re looking at replicating the behavior of these systems so that every time we deploy something it will be the same way.

Michael: Right. And if a change is made, even a small change, we’ll see it right away because a system or service will fail and be able to fix it. Sometimes a service will fail silently. But we test and monitor constantly to ensure the system will do exactly what we expect it to do. It’s all a part of the build process.

Karen: Sounds like a smart approach.

Michael: Yes. And if we make a change, we’ll find out how that change will affect the rest of the system. For instance, we run tests and if something is wrong it should give you an error. For example if you change the location of your SSH keys. You may still be able to get into the machine by SSH, but one little change could make it impossible to SSH from one machine to another in the cluster. So rather than find that out after you begin your manual work on that, we make it part of the build process by constantly monitoring and testing the system as we build it.

Karen: It sounds like building a house and then realizing you have bricks out of place after it’s built.

Michael: Worse, it’s like building a house and realizing you forgot to build a door! At the very least while you are building, you can test, and let me know, “Hey! I don’t have a door to my house!” So that I can fix it before you move in.

There are certain things the search team needs to do to ensure their work will function in the system, like SSHing around the machines in the cluster–they need to be able to do that. There are certain ports that system need to be listening on, there are certain services that need to return a normal range of results. We need to define what a proper operation looks like. We can’t necessarily say that if we search for gold plated paperclips for example, that the search result should show 1000 results every time, that may or may not be the case–we don’t necessarily know if this is a proper result every time, but we should determine if the result returned is within a proper range of normal.

We’re defining what a proper operation looks like and ensure it functions that way. Part of the behavior driven model which is what I’m really interested in, we can set up a natural language looking config file. This config file should describe the actions or behaviors I expect. For example, when I go to ABC.com website and search for gold plated paperclips, I expect to see results. One result should be X. There should be more than Y results. When I return that result, I should be able to click on one result and go to that products feature list. Basically I’m describing how the customer will interact with the search, what I expect the customer to do, and design the system to respond with the customer’s actions in mind.

Karen: So your engineering it with the customer’s behaviors in mind.

Michael: That’s exactly what we’re doing. Then that if I look for a certain item, I get that result, describe the behavior of what the customer should do and make the system behave in cooperation with the customer behavior. We need to determine what right looks like, and have the system behave that way.

Karen: And what right looks like is really different for each client.

Micheal: Yes. You can write in somewhat natural English what that looks like. It’s not magic, but you still have to come up with specification of what right looks like. But you can do a lot of sophisticated things in this manner because you will know you’ll have a website that’s going to perform the way it’s suppose to perform. The bottom line is: Define what your systems should “look” like, deploy those systems using those definitions, and after deployment, test to ensure that those systems “look” like your definition.

For more information on how you can plan your enterprise search in cooperation with your systems administration team, contact us for a free consultation.

Cloud Platforms: The Promise vs. The Reality

by Karen Lynn

Recently our VP of Search, Michael McIntosh sat down and talked to me about his thoughts on cloud computing and what businesses should be aware of when investing in the cloud.


Karen: So, how does enterprise search and cloud computing fit together?  What’s good about it for companies?

Michael: The advent of cloud computing makes it a lot easier for companies to get into search without investing a huge sum of money up front. Some of the pay-as-you-go computing approaches make it possible to do things that in the past wouldn’t have been financially viable such as natural language processing on content.  Something that could have taken days, weeks, or even months can now take much less time by throwing more hardware at a problem for a shorter time span.

For example, you could throw 20 machines at a problem for 12 hours and do a bunch of computations in a massively parallel way, and then stop it as soon as it’s done….versus the old model where you have to buy all the hardware, or rent it, and make sure it’s not underutilized so you make your investment back.

But if you need a lot of processing power for a short amount of time, it’s really quite amazing what we can do now with an approach like this.

Karen: Is this a new technology for TNR?

Michael: TNR has been using cloud computing platforms for several years now—3 or 4 years.  Cloud computing in itself is sort of a buzz word, because distributed processing and hosting has been around for a while, but the pay-as-you-go computing model is relatively new. So we have a great deal of experience with the reality of cloud computing platforms vs. the promise of cloud computing platforms.

Karen: So, what is the difference between the “promise” and the “reality” of cloud computing platforms?

Michael: Well, A lot of people think of cloud computing as this magical thing; all their problems will be solved and it will be super dependable because there are very large businesses like Amazon running the underlying infrastructure and you don’t have to worry about it.

But, as the physical infrastructure becomes easier to deploy, other critical factors come into play. You won’t have to worry about the physical logistics of getting hardware in place. But, you will have to manage multiple instances, you have to make sure that when you provision temporary processing resources, you have to remember to retire it when it’s no longer needed. Otherwise you’ll be paying more than you need to. Since virtualization uses physical hardware you do not control or maintain—there are fewer warning signs to a potential systemic failure. Now Amazon, which is the one we use the most, does a good job of backing up instances and making things available to you even when there are failures. But we’ve had problems where we’ve lost entire zones. Even if we’ve had multiple machines configured with fault tolerance, Amazon has experienced outages that have taken entire regions offline despite every conservative effort to ensure continuous up time. So we’ve had our entire service clusters go down because of problems Amazon was having. It becomes critically important for companies to develop and maintain a disaster and recovery plan. Companies need to make sure things that are critical are backed up in multiple locations. Now historically, this has been hard to do because companies typically buy enough equipment for production needs, but not enough equipment for development and staging environments.

Karen: That sounds like a costly mistake.

Michael: It can be very costly because people often develop disaster recovery plans without ongoing testing to confirm the approach continues to work. If the approach is flawed, when you do suffer an outage, you can be offline for hours, days or weeks. Even worse, you may not be able to recover your data at all.

Karen: That sounds extremely costly.

Michael: Yes, it’s no fun at all.

There are upsides though. Some pluses are that cloud computing forces you to be more formal about how you manage your technical infrastructure. For example, for training purposes; with a new developer, we can just give them a copy of a production system, and have them go to town on it, make modifications, whatever without risking the actual production servers. And if they make a mistake, which is human (you have to factor in human error), you can reprovision a brand new one, and retire the one that is fouled up. Instead of having to spend hours and hours trying to fix the problem on the machine they were working on.

Karen: This sounds like it’s a lot more flexible and time efficient, with a layer of safety built in.

Michael: Yes. Cloud computing also comes in handy if you ever have a security breach. If a hacker gets into the system and the system is compromised–if this happens, system administrators can go in and try to correct the problem. But hackers can often install backdoors to get in and out. So a cloud platform with a good disaster contingency and backup can allow system administrators to bring a whole instance down and do the patch on a whole new machine without the security breaches and patches in place. This is pretty easy to do with a cloud platform.

Karen: So TNR can help their clients do all these things?

Michael: Yes, we’ve worked with large customers over many years and we’ve seen a wide variety of things that can possibly go wrong, and we’ve been through several physical service outages both with Amazon Web Services and with Rackspace.

Cloud computing in itself is no panacea, but if you have the technical and organization proficiency to effectively leverage the platform, it can be a powerful tool used to accelerate your company’s rate of innovation.

If you are assessing the cloud as a solution in your business, contact us.  There are a variety of options for hosting that can save your company money and minimize outages. Let us show you the option that is the best fit for your organization.

A better way to add or update MySQL rows

by admin

Recently, we needed to iterate over a fairly large data set (on the order of millions) and do the ever-common If it’s not in the database, put it in.  If it’s already there, just update some fields. It’s a pattern that is very common for things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious way of doing a SELECT, followed by either an UPDATE or an INSERT is too slow for even moderately-large datasets.  The better way to accomplish this is to use MySQL’s ON DUPLICATE KEY UPDATE directive.  By simply creating a unique key on the fields that should be different per-row, this syntax provides two specific benefits:

  • Allows batch (read: transaction) queries for large data
  • Increases performance overall versus making two separate queries

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback to this method, however, is that it may put additional load on your database server.  Like anything else, it’s worth testing out your individual situation but, for us, ON DUPLICATE KEY UPDATE was the way to go.

Leveraging your assets: Repurposing a physical server as an OpenVZ virtual server

by Michael Klatsky

As an active system administrator, part of my job is determining which systems require decommissioning due to the age of the OS or for other reasons. When readying a server for retirement, we’ll take the opportunity to move and upgrade the services running on that server. Often, we are then left with a perfectly good piece of hardware that has already been paid for and is still a valuable asset. A great way to leverage this equipment is virtualization- specifically, OpenVZ.

Oftentimes, servers are underutilized- especially when it comes to development work, or when running lower impact applications. Rather than deal with multiple users on a system, apache virtual hosts for multiple websites, worrying about secure file access, or one user or customer hogging a huge amount of resources, we have found that creating multiple virtual servers using OpenVZ is an ideal solution. I won’t delve into OpenVZ deployment other than to briefly note that on our CentOS and RedHat servers, installation is as simple as adding the correct repository and installing via yum (see here for more info). Once installed, a quick reboot into the new kernel and you are ready to roll. We are running 45-50 virtual servers (VEs, or containers) on one of our 2 QUAD core CPU, 8GB RAM servers, with plenty of room to spare. I recommend running ‘vzsplit’ to generate a good configuration basis for you VEs.

Once we have installed and configured OpenVZ on our new server, we are then able to deploy a large number of VEs for individual users or customers. Each VE provides the user with the ability to have root access, update and install their own software, deploy their own applications, etc. To the user- they are on their own complete system. Should their application misbehave, it won’t affect the others on the system.

Additionally, many resources can be adjusted on the fly. Running out of disk space? Increase it on the fly. Need more memory? Increase on the fly. Live resource management such as this is a very powerful way to leverage your hardware.

We currently are using OpenVZ for CMS development, custom programming development, building custom rpms, running websites, and various other testing where we need easily deployed servers which may or may not be needed of extended periods of time.

Virtualizing our own equipment in this way makes great economic sense for several reasons-. We are using a server which we already own, thus helping us increase our “green” sensibilities by keeping this system out of the landfill. We eliminate the need for more servers for development work. We can even host paying customers, thus deriving income from the hardware. Using our own equipment also helps us keep costs lower by lessening the need to move data and applications offsite to providers such as Slicehost.  Slicehost has it’s place- and, in fact, we use them for certain applications- but they do not provide the versatility necessary for much of our development work.
In summary, by leveraging existing, underutilized  or potentially retired hardware, you can save money in reduced additional hardware costs, derive income and help the environment. Additionally, the agility in development and deployment that we gain simply adds another layer to the economic advantages that we gain.That sounds like a good plan to me!

MySQL Error: BLOB/TEXT used in key specification without a key length

by admin

Recently, I was populating a database with lines from a number of log files.  One of the key pieces of information in each of these log lines was a URL.  Because URLs can be pretty much as long as they want to be (or can they?) I decided to make the URL field a Text type in my schema.  Then, because I wanted fast lookups, I tried to add an index (key) on this field and ran into this guy:

ERROR 1170 (42000): BLOB/TEXT column ‘field_name’ used in key specification
without a key length

It turns out that MySQL can only index the first N characters of a Blob or Text field – but for a URL, that’s not good enough.  After talking it over with my team members, we decided to instead add a field – url_md5.  By storing the md5sum of each URL, we could index on the hash field and have both fast lookups and avoid worrying about domains like this fitting into a VARCHAR.

Amazon EC2 system restore

by Michael Klatsky

Recently, one of our small EC2 instances failed.  While we had Nagios monitoring it, Nagios only provides alerts when services fail, or when the host goes down. In this case, the failure was on Amazon’s side- the hardware where our instance resided was failing.

Read the rest of this entry »

Amazon EC2 ’steals’ from you

by Michael Klatsky

As we implement more systems in the EC2 architecture, we are noticing a not so insignificant amount cpu cycles ’stolen’. What is a ’steal’ time? It is CPU time that is taken by the Xen hypervisor for something else other than your processes- from what I have read, other people’s processes. What we need to understand is how this affects performance. Does it truly matter? We have one virtual system that consistently has steal time of between 6-12%. That would mean that 6-12% of the CPU time we pay for is being used for instances other than our own. We will have to research this more to see what the true impact is on our systems, and if there is a way to mitigate it.

Per-directory access restrictions with Trac and Subversion (mod_svn)

by admin

The below assumes that you are requiring a login to access Subversion over http(s) and Trac, and that the credentials users use for each service are the same.

Your access control file:

Both Trac and mod_svn use the same access control file format, so you should only need one file for both. The file format is described in some detail here, but I’ll go over some basics. Read the rest of this entry »

Using the Data::Dumper module to debug your Perl scripts

by admin

If you’re ever writing a Perl script that contains fairly complicated data structures (multi-dimensional arrays, hashes of arrays, etc), or you’re debugging code that contains complicated objects, it may be helpful to view the current state of those data structures in their entirety.

This is easy to do with Data::Dumper. Although the features the module provides are fairly complex, it’s easy to use it to just print out the contents of something. Do do this, you’ll first need to import the module with ‘use Data::Dumper;’. Then, to dump the data in an array(for example), simply do ‘print Dumper(@array)’. Read the rest of this entry »