Fast ESP Error: no doc procs registered to process a batch with priority 0

Just wanted to take this error message off the "Hey, we've seen this before… now how did we resolve this?" pile.  This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to 
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s).  However, this won't necessarily solve the problem.  It turns out that a likely reason for this to pop up is a bad Document Processing Pipeline (DPP) stage.  The docprocs fire up, hit the bad stage (e.g. python errors) and don't recover.

To debug your DPP Stage, take a look at the logs for the document processor(s).  They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught python exception lurking somewhere in there.
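
A quick scan of those logs for tracebacks usually narrows things down; here's a rough sketch (exact file names vary by installation):

    # look for uncaught python exceptions in the procserver logs
    grep -ri "traceback" $FASTSEARCH/var/log/procserver/

    # watch the newest log while restarting the docprocs to catch the failure live
    tail -f $FASTSEARCH/var/log/procserver/*.log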

Postfix SMTP AUTH w/TLS

The sysadmins at TnR Global, LLC explain how to get email delivered successfully from EC2 instances, instead of having it caught by Spamhaus and other blacklists.

One of the widely discussed issues with Amazon EC2 instances is the inability to reliably send email from them. In all too many cases, email from EC2 instances is automatically categorized as spam by the various relay databases, and by many ISPs and carriers. There are several solutions, the most common being a smarthost setup using either an external smarthost SMTP service, such as http://authsmtp.com, or an existing SMTP server within our infrastructure.
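
As a minimal sketch of the smarthost approach (the relay host and credentials below are placeholders; newer Postfix releases prefer smtp_tls_security_level over smtp_use_tls), the relevant main.cf settings look something like:

    # /etc/postfix/main.cf -- relay all outbound mail through an authenticated smarthost
    relayhost = [smarthost.example.com]:587
    smtp_sasl_auth_enable = yes
    smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
    smtp_sasl_security_options = noanonymous
    smtp_use_tls = yes

The sasl_passwd file holds a line like "[smarthost.example.com]:587 user:password" and is compiled with "postmap /etc/postfix/sasl_passwd" before reloading Postfix.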

Integrate custom services with the Fast ESP Node Controller

Add your own services to ESP's Node Controller

Background

Integrating our own custom services with Fast ESP’s Node Controller provides us with several benefits:

  • Administrators without in-depth ESP knowledge can easily control services (e.g. start, stop, configure parameters)
  • Services can be started at boot time with the rest of ESP
  • espdeploy can be used to install our services in a multi-node cluster

The components required for this system are:

  1. The ESP Node Controller (with config file NodeConf.xml)
  2. The 3rd-party service (like a CherryPy server, log parser, etc.)
  3. A wrapper script (see below)

Steps for Integration

  1. Define the service you would like to integrate. It can be any script or binary that can be executed on the system. For example, the service might be a python script that takes command-line arguments and keeps running on its own (as a webserver does).
  2. Create the wrapper script that sets up the proper environment and starts/stops the service cleanly. The wrapper script should be put in the $FASTSEARCH/bin directory (with executable permissions). Additionally, the wrapper script should pass "$@" to your actual script so that any arguments defined in $FASTSEARCH/etc/NodeConf.xml will be passed along properly from the Node Controller to your service. The following is an example of a wrapper script:
    #!/bin/sh

    # make the service's modules visible to python
    export PYTHONPATH="/path/to/python"

    # run the service (backgrounded), forwarding all arguments from nctrl
    python "$FASTSEARCH/lib/python2.6/yourmodule/yourservice.py" "$@" &

    # record the process id of the python script
    SCRIPT_PID=$!

    # upon receiving a TERM signal, forward it to the service
    trap 'kill -TERM "$SCRIPT_PID"' TERM

    # block until the service exits (or nctrl sends TERM)
    wait
  3. Define the service in $FASTSEARCH/etc/NodeConf.xml
    Add the following to the end of the <startorder> tag:

    <proc>servicename</proc>

    Add the following to the end of the <node> tag, customizing as appropriate:

    <!-- My Custom Service -->
    <process name="servicename" description="My Custom Service">
            <start>
                    <executable>binaryname</executable>
                    <parameters>-p 16940 -v</parameters>
                    <port base="4004"/>
            </start>
            <outfile>servicename.scrap</outfile>
    </process>
  4. Reload the Node Controller configuration with the following:
    nctrl reloadcfg

And that’s it!  Now you should be able to start, stop, configure, and deploy your services using Fast ESP tools.  Enjoy!
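
For example, with the servicename used in the configuration above, the service can now be managed like any built-in ESP process:

    # control the custom service through the Node Controller
    nctrl start servicename
    nctrl stop servicename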

Cloud Enabled Personalized Genomics

A focus on expertise, readily available on-demand resources, and the agility to experiment with big ideas will continue to draw some personal genomics researchers to public cloud computing.

Personalized medicine is a goal of the Department of Health and Human Services and a driver of genomic research. It is one version of the future of medicine: using our unique genetic code toward the prevention of disease and the use of more effective or safer tailored drug therapies. Cloud computing enables on-demand access to the computational resources needed for the data analysis that will lay the groundwork for a revolution in health care.

Will Amazon AWS save me money?

One of the first questions we asked when deciding to use Amazon's AWS services was: Will we save money?

At first, your EC2 servers may be simply architected, perhaps small instances. Once a more serious commitment is made to a more robust architecture at Amazon, additional costs are introduced. For example, take a small instance, add a 2-disk 100GB EBS volume w/RAID0 plus monitoring and a static IP, and the cost goes from the current ~$68 monthly per system to ~$88. Bump the instances to large instances (likely needed for any server running MySQL, or any other equally intensive application) and that cost goes to ~$265 per instance. Add in the costs of additional services like bandwidth and CloudWatch, and the total can quickly escalate. Of course, upfront payments for Reserved Instances can drastically reduce these costs.

However, I think the savings in development and deployment costs far outweigh the narrower gap between physical and AWS server costs, and the real MRC (monthly recurring cost) of the AWS servers will likely be lower for a given amount of computing resources.

So- can you save money? Yes. In some cases, it will be a direct apples-to-apples savings of hard dollars. In other cases, the agility gained will provide the greatest savings. In most cases- a combination of both will drive your cost savings.

To learn more about how operating in the cloud can save your company money, contact us for a free consultation.

A better way to add or update MySQL rows

Recently, we needed to iterate over a fairly large data set (on the order of millions of rows) and do the ever-common "If it's not in the database, put it in; if it's already there, just update some fields."  It's a pattern that is very common for things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious way of doing a SELECT, followed by either an UPDATE or an INSERT, is too slow for even moderately large datasets.  The better way to accomplish this is to use MySQL's ON DUPLICATE KEY UPDATE directive.  By simply creating a unique key on the fields that identify a row, this syntax provides two specific benefits (a short sketch follows the list):

  • Allows batch (read: transaction) queries for large data
  • Increases performance overall versus making two separate queries
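
As a minimal sketch of the pattern (table and column names are hypothetical):

    -- requires a UNIQUE (or PRIMARY) key on the identifying columns
    CREATE TABLE log_entries (
        source    VARCHAR(64)  NOT NULL,
        message   VARCHAR(255) NOT NULL,
        hits      INT          NOT NULL DEFAULT 1,
        last_seen TIMESTAMP    NOT NULL,
        UNIQUE KEY uniq_entry (source, message)
    );

    -- one statement per row: INSERT if new, otherwise UPDATE in place
    INSERT INTO log_entries (source, message, hits, last_seen)
    VALUES ('web01', 'disk full', 1, NOW())
    ON DUPLICATE KEY UPDATE
        hits      = hits + 1,
        last_seen = VALUES(last_seen);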

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback to this method, however, is that it may put additional load on your database server.  Like anything else, it’s worth testing out your individual situation but, for us, ON DUPLICATE KEY UPDATE was the way to go.

Leveraging your assets: Repurposing a physical server as an OpenVZ virtual server

As an active system administrator, part of my job is determining which systems require decommissioning due to the age of the OS or for other reasons. When readying a server for retirement, we’ll take the opportunity to move and upgrade the services running on that server. Often, we are then left with a perfectly good piece of hardware that has already been paid for and is still a valuable asset. A great way to leverage this equipment is virtualization- specifically, OpenVZ.

Oftentimes, servers are underutilized- especially when it comes to development work, or when running lower-impact applications. Rather than deal with multiple users on a system, Apache virtual hosts for multiple websites, worrying about secure file access, or one user or customer hogging a huge amount of resources, we have found that creating multiple virtual servers using OpenVZ is an ideal solution. I won't delve into OpenVZ deployment other than to briefly note that on our CentOS and RedHat servers, installation is as simple as adding the correct repository and installing via yum (see the OpenVZ wiki for more info). Once installed, a quick reboot into the new kernel and you are ready to roll. We are running 45-50 virtual servers (VEs, or containers) on one of our two quad-core CPU, 8GB RAM servers, with plenty of room to spare. I recommend running 'vzsplit' to generate a good configuration basis for your VEs.
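
A rough sketch of those steps on CentOS (package and repository names as of the OpenVZ-era repos; check the wiki for current ones):

    # add the OpenVZ repository and install the kernel and tools
    wget -P /etc/yum.repos.d/ http://download.openvz.org/openvz.repo
    rpm --import http://download.openvz.org/RPM-GPG-Key-OpenVZ
    yum install -y ovzkernel vzctl vzquota

    # after rebooting into the OpenVZ kernel, generate a sample container
    # config sized so that roughly 40 containers can share the machine
    vzsplit -n 40 -f vps.basic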

Once we have installed and configured OpenVZ on our new server, we are then able to deploy a large number of VEs for individual users or customers. Each VE gives the user root access and the ability to update and install their own software, deploy their own applications, etc. To the user- it is their own complete system. Should their application misbehave, it won't affect the others on the system.
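
Creating such a container takes only a few commands; a sketch with a hypothetical container ID, template, and address:

    # create a container from an OS template, configure it, and bring it up
    vzctl create 101 --ostemplate centos-5-x86_64 --config vps.basic
    vzctl set 101 --ipadd 192.168.0.101 --hostname dev1.example.com --save
    vzctl set 101 --userpasswd root:changeme
    vzctl start 101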

Additionally, many resources can be adjusted on the fly. Running out of disk space? Increase it on the fly. Need more memory? Increase on the fly. Live resource management such as this is a very powerful way to leverage your hardware.
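
For instance (values are examples only, given as softlimit:hardlimit; older OpenVZ manages memory through the privvmpages parameter):

    # grow container 101's disk quota while it runs
    vzctl set 101 --diskspace 20G:22G --save

    # raise its memory allowance on the fly
    vzctl set 101 --privvmpages 512M:640M --save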

We currently use OpenVZ for CMS development, custom programming development, building custom RPMs, running websites, and various other testing where we need easily deployed servers which may or may not be needed for extended periods of time.

Virtualizing our own equipment in this way makes great economic sense for several reasons. We are using a server which we already own, thus helping us increase our "green" sensibilities by keeping this system out of the landfill. We eliminate the need for more servers for development work. We can even host paying customers, thus deriving income from the hardware. Using our own equipment also helps us keep costs lower by lessening the need to move data and applications offsite to providers such as Slicehost.  Slicehost has its place- and, in fact, we use them for certain applications- but they do not provide the versatility necessary for much of our development work.
In summary, by leveraging existing, underutilized or potentially retired hardware, you can save money through reduced hardware costs, derive income, and help the environment. Additionally, the agility in development and deployment that we gain simply adds another layer to the economic advantages. That sounds like a good plan to me!

Sidekick Danger – It’s not the cloud, it’s the approach

I recently saw the headline, "T-Mobile and Microsoft/Danger data loss is bad for the cloud", and, as an admin who works with cloud technology on a daily basis, viewed the headline with some concern. However, after reading the article itself, my only thought is "What does this have to do with the cloud?". Reading through, we find that Microsoft/Danger stores your phone data (contacts, photos, etc.) on its servers, and that the phone needs to be in constant contact with those servers in order to maintain service and data. Unfortunately, the servers crashed, and all of the data was lost. Turn off your phone, lose all your data. Yet- this is exactly what the Sidekick service promises to protect you from- and it failed.

The problem with blaming this on the "cloud" is that, while your cell phone and the Microsoft/Danger servers technically form a "cloud", the failure lies with the servers and those who administer them. It doesn't matter whether those servers are virtual or physical- if there is no disaster recovery plan in place, and if that plan has not been tested- data will be lost. Your data. This is not a shortcoming of cloud computing- it is a result of depending on others to maintain your data. It should also make us cautious about depending on external providers over the network to always be available. Services stop. Power fails. Disks die. Routing interruptions happen.

This is just network computing. But if the people (or companies) behind it all don't do their own due diligence- disasters like this, and worse, will continue to happen.

Perl script documentation with Pod::Usage

One of the most important parts of maintainable and usable system administration scripts is documentation. Code comments are a key component to this, but so is usage documentation. In this post, I’ll address an easy way to add complete and useful usage documentation to your scripts using the module Pod::Usage.

Pod::Usage is a module that lets you easily convert Pod documentation to a help message or man page. To use it, you just need to include several Pod sections in your documentation. These include the NAME, SYNOPSIS, OPTIONS, and DESCRIPTION sections. Pod::Usage will use some of these sections to generate usage information and man pages. Once you've written all of your documentation, you can use Getopt::Long to capture options passed to the script like --help and --man. For these options, you can run the function 'pod2usage', with varying levels of verbosity.

For example, ‘pod2usage(-verbose => 1)’ will print out a short usage message (generated from the SYNOPSIS Pod section). To open a full man page in your default man viewer, you can use a verbosity of 2. Additionally, you can print out a string before your generated usage message, by providing the ‘-msg’ option inside the function call. For example, ‘pod2usage(-msg => “Not enough arguments”, -verbose => 1)’ will print “Not enough arguments”, followed by the usage message for your script.
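
Pulling the pieces together, here's a minimal sketch (the script name, options, and Pod text are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Long;
    use Pod::Usage;

    my ( $help, $man );
    GetOptions( 'help|?' => \$help, 'man' => \$man )
        or pod2usage( -msg => "Invalid options", -verbose => 1 );

    # -verbose => 1 prints the short usage message; 2 opens the full man page
    pod2usage( -verbose => 1 ) if $help;
    pod2usage( -verbose => 2 ) if $man;

    print "Doing the real work here...\n";

    __END__

    =head1 NAME

    myscript - one-line description of what the script does

    =head1 SYNOPSIS

    myscript [--help] [--man]

    =head1 OPTIONS

    =over 8

    =item B<--help>

    Print a brief usage message and exit.

    =item B<--man>

    Display the full documentation as a man page and exit.

    =back

    =head1 DESCRIPTION

    B<myscript> demonstrates generating its usage output from its own Pod.

    =cut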

You can find documentation on the Pod::Usage module as well as example code on perldoc.perl.org.

AMIs for Bioinformatics on AWS

Bio-Linux and other bioinformatics tools available for EC2, Amazon's Elastic Compute Cloud, were recently highlighted on the Amazon Web Services (AWS) blog. Customized Amazon machine images (AMIs) allow for the packaging and rapid, web-based deployment of the data sets and tools needed for these specialized tasks. Because AMIs can be saved, reproducing past results is simplified; and because they can also be shared, the computational environment of a particular analysis can be easily replicated both within and outside your organization.
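
As a hedged illustration with the EC2 API tools of the era (the AMI ID below is a placeholder, not a real Bio-Linux image):

    # list publicly shared machine images, then launch one on demand
    ec2-describe-images -a | grep -i bio
    ec2-run-instances ami-00000000 -k my-keypair -t m1.large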
