How to create a duplicate ESP collection without re-crawling!

In a production (or even stable) ESP environment, it is difficult to make a change to the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection.

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Once you determine that your changes produce good results, you can easily migrate your front end to the new collection while keeping the stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command where origcollection is the original collection and newcollection is the new collection (that you just created):

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • the default specified above is a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. Using default sends the content to the current ESP install.
    • be sure to run the above command under either nohup or screen, as it will not exit until all content has been fed to the new collection; for large collections this may take a while (see the example after these steps).
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler
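
For long-running feeds, here is a minimal sketch of step 4 run under nohup (the log file path is just an example):

    # run postprocess in the background so it survives a logout;
    # all output goes to a log file we can tail to watch progress
    nohup $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection \
        > /tmp/postprocess-refeed.log 2>&1 &

    # follow the feed's progress
    tail -f /tmp/postprocess-refeed.log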

Fast ESP Error: no doc procs registered to process a batch with priority 0

Just wanted to take this error message off the “Hey, we’ve seen this before… now how did we resolve this?” pile.  This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to 
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s).  However, this won’t necessarily solve the problem.  It turns out that a likely reason for this error is a bad Document Processing Pipeline (DPP) stage.  The docprocs fire up, hit the bad stage (e.g. Python errors), and don’t recover.

To debug your DPP stage, take a look at the logs for the document processor(s).  They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught Python exception lurking somewhere in there.
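
A quick way to hunt for those exceptions (the log file layout can vary by installation, so treat the path glob as an assumption):

    # search the procserver logs for Python tracebacks,
    # printing a few lines of context after each match
    grep -i -A 5 traceback $FASTSEARCH/var/log/procserver/*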

TNR Global is co-organizing CloudCamp Western Massachusetts

TNR Global is co-organizing and sponsoring CloudCamp Western Massachusetts, which will take place on April 20, 2010, 2:30pm-7pm, at the National Science Foundation funded National Center for Information and Communications Technologies (ICT Center) at Springfield Technical Community College (1 Federal St, ICT Center, STCC, Springfield, MA 01105).  This event is co-organized by CloudCamp co-founder Dave Nielsen and the ICT Center.

Developers, decision makers, end users, and vendors from MA, CT, VT, and surrounding states are invited to participate: attend, present, and/or sponsor.   CloudCamp Western Massachusetts will provide a central point for bringing together local academia and businesses. The ICT Center will stream live video of the event to other technology community colleges around the nation.   http://www.cloudcamp.org/westernmass

For more info, please contact Natasha Goncharova at corp@tnrglobal.com

Presentation – Search Engine Integration, Deployment and Scaling in the Cloud

Michael McIntosh, Senior Search Architect, TNR Global, will be presenting at the NYC Search and Discovery Meetup on January 20, 2010, 6:30pm.  The topic of his presentation is Search Engine Integration, Deployment and Scaling in the Cloud. Michael will share his hands-on experience with implementing an enterprise search system in the cloud.   Come join!

http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/

Integrate custom services with the Fast ESP Node Controller

Add your own services to ESP's Node Controller

Background

Integrating our own custom services with Fast ESP’s Node Controller provides us with several benefits:

  • Administrators without in-depth ESP knowledge can easily control services (e.g. start, stop, configure parameters)
  • Services can be started at boot time with the rest of ESP
  • espdeploy can be used to install our services in a multi-node cluster

The components required for this system are:

  1. The ESP Node Controller (with config file NodeConf.xml)
  2. The 3rd-party service (like a CherryPy server, log parser, etc.)
  3. A wrapper script (see below)

Steps for Integration

  1. Define the service you would like to integrate. It can be any script or binary that can be executed on the system. For example, the service might be a Python script that takes command-line arguments and keeps running (as is the case with a web server).
  2. Create the wrapper script that sets up the proper environment and starts/stops the service cleanly. The wrapper script should be placed in the $FASTSEARCH/bin directory (with executable permissions). It should also pass "$@" to your actual script so that any arguments defined in $FASTSEARCH/etc/NodeConf.xml are forwarded properly from the Node Controller to your service. The following is an example of a wrapper script:
    #!/bin/sh
    
    # export the proper python path
    export PYTHONPATH=":/path/to/python"
    
    # run the script (backgrounded), forwarding all arguments from nctrl
    python $FASTSEARCH/lib/python2.6/yourmodule/yourservice.py "$@" &
    
    # remember the process id of the python script
    SCRIPT_PID="$!"
    
    # upon receiving a TERM signal, forward it to the python process
    trap 'kill -TERM $SCRIPT_PID' TERM
    
    # wait here until the service exits (or nctrl sends TERM)
    wait
  3. Define the service in $FASTSEARCH/etc/NodeConf.xml
    Add the following to the end of the <startorder> tag:

    <proc>servicename</proc>

    Add the following to the end of the <node> tag, customizing as appropriate:

    <!-- My Custom Service -->
    <process name="servicename" description="My Custom Service">
            <start>
                    <executable>binaryname</executable>
                    <parameters>-p 16940 -v</parameters>
                    <port base="4004"/>
            </start>
            <outfile>servicename.scrap</outfile>
    </process>
  4. Reload the Node Controller configuration with the following:
    $FASTSEARCH/bin/nctrl reloadcfg
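
Once the configuration is reloaded, the service can be managed like any built-in ESP process. A quick sketch (servicename here is the example process name from the NodeConf.xml snippet above):

    # the new service should now show up in the system status listing
    $FASTSEARCH/bin/nctrl sysstatus

    # start and stop it with the usual nctrl commands
    $FASTSEARCH/bin/nctrl start servicename
    $FASTSEARCH/bin/nctrl stop servicename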

And that’s it!  Now you should be able to start, stop, configure, and deploy your services using Fast ESP tools.  Enjoy!

Joomla! Workshop at MSBDC

On December 2, 2009, TNR Global held a workshop, “Create an Interactive Website for Your Business in One Day,” for the Small Business Development Center in Springfield, MA.  The class was led by Tamar Schanfeld, Joomla! Project Manager, and Natasha Goncharova, Managing Director at TNR Global.

The workshop covered how to create a web site for small businesses using the open source Joomla! content management system. A wide range of small and medium business owners as well as non-profit organizations attended the seminar, including Crocker Communications, Community Health Center of Franklin County, Travellynne Services, Nelson Altman Catering, and more. Participants liked the seminar and gave positive reviews:

“Excellent workshop! Day went fast!”
“You were very patient and thorough.”
“It was a lot of material to cover in one day but they did a good job.”

TNR Global will continue holding similar seminars for businesses as well as for non-profit organizations.  If you are interested in organizing a Joomla! seminar for your business association or organization, please call 413-425-1499 ext 2.

Cloud Enabled Personalized Genomics

A focus on expertise, available on-demand resources, and the agility to experiment with big ideas will continue to draw some personal genomics researchers to public cloud computing.

Personalized medicine is a goal of the Department of Health and Human Services and a driver of genomic research. It is one version of the future of medicine: using our unique genetic code for the prevention of disease and for more effective or safer tailored drug therapies. Cloud computing provides on-demand access to the computational resources required for the data analysis that will lay the groundwork for a revolution in health care.

A better way to add or update MySQL rows

Recently, we needed to iterate over a fairly large data set (on the order of millions of rows) and apply the ever-common pattern: if it’s not in the database, put it in; if it’s already there, just update some fields.  The pattern comes up constantly with things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious approach of doing a SELECT followed by either an UPDATE or an INSERT is too slow for even moderately large datasets.  The better way to accomplish this is MySQL’s ON DUPLICATE KEY UPDATE directive (sketched in the example below).  By creating a unique key on the fields that identify a row, this syntax provides two specific benefits:

  • Allows batched (read: transactional) queries for large data sets
  • Increases performance overall versus making two separate queries

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback to this method, however, is that it may put additional load on your database server.  Like anything else, it’s worth testing out your individual situation but, for us, ON DUPLICATE KEY UPDATE was the way to go.
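
For reference, here’s a minimal sketch of the pattern using a hypothetical log table (table and column names are illustrative):

    -- unique key on the field that identifies a row
    CREATE TABLE access_log (
        url       VARCHAR(255) NOT NULL,
        hits      INT          NOT NULL DEFAULT 1,
        last_seen TIMESTAMP    NOT NULL,
        UNIQUE KEY uniq_url (url)
    );

    -- insert a new row, or bump the counter and timestamp if the URL exists
    INSERT INTO access_log (url, hits, last_seen)
    VALUES ('/index.html', 1, NOW())
    ON DUPLICATE KEY UPDATE
        hits      = hits + 1,
        last_seen = NOW();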

Perl script documentation with Pod::Usage

One of the most important parts of maintainable and usable system administration scripts is documentation. Code comments are a key component to this, but so is usage documentation. In this post, I’ll address an easy way to add complete and useful usage documentation to your scripts using the module Pod::Usage.

Pod::Usage is a module that lets you easily convert Pod documentation into a help message or man page. To use it, you just need to include several Pod sections in your documentation, including NAME, SYNOPSIS, OPTIONS, and DESCRIPTION. Pod::Usage uses some of these sections to generate usage information and man pages. Once you’ve written your documentation, you can use Getopt::Long to capture options passed to the script, like --help and --man. For these options, you run the function pod2usage with varying levels of verbosity.

For example, pod2usage(-verbose => 1) will print a short usage message (generated from the SYNOPSIS Pod section). To open the full man page in your default man viewer, use a verbosity of 2. You can also print a string before the generated usage message by providing the -msg option inside the function call. For example, pod2usage(-msg => "Not enough arguments", -verbose => 1) will print “Not enough arguments”, followed by the usage message for your script.
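
Putting it all together, a minimal sketch (the script name, option names, and Pod text are just illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Getopt::Long;
    use Pod::Usage;

    # capture --help and --man, falling back to a short usage message on bad input
    my ( $help, $man );
    GetOptions( 'help' => \$help, 'man' => \$man )
        or pod2usage( -msg => "Invalid arguments", -verbose => 1 );

    pod2usage( -verbose => 1 ) if $help;    # short usage from SYNOPSIS
    pod2usage( -verbose => 2 ) if $man;     # full man page

    print "Doing the actual work...\n";

    __END__

    =head1 NAME

    example.pl - demonstrate usage messages with Pod::Usage

    =head1 SYNOPSIS

    example.pl [--help] [--man]

    =head1 OPTIONS

    =over 8

    =item B<--help>

    Print a brief usage message and exit.

    =item B<--man>

    Open the full documentation in your man viewer.

    =back

    =head1 DESCRIPTION

    A minimal script showing pod2usage with Getopt::Long.

    =cut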

You can find documentation on the Pod::Usage module as well as example code on perldoc.perl.org.