Migration from FAST ESP to Lucene Solr

Download the presentation and see the video.

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, spoke at the Lucene Revolution conference in Boston, MA, October 7-8, 2010. Michael reviewed the migration from FAST ESP to Lucene/Solr open source search. He discussed approaches to identifying the core content areas of HTML documents, such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, reviewed specific use cases we have encountered as search integration experts, and discussed the available tools.

TNR Global was a sponsor of Lucene Revolution. The conference gathered over 400 professionals from the enterprise search industry. We were happy to see so much interest in Lucene/Solr open source search, and to get to know and learn from the folks who have done large-scale implementations, including Twitter, LinkedIn, and eHarmony. Not surprisingly, there was a lot of interest in migrating from proprietary search systems to Solr, especially from FAST ESP, due to Microsoft discontinuing FAST ESP support for Linux. If you would like to learn more about how a migration from FAST ESP to Lucene/Solr can benefit your company, contact us for a free consultation.

How to create a duplicate ESP collection without re-crawling!

In a production (or even stable) ESP environment, it is difficult to make a change to the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection.

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Once you determine that your changes produce good results, you can easily migrate your front end to the new collection while keeping the existing stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command where origcollection is the original collection and newcollection is the new collection (that you just created):

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • The default specified above is a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml; the default destination points at the current ESP install.
    • Be sure to run the above command using either nohup or screen, as it will not exit until all content has been fed to the new collection; for large collections this may take a while (see the example after these steps).
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler
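
For the nohup option mentioned in the note for step 4, one way to launch the refeed so it survives a dropped SSH session looks roughly like this (the log file name is arbitrary):

    nohup $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection \
        > postprocess.log 2>&1 &
    # Follow progress without tying the job to your session.
    tail -f postprocess.log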

A better way to add or update MySQL rows

Recently, we needed to iterate over a fairly large data set (on the order of millions) and do the ever-common "if it's not in the database, put it in; if it's already there, just update some fields." This pattern comes up constantly with things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious approach, a SELECT followed by either an UPDATE or an INSERT, is too slow for even moderately large datasets.  The better way to accomplish this is MySQL's ON DUPLICATE KEY UPDATE clause.  By creating a unique key on the field (or fields) that identify a row, this syntax provides two specific benefits:

  • Allows batch (read: transaction) queries for large data
  • Increases performance overall versus making two separate queries

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback, however, is that it may put additional load on your database server.  Like anything else, it's worth testing in your individual situation, but for us, ON DUPLICATE KEY UPDATE was the way to go.
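
As a rough sketch of the pattern (the table and column names here are made up for illustration, not taken from our schema), a single statement handles both the insert and the timestamp-only update, and multiple rows can be batched together:

    CREATE TABLE access_log (
        source_file VARCHAR(255) NOT NULL,
        line_no     INT UNSIGNED NOT NULL,
        message     TEXT         NOT NULL,
        last_seen   DATETIME     NOT NULL,
        UNIQUE KEY uniq_line (source_file, line_no)
    );

    -- New rows are inserted; rows that collide with the unique key only get
    -- last_seen refreshed. Several such statements can share one transaction.
    INSERT INTO access_log (source_file, line_no, message, last_seen)
    VALUES ('app.log', 42, 'some log line', NOW()),
           ('app.log', 43, 'another log line', NOW())
    ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);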

Leveraging your assets: Repurposing a physical server as an OpenVZ virtual server

As an active system administrator, part of my job is determining which systems need to be decommissioned, whether because of the age of the OS or for other reasons. When readying a server for retirement, we take the opportunity to move and upgrade the services running on it. Often, we are then left with a perfectly good piece of hardware that has already been paid for and is still a valuable asset. A great way to leverage this equipment is virtualization, specifically OpenVZ.

Oftentimes, servers are underutilized, especially for development work or lower-impact applications. Rather than juggling multiple users on one system, Apache virtual hosts for multiple websites, secure file access, or one user or customer hogging a huge share of resources, we have found that creating multiple virtual servers using OpenVZ is an ideal solution. I won't delve into OpenVZ deployment other than to note that on our CentOS and Red Hat servers, installation is as simple as adding the correct repository and installing via yum (see here for more info). Once installed, a quick reboot into the new kernel and you are ready to roll. We are running 45-50 virtual servers (VEs, or containers) on one of our two quad-core CPU, 8 GB RAM servers, with plenty of room to spare. I recommend running 'vzsplit' to generate a good baseline configuration for your VEs.
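
On our CentOS and Red Hat machines the install boils down to a few commands; treat this as a sketch, since exact package names vary between OpenVZ kernel releases:

    # Add the OpenVZ yum repository, then install the kernel and tools
    # (the kernel package may be named ovzkernel or vzkernel depending on release).
    yum install ovzkernel vzctl vzquota
    # Reboot into the OpenVZ kernel.
    reboot
    # Generate a sample container config sized for roughly 40 containers on this host.
    vzsplit -n 40 -f vps.basic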

Once we have installed and configured OpenVZ on our new server, we can deploy a large number of VEs for individual users or customers. Each VE gives its user root access and the ability to update and install their own software, deploy their own applications, and so on. To the user, it looks like their own complete system, and should their application misbehave, it won't affect the others on the box.

Additionally, many resources can be adjusted on the fly. Running out of disk space? Increase it on the fly. Need more memory? Increase it on the fly. Live resource management like this is a very powerful way to leverage your hardware.
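
For example, vzctl can resize a running container; the container ID and sizes below are just placeholders:

    # Raise the disk quota (soft:hard limits) for container 101 without stopping it.
    vzctl set 101 --diskspace 20G:22G --save
    # Give it more memory; on older, non-vswap kernels this is done with UBC
    # parameters such as --privvmpages instead of --ram/--swap.
    vzctl set 101 --ram 2G --swap 1G --save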

We are currently using OpenVZ for CMS development, custom programming projects, building custom RPMs, running websites, and various other testing where we need easily deployed servers that may or may not be needed for extended periods of time.

Virtualizing our own equipment in this way makes great economic sense for several reasons. We are using a server we already own, which helps our "green" sensibilities by keeping the system out of the landfill. We eliminate the need to buy more servers for development work. We can even host paying customers, deriving income from the hardware. Using our own equipment also keeps costs lower by lessening the need to move data and applications offsite to providers such as Slicehost. Slicehost has its place, and in fact we use them for certain applications, but they do not provide the versatility necessary for much of our development work.
In summary, by leveraging existing, underutilized, or soon-to-be-retired hardware, you can save money on additional hardware, derive income, and help the environment. The agility we gain in development and deployment adds yet another layer to those economic advantages. That sounds like a good plan to me!

MySQL Error: BLOB/TEXT used in key specification without a key length

Recently, I was populating a database with lines from a number of log files.  One of the key pieces of information in each of these log lines was a URL.  Because URLs can be pretty much as long as they want to be (or can they?), I decided to make the URL field a TEXT type in my schema.  Then, because I wanted fast lookups, I tried to add an index (key) on this field and ran into this guy:

ERROR 1170 (42000): BLOB/TEXT column 'field_name' used in key specification without a key length

It turns out that MySQL can only index the first N characters of a BLOB or TEXT field, and for a URL that's not good enough.  After talking it over with my team members, we decided to instead add a field, url_md5.  By storing the md5sum of each URL, we could index the hash field, get fast lookups, and avoid worrying about whether domains like this will fit into a VARCHAR.
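
A minimal sketch of what that looks like (table and column names are illustrative, not our actual schema):

    -- The URL stays a TEXT column; the fixed-width MD5 hash is what gets indexed.
    CREATE TABLE crawl_log (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url     TEXT     NOT NULL,
        url_md5 CHAR(32) NOT NULL,
        KEY idx_url_md5 (url_md5)
    );

    -- Lookups hash the URL first, so the TEXT column itself never needs an index.
    SELECT id, url
    FROM crawl_log
    WHERE url_md5 = MD5('http://www.example.com/a/very/long/path?with=lots&of=params');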

Amazon EC2 system restore

Recently, one of our small EC2 instances failed. While we had Nagios monitoring it, Nagios only provides alerts when services fail, or when the host goes down. In this case, the failure was on Amazon’s side- the hardware where our instance resided was failing.

Amazon EC2 'steals' from you

As we implement more systems on EC2, we are noticing a not-so-insignificant amount of CPU cycles being 'stolen'. What is 'steal' time? It is CPU time taken by the Xen hypervisor for something other than your processes (from what I have read, other people's processes). What we need to understand is how this affects performance. Does it truly matter? We have one virtual system that consistently shows steal time of 6-12%. That means 6-12% of the CPU time we pay for is being used by instances other than our own. We will have to research this further to see what the true impact is on our systems, and whether there is a way to mitigate it.
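
Steal time is easy to watch from inside the instance: top reports it as %st in the CPU summary line, and vmstat shows it in its own column, for example:

    # Report CPU stats every 5 seconds; the last column, "st", is steal time.
    vmstat 5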

Per-directory access restrictions with Trac and Subversion (mod_svn)

The instructions below assume that you require a login to access both Subversion over HTTP(S) and Trac, and that users' credentials are the same for both services.

Your access control file:

Both Trac and mod_svn use the same access control file format, so you should only need one file for both. The file format is described in some detail here, but I’ll go over some basics.
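
As a bare-bones sketch of that format (the group, user, and path names here are hypothetical):

    [groups]
    developers = alice, bob

    [/]
    * = r

    [/trunk/private]
    * =
    @developers = rw

Here everyone gets read access to the whole repository, while only members of the developers group can read and write under /trunk/private.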

Using the Data::Dumper module to debug your Perl scripts

If you’re ever writing a Perl script that contains fairly complicated data structures (multi-dimensional arrays, hashes of arrays, etc), or you’re debugging code that contains complicated objects, it may be helpful to view the current state of those data structures in their entirety.

This is easy to do with Data::Dumper. Although the features the module provides are fairly complex, it's easy to use it to just print out the contents of something. To do this, you'll first need to import the module with 'use Data::Dumper;'. Then, to dump the data in an array (for example), simply do 'print Dumper(@array)'.
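
A tiny, self-contained example (the data is made up):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    # A hash of arrays, standing in for whatever structure you are debugging.
    my %requests_by_host = (
        'web01' => [ 'GET /index.html', 'GET /about.html' ],
        'web02' => [ 'POST /login' ],
    );

    # Passing a reference keeps the nesting visible in the output.
    print Dumper(\%requests_by_host);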