FAST to Lucene Solr: Choosing a Document Processing Pipeline for Solr

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options?

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines, each composed of multiple document processing stages. A document processing stage performs a document processing task and can add, modify, or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of a processing stage would be one that processes a document’s URL element. ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. The FAST ESP document processing engine also provides a Python plugin API that allows customers to create custom processing stages of their own, a feature we use heavily for our customer ESP installations.
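To make the idea concrete, here is a minimal sketch of such a URL-handling stage. The class name and document interface below are hypothetical stand-ins, not the actual ESP Python plugin API; the point is simply that a stage receives the whole document, reads one element, and writes others back before handing the document to the next stage.

    # Illustrative sketch only -- the stage and document interfaces are
    # hypothetical, not the real FAST ESP plugin API.
    from urllib.parse import urlparse

    class NormalizeUrlStage:
        """Reads a document's 'url' element and attaches a derived 'site' element."""

        def process(self, document):
            url = document.get("url")
            if not url:
                return document  # nothing to do; pass the document along unchanged
            document["url"] = url.strip()
            document["site"] = (urlparse(document["url"]).hostname or "").lower()
            return document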

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric, while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed to processing stages in a processing pipeline that can access document elements generated by previous processing stages. This allows for complex and efficient operations that leverage previous processing, such as reuse of a previously generated HTML DOM tree structure. When a document is passed to a Solr update handler, the document is broken up into a set of individual fields. Each field can have a chain of processors, known as Solr analysis filters, that run on the field before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack access to other document elements and therefore cannot easily support more complex processing behaviors.
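The difference in access patterns is easiest to see in code. The sketch below (again with hypothetical names, not a real ESP or Solr API) shows document-centric chaining, where a later stage reuses the HTML tree parsed by an earlier stage; a Solr analysis filter, by contrast, only ever sees the text of the single field it is attached to.

    # Hypothetical illustration of document-centric chaining: a later stage
    # reuses work (a parsed HTML tree) produced by an earlier stage instead of
    # re-parsing the content. Not a real ESP or Solr API.
    from xml.etree import ElementTree

    class ParseHtmlStage:
        def process(self, document):
            # Parse once and attach the tree so downstream stages can reuse it.
            document["dom"] = ElementTree.fromstring(document["html"])
            return document

    class ExtractHeadingsStage:
        def process(self, document):
            dom = document["dom"]  # reuse the tree built by ParseHtmlStage
            document["headings"] = [h.text for h in dom.iter("h1")]
            return document

    def run_pipeline(document, stages):
        for stage in stages:
            document = stage.process(document)
        return document

    doc = {"html": "<html><body><h1>Acme Corp</h1></body></html>"}
    doc = run_pipeline(doc, [ParseHtmlStage(), ExtractHeadingsStage()])
    print(doc["headings"])  # ['Acme Corp']

(A real pipeline would use an HTML parser that tolerates malformed markup; ElementTree is used here only to keep the sketch self-contained.)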

Another difference between the ESP and Solr platforms is that ESP’s document processing can be scaled independently of its indexing. ESP’s document processing architecture is fully decoupled from its indexing architecture and is designed out of the box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing is tightly coupled with its indexing, making it difficult to independently scale Solr’s content processing capacity without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multi-terabyte document sets, we find content processing tends to be the biggest bottleneck, so being able to scale content processing capacity separately from indexing is mission critical.
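As a rough illustration of what that decoupling buys you, the sketch below (the function and field names are ours, not part of ESP or Solr) runs document processing in its own pool of worker processes; the number of workers, and the number of machines running them, can be raised without touching the indexing tier.

    # Hypothetical sketch: run document processing in its own worker pool so
    # processing capacity can be scaled independently of the Solr/indexing tier.
    from multiprocessing import Pool

    def process_document(document):
        # Stand-in for a full pipeline of stages; real processing goes here.
        document["body"] = document.get("raw", "").strip()
        return document

    if __name__ == "__main__":
        raw_docs = [{"id": str(i), "raw": "  document %d  " % i} for i in range(1000)]
        with Pool(processes=8) as pool:  # scale workers to the cores available
            processed = pool.map(process_document, raw_docs, chunksize=50)
        print(len(processed), "documents processed")
        # The processed batch can now be posted to Solr's update handler;
        # indexing load stays the same no matter how many workers run above.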

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options? During the course of our research we discovered quite a number of content processing frameworks to choose from. Some of the options currently available include, but are not limited to, OpenPipeline, OpenPipe, Pypes, UIMA, SMILA, Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java, which gives them access to an incredibly broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer using Java and don’t have a large amount of working Python code to migrate. For our use cases the decision was simple, and we chose Pypes for our migration solution.

For a full report on how we use Pypes as a Document Processing Engine, including sample code, sign up for our free FAST to Lucene Solr White Paper here.

How to create a duplicate ESP collection without re-crawling!

In a production (or even stable) ESP environment, it is difficult to make a change to the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection.

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Upon determining that your changes are producing good results, you can easily migrate your front-end to the new collection while still maintaining existing stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command where origcollection is the original collection and newcollection is the new collection (that you just created):

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • The default in the command above refers to a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. Specifying default sends the content to the current ESP install.
    • Be sure to run the above command using either nohup or screen, as it will not exit until all content has been fed to the new collection. For large collections this may take a while.
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler

Fast ESP Error: no doc procs registered to process a batch with priority 0

Just wanted to take this error message off of the “Hey, we’ve seen this before… now how did we resolve this?” pile. This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to 
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s). However, this won’t necessarily solve the problem. It turns out that a likely reason for this error to pop up is a bad Document Processing Pipeline (DPP) stage. The docprocs fire up, hit the bad stage (e.g., Python errors) and don’t recover.

To debug your DPP stage, take a look at the logs for the document processor(s). They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught Python exception lurking somewhere in there.
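One way to make this failure mode less likely is to keep exceptions from escaping a custom stage in the first place. The sketch below uses hypothetical stage and document interfaces (not the actual ESP plugin API) to show the pattern: wrap the stage body in a try/except, log the failure with enough context to find it later, and let the document continue rather than leaving the docproc wedged.

    # Hypothetical sketch of a defensively written custom stage. The stage and
    # document interfaces are illustrative, not the real ESP plugin API.
    import logging

    log = logging.getLogger("my_custom_stage")

    class MyCustomStage:
        def process(self, document):
            try:
                self.do_work(document)
            except Exception:
                # Log and move on; one bad document should not wedge the docproc.
                log.exception("MyCustomStage failed for docid=%s", document.get("docid"))
            return document

        def do_work(self, document):
            # Placeholder for stage logic that may raise on malformed input.
            document["body"] = document["raw"].strip()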

Microsoft FAST ESP Overview

February 10, 2009
By Michael McIntosh, Senior Search Software Engineer
TNR Global
We use Microsoft FAST ESP to power a large industrial search engine listing over 1 million companies and over 3 million indexed documents and receiving millions of visitors every month. I have been working with ESP since 2003 (then known as FDS 3.2).
Microsoft FAST ESP is extremely flexible and can handle indexing many document types (HTML, PDF, Word, etc.). It has a very robust crawler for web documents, and you can use their intermediary FastXML format to load custom document formats into the system or use their Content APIs.
One of my favorite parts of the engine is its Document Processing Pipeline, which lets you make use of dozens of out-of-the-box processing plugins as well as a Python API for writing your own custom document processing stages. An example of a custom stage we wrote is one that looks at a web site URL and tries to identify which company it belongs to so that additional metadata can be attached to the web document.
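A heavily simplified sketch of that idea is below; the stage interface and the lookup table are invented for illustration and are not our production code or the ESP API.

    # Illustrative sketch only: map a document's URL to a known company and
    # attach it as metadata. Interfaces and data are hypothetical.
    from urllib.parse import urlparse

    KNOWN_COMPANIES = {
        "www.example.com": "Example Industries",
        "acme.example.org": "Acme Corp",
    }

    class CompanyLookupStage:
        def process(self, document):
            hostname = (urlparse(document.get("url", "")).hostname or "").lower()
            company = KNOWN_COMPANIES.get(hostname)
            if company:
                document["company"] = company  # extra metadata for the web document
            return document
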
It has a very robust programming/integration SDK in several popular languages (C++/C#/Java) for adding content and performing queries as well as fetching system status and managing cluster services.
ESP has a query language called FAST Query Language (FQL) that is very robust and allows you to do basic Boolean searches (AND, OR, NOT) as well as phrase and term proximity searches. In addition, it has something called “scope search,” which can be used to search document metadata (XML) whose format can vary from document to document.
In terms of performance, it scales fairly linearly: if you benchmark how it performs on one machine, adding a second machine generally doubles that performance. You can run the system on one machine (only recommended for development) or many (for production). It is fault-tolerant (it can still serve some results if one of your load-balanced indices goes offline), and it has full fail-over support (one or more critical machines can die or be taken offline for maintenance and the system will continue to function properly).
So, it’s very powerful. The documentation nowadays is pretty good. So, you ask, what are the downsides?
Well, if the data you need to make searchable has a format that changes frequently, that might be a pain. FAST ESP has something called an “Index Profile,” which is basically a config file it uses to determine which document fields are important and should be used for indexing. Everything fed into ESP is a “document,” even if you’re loading database table rows into it. Each document has several fields, typical fields being: title, body, keywords, headers, documentvectors, processingtime, etc. You can specify as many of your own custom fields as you wish.
If your content mostly keeps the same format (like web documents), it’s not a big issue. But if you have to make big changes to which fields should be indexed and how they should be treated, you probably need to edit the Index Profile. Some changes to the index profile are “Hot Updates,” meaning you can make the change without interrupting service. But some of the bigger changes are “Cold Updates,” which require a full data refeed and reindexing before the change takes effect. Depending on the size of your dataset and how many machines are in your cluster, this operation could take hours or days. Cold Updates are a pain to schedule unless you have plenty of cash for extra hardware that you can bring online while your production systems are performing a cold update and reloading the data. Having to do that on production clusters more than once or twice a year takes a fair amount of planning to get right with minimal or zero downtime. Learn more about some of the ways we help our customers get the most from their FAST installations.