Posts Tagged ‘document processing pipeline’

FAST to Lucene Solr: Choosing a Document Processing Pipeline for Solr

by Karen Lynn

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines, each composed of multiple document processing stages. A document processing stage performs a single processing task and can add, modify, or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of a processing stage is one that processes a document’s URL element. ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. The FAST ESP document processing engine also provides a Python plugin API that allows customers to create custom processing stages of their own, a feature we use heavily in our customers’ ESP installations.
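To make the concept concrete, here is a minimal sketch of a document-centric stage and pipeline in Python. It is purely illustrative and does not use the actual FAST ESP plugin API; the class names, the dict-shaped document, and the process() method are assumptions made for the example.

    from urllib.parse import urlparse


    class UrlStage:
        """Hypothetical stage: derives a 'site' element from the 'url' element."""

        def process(self, document):
            url = document.get("url")
            if url:
                # Add a new element based on an existing one.
                document["site"] = urlparse(url).netloc
            # Remove a scratch element an earlier stage may have left behind.
            document.pop("raw_bytes", None)
            return document


    class Pipeline:
        """Hypothetical pipeline: runs each stage in order on the same document."""

        def __init__(self, stages):
            self.stages = stages

        def process(self, document):
            for stage in self.stages:
                document = stage.process(document)
            return document


    if __name__ == "__main__":
        pipeline = Pipeline([UrlStage()])
        print(pipeline.process({"url": "http://example.com/a/b", "raw_bytes": b"..."}))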

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric, while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed through the processing stages of a pipeline, and each stage can access document elements generated by previous stages. This allows for complex and efficient operations that leverage earlier processing, such as reusing a previously generated HTML DOM tree structure. When a document is passed to a Solr update handler, by contrast, the document is broken up into a set of individual fields. Each field can have a set of processors, known as Solr analysis filters, that are chained together to process the field before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack the same level of access to other document elements, which makes more complex processing behaviors difficult to support.
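As a rough illustration of the difference (sketched in Python rather than Solr’s Java analysis API, with made-up function and element names), a field-centric filter only ever sees one field’s token stream, while a document-centric stage sees the whole document, including elements produced by earlier stages:

    # Field-centric: each filter sees only a single field's token stream.
    def lowercase_filter(tokens):
        return [t.lower() for t in tokens]


    def stopword_filter(tokens, stopwords=frozenset({"the", "a", "an"})):
        return [t for t in tokens if t not in stopwords]


    def analyze_field(value, filters):
        tokens = value.split()
        for filter_fn in filters:
            tokens = filter_fn(tokens)
        return tokens


    # Document-centric: a stage can read any element, including the output
    # of an earlier stage (for example, a parsed DOM), and combine them.
    def anchor_text_stage(document):
        dom = document.get("dom")
        if dom is not None:
            document["anchor_text"] = " ".join(
                a.get("text", "") for a in dom.get("anchors", []))
        return document


    print(analyze_field("The Quick Fox", [lowercase_filter, stopword_filter]))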

Another difference between the ESP and Solr platforms is that ESP’s document processing architecture can be scaled independently of its indexing architecture. ESP’s document processing is fully decoupled from indexing and is designed out of the box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing, in contrast, is tightly coupled with its indexing architecture, making it difficult to scale Solr’s content processing capacity independently without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multi-terabyte document sets, we find content processing tends to be the biggest bottleneck, so being able to scale content processing capacity separately from indexing is mission critical.

If we want to leverage the power that Solr offers but need support for a more robust document processing framework, what are our options? During the course of our research we discovered quite a number of content processing frameworks to choose from. The options currently available include, but are not limited to, OpenPipeline, OpenPipe, Pypes, UIMA, SMILA, Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java, which gives them access to a broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer Java and don’t have a large body of working Python code to migrate. For our use cases, the decision was simple: we chose Pypes, a Python-based framework, as our migration solution.
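One reason the Python route minimizes migration effort is that existing ESP-style stage code can often be reused behind a thin adapter rather than rewritten. The sketch below is hypothetical, it does not use the real Pypes component API (see the white paper below for actual sample code); it only shows the general shape of wrapping an existing process(document)-style callable for a Python pipeline:

    class LegacyStageAdapter:
        """Wraps an existing process(document)-style callable as a pipeline component."""

        def __init__(self, legacy_stage):
            self.legacy_stage = legacy_stage

        def __call__(self, document):
            # Delegate to the existing, already-tested Python stage code.
            return self.legacy_stage(document)


    def existing_esp_stage(document):
        # Stand-in for a customer's current ESP Python stage logic.
        document["processed"] = True
        return document


    components = [LegacyStageAdapter(existing_esp_stage)]
    doc = {"url": "http://example.com"}
    for component in components:
        doc = component(doc)
    print(doc)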

For a full report on how we use Pypes for a Document Processing Engine including sample code, sign up for our free FAST to Lucene Solr White Paper here.

How to create a duplicate ESP collection without re-crawling!

by admin

In a production (or even stable) ESP environment, it is difficult to change the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection, as described in the steps below (a small script wrapping the same sequence follows the steps).

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Upon determining that your changes are producing good results, you can easily migrate your front-end to the new collection while still maintaining existing stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command, where origcollection is the original collection and newcollection is the new collection you just created:

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • The default specified above is a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml; specifying default sends the content to the current ESP install.
    • Be sure to run the above command under either nohup or screen, as it will not exit until all content has been fed to the new collection. For large collections this may take a while.
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler
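For convenience, the same sequence can be scripted. This is only a minimal sketch: it uses exactly the commands shown above, takes the collection names as arguments, and assumes $FASTSEARCH is set in the environment. Like the raw postprocess command, run it under nohup or screen.

    #!/usr/bin/env python
    import os
    import subprocess
    import sys

    FASTSEARCH = os.environ["FASTSEARCH"]


    def run(*args):
        print("running:", " ".join(args))
        subprocess.check_call(args)


    def duplicate_collection(origcollection, newcollection):
        # Step 3: stop the Enterprise Crawler.
        run(os.path.join(FASTSEARCH, "bin", "nctrl"), "stop", "crawler")
        try:
            # Step 4: re-feed the original collection into the new one.
            run(os.path.join(FASTSEARCH, "bin", "postprocess"),
                "-R", origcollection,
                "-k", "default:" + newcollection)
        finally:
            # Step 5: restart the Enterprise Crawler.
            run(os.path.join(FASTSEARCH, "bin", "nctrl"), "start", "crawler")


    if __name__ == "__main__":
        duplicate_collection(sys.argv[1], sys.argv[2])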

Fast ESP Error: no doc procs registered to process a batch with priority 0

by admin

Just wanted to take this error message off the “Hey, we’ve seen this before… now how did we resolve this?” pile.  This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s).  However, this won’t necessarily solve the problem.  It turns out that a likely reason for this error is a bad Document Processing Pipeline (DPP) stage: the docprocs fire up, hit the bad stage (e.g. Python errors), and don’t recover.

To debug your DPP stage, take a look at the logs for the document processor(s).  They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught Python exception lurking somewhere in there.
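If you want a quick way to surface those exceptions, something like the following sketch can help. It assumes plain-text log files under the directory above and simply prints any line containing a Python traceback marker; adjust the path and pattern for your installation.

    #!/usr/bin/env python
    import glob
    import os

    LOG_DIR = os.path.join(os.environ["FASTSEARCH"], "var", "log", "procserver")

    for path in sorted(glob.glob(os.path.join(LOG_DIR, "*"))):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, 1):
                # Uncaught exceptions in a stage usually show up as tracebacks.
                if "Traceback (most recent call last)" in line:
                    print("%s:%d: %s" % (path, lineno, line.strip()))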