Posts Tagged ‘Lucene’

FAST to Lucene Solr: Choosing a Document Processing Pipeline for Solr

by Karen Lynn

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines, each consisting of multiple document processing stages. A document processing stage performs a document processing task and can add, modify, or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of a processing stage would be one that processes a document’s URL element. ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. The FAST ESP document processing engine also provides a Python plugin API that lets customers create custom processing stages of their own, a feature we use heavily for our customers’ ESP installations.
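To make the stage-and-pipeline idea concrete, here is a minimal Python sketch. It is not the FAST ESP plugin API: the names `url_stage`, `drop_empty_stage`, and `run_pipeline` are hypothetical, and a document is modeled as a plain dictionary of elements.

```python
from urllib.parse import urlparse

# Hypothetical model: a document is a dict mapping element names to values.
Document = dict


def url_stage(doc: Document) -> Document:
    """Add 'domain' and 'path' elements derived from the 'url' element."""
    url = doc.get("url")
    if url:
        parts = urlparse(url)
        doc["domain"] = parts.netloc
        doc["path"] = parts.path
    return doc


def drop_empty_stage(doc: Document) -> Document:
    """Remove elements whose value is an empty string or None."""
    return {name: value for name, value in doc.items() if value not in ("", None)}


def run_pipeline(doc: Document, stages) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    pipeline = [url_stage, drop_empty_stage]
    doc = {"url": "http://example.com/docs/intro.html", "title": "", "body": "Hello"}
    print(run_pipeline(doc, pipeline))
```

Each stage receives the whole document and returns it, so one stage can add elements, another can modify them, and a third can remove them before the document moves on.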

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric, while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed through the stages of a processing pipeline, and each stage can access document elements generated by previous stages. This allows for complex and efficient operations that build on earlier processing, such as reusing a previously generated HTML DOM tree structure. When a document is passed to a Solr update handler, the document is broken up into a set of individual fields. Each field can have a set of processors, known as Solr Analysis Filters, that can be chained together for field processing before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack the same level of access to other document elements, which makes it hard to support more complex processing behaviors.
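The difference can be illustrated with a small Python sketch. This is a conceptual model only, not the actual Solr or ESP API: `field_centric` applies a per-field chain of filters that each see a single field value, while `summary_stage` is a document-centric stage that reads elements produced by an earlier stage. All names here are hypothetical.

```python
def lowercase(value: str) -> str:
    return value.lower()


def strip_whitespace(value: str) -> str:
    return value.strip()


# Field-centric: each field has its own filter chain, and a filter
# receives only that field's value, never the rest of the document.
FIELD_CHAINS = {
    "title": [strip_whitespace, lowercase],
    "body": [strip_whitespace],
}


def field_centric(doc: dict) -> dict:
    out = {}
    for name, value in doc.items():
        for filt in FIELD_CHAINS.get(name, []):
            value = filt(value)
        out[name] = value
    return out


# Document-centric: a stage receives the whole document, so it can reuse
# elements that earlier stages produced (here, a pre-split token list).
def tokenize_stage(doc: dict) -> dict:
    doc["tokens"] = doc["body"].split()
    return doc


def summary_stage(doc: dict) -> dict:
    # Reuses 'tokens' from tokenize_stage instead of re-parsing 'body',
    # and combines it with 'title', which a per-field filter could not see.
    doc["summary"] = doc["title"] + ": " + " ".join(doc["tokens"][:5])
    return doc
```

In the field-centric model there is no natural place for `summary_stage`, because no single filter has both the title and the tokenized body in hand.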

Another difference between the ESP and Solr platforms is that ESP’s document processing architecture can be scaled independently of its indexing architecture. ESP’s document processing architecture is fully decoupled from its indexing architecture and is designed out of the box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing architecture is tightly coupled with its indexing architecture, making it difficult to scale Solr’s content processing capacity independently without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multi-terabyte document sets, we find that content processing tends to be the biggest bottleneck, so being able to scale content processing separately from indexing is mission critical.
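As a rough illustration of why this decoupling matters, the following Python sketch keeps processing and indexing as separate steps, so the number of processing workers can change without touching the indexer. It is a conceptual sketch, not the architecture of ESP or Solr: `process_document`, `process_all`, and `index_batch` are hypothetical names, and `index_batch` merely collects documents in a list where a real system would send them to the search engine.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def process_document(doc: dict) -> dict:
    """Stand-in for an expensive processing pipeline (hypothetical)."""
    processed = dict(doc)
    processed["word_count"] = len(doc.get("body", "").split())
    return processed


def process_all(docs, executor: Executor):
    """Run processing on whatever pool the caller supplies.

    Supplying a larger pool (or a ProcessPoolExecutor for CPU-bound work)
    scales processing; the indexing step below does not change.
    """
    return list(executor.map(process_document, docs))


def index_batch(index: list, docs) -> None:
    """Placeholder indexer: a real one would post documents to the engine."""
    index.extend(docs)


if __name__ == "__main__":
    docs = [{"id": i, "body": "lorem ipsum " * i} for i in range(1, 5)]
    index = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        index_batch(index, process_all(docs, pool))
    print([d["word_count"] for d in index])
```

A thread pool is used here only to keep the example short; the point is that the processing capacity is a parameter of the caller rather than a property of the indexer.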

If we want to leverage the power that Solr offers but need a more robust document processing framework, what are our options? During the course of our research we discovered quite a number of content processing frameworks to choose from. Some of the options currently available include, but are not limited to, OpenPipeline, OpenPipe, Pypes, UIMA, SMILA, Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java, which gives them access to an incredibly broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework from the start, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer using Java and don’t have a large amount of working Python code to migrate. For our use cases, that made the choice of framework simple: we chose Pypes for our migration solution.

For a full report on how we use Pypes as a document processing engine, including sample code, sign up for our free FAST to Lucene Solr White Paper here.

FAST ESP to Lucene Solr Presentation: Open Call for Questions

by Karen Lynn

TNR Global is excited to be participating in the Apache Lucene EuroCon conference in Barcelona. Our own Michael McIntosh is scheduled to present “Enterprise Search: FAST ESP to Lucene Solr.” Here is your chance to pre-load the discussion. Before Michael puts the final touches on his talk, he wants to know what issues or questions you may have. In the following video, he touches on some of the highlights of his upcoming talk and asks for your input.

Enterprise Search: FAST ESP to Lucene Solr pre-conference video - Click to Watch


To participate in advance, send your questions or comments to fast2solr@tnrglobal.com. While Michael cannot promise he will include your question or commentary in his actual talk, he will work to address them in an upcoming White Paper, to be released after the conference in November 2011. We look forward to hearing from you!

Migration from FAST ESP to Lucene Solr

by Tamar Schanfeld

Download the presentation and see the video.

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, spoke at the Lucene Revolution conference in Boston, MA, on October 7-8, 2010. Michael reviewed the migration from FAST ESP to Lucene/Solr open source search. He discussed approaches to identifying core content areas of HTML documents, such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, reviewed specific use cases that we have encountered as search integration experts, and discussed available tools.

TNR Global was a sponsor of Lucene Revolution. The conference gathered over 400 professionals from the enterprise search industry. We were happy to see so much interest in Lucene/Solr open source search, and to get to know and learn from the folks who have done large-scale implementations, including Twitter, LinkedIn, and eHarmony. Not surprisingly, there was a lot of interest in migration from proprietary search systems to Solr, especially from FAST ESP, due to Microsoft’s discontinuing FAST ESP support for Linux. If you would like to learn more about how a migration from FAST ESP to Lucene Solr can benefit your company, contact us for a free consultation.