Posts Tagged ‘Fast’

Fast to Lucene Solr: Choosing a Document Processing Pipeline for Solr

by Karen Lynn

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines that comprise of multiple document processing stages. A document processing stage performs a document processing task and can add, modify or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of processing stage would be one that processes a document’s URL element, ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. FAST ESP document processing engine also provides a Python plugin API to allow customers to create custom processing stages of their own, which is a feature we use heavily for our customer ESP installations.

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed to processing stages in a processing pipeline that can access document elements generated by previous processing stages. This allows for complex and optimal operations that can leverage previous processing, such as reuse of a previously generated HTML DOM tree structures. When a document is passed to a Solr update handler, the document is broken up into a set of individual fields. Each field can have a set of processors known as Solr Analysis Filter that can be chained together for field processing before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack the same level of access to other documents elements to easily support more complex processing behaviors.

Another difference between ESP and Solr platforms is that ESP’s document processing architecture allows it to be scaled independently from its indexing architecture. ESP’s document processing architecture is fully decoupled from its indexing architecture and is designed out-of-the-box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing architecture is tightly coupled with its indexing architecture, making it difficult to independently scale Solr’s content processing capacity without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multiple terabyte document sets, we find content processing tends to be the biggest bottleneck, so being able to scale content processing ability separately from indexing is mission critical.

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options? There are quite a number of content processing frameworks we can chose from that we discovered during the course of our research. Some of the options currently available include, but are not limited to OpenPipeline OpenPipe, Pypes, UIMA, SMILA , Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java which gives them access to an incredibly broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework from scratch, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer using Java and don’t have a large amount of currently working Python code to migrate. For our use cases, the decision of which framework to chose was incredibly simple given the option, so we chose Pypes for our migration solution.

For a full report on how we use Pypes for a Document Processing Engine including sample code, sign up for our free FAST to Lucene Solr White Paper here.

For Many Companies, Migration to a New Search Engine is Inevitable

by Karen Lynn

HADLEY, MA– March 12, 2012

In the world of Enterprise Search, everything is changing.  Companies who have been using Microsoft’s internal search engine, FAST Enterprise Search Platform, will be forced to make a change as Microsoft discontinues support for the search platform for companies using Linux as their operating system.  Anticipating the need for a solution, local technology consultants TNR Global is pleased to announce the release of a White Paper for migrating off FAST ESP to a new search engine, Solr.  The paper is titled Bridging the Gap: A Migration Path from Fast ESP to Apache Solr.

This effort began last October when TNR Global presented on the subject of migration from FAST to Solr at the open source conference, Apache Lucene Eurocon in Barcelona, Spain. The paper contains a case study with architecture overview, loading millions of documents into Solr indexes, evaluation and recommendation of tools to bridge the feature gap, migrating custom pipeline code, and the vastly improved ROI after implementation.  “It’s basically a road map for companies looking at options for migration, and we outline Solr as a very good option” said Karen E. Lynn, Director of Business Development.

“We have spent over 9 years working with the FAST ESP product and we understand the nuances of what customers have come to expect from the technology. We’ve identified Solr as a top choice for migrating off FAST as support for the product drops off” said Michael McIntosh, VP of Search Technologies and lead author of the paper. “Solr is an open source technology that has matured and is certainly stable enough for commercial use” said Chris Miles, Senior Software Engineer and contributor to the paper. “We’re excited about this migration option for our customers, and we believe over the long run, it will save them a lot of money and give them greater control over their search engine.”

This heavily anticipated paper will assist companies and organizations in planning their own FAST ESP to Apache Solr migrations and alert them to tools and techniques that can help them achieve a relatively painless process.  Several large blue chip companies have expressed interest in the paper.  “We’ve had a healthy response to the paper” said Lynn.

Internal search engines differ from public search engines like Google or Bing, in that an internal search engine only searches for content inside the company’s firewall.  Google cannot access internal content, therefore companies use search technology to make their content ‘findable.’ “Companies want to keep internal information safe and private.  But they still need to find it” explained Lynn.  “That’s why they need search technology integrated into their organization’s system.”

For more information on search engines, product search, web portals and search engine migration, visit TNR’s main website.  To receive a free copy of the white paper, click here.

TNR Global www.tnrglobal.com, is a systems design and integration company focused on enterprise search and cloud computing solutions for publishing companies, news sites, web directories, academia, enterprise, and SaaS companies. TNR’s past clients include the University of Massachusetts Amherst, Mass Art & Culture, InterNano, Innovara, and the Allegis Group. TNR Global is located at 245 Russell Street, Suite 10 in Hadley, Massachusetts. TNR Global serves clients throughout New England, nationally, and world-wide. Its offices are in Hadley and Greenfield, Massachusetts.