FAST to Lucene Solr: Choosing a Document Processing Pipeline for Solr

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options?

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines, each consisting of multiple document processing stages. A document processing stage performs a document processing task and can add, modify, or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of a processing stage would be one that processes a document’s URL element. ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. The FAST ESP document processing engine also provides a Python plugin API that allows customers to create custom processing stages of their own, a feature we use heavily in our customers’ ESP installations.
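To make the idea of stages and pipelines concrete, here is a minimal sketch in plain Python. It is not the FAST ESP plugin API: the names Stage, UrlStage, DropEmptyStage, and Pipeline, and the use of a dict as the document, are assumptions made purely for illustration.

```python
from urllib.parse import urlparse


class Stage:
    """Base class for a processing stage (hypothetical, not the FAST ESP API)."""

    def process(self, doc):
        raise NotImplementedError


class UrlStage(Stage):
    """Add 'domain' and 'path' elements derived from the document's 'url' element."""

    def process(self, doc):
        parts = urlparse(doc.get("url", ""))
        doc["domain"] = parts.netloc
        doc["path"] = parts.path
        return doc


class DropEmptyStage(Stage):
    """Remove elements whose value is empty."""

    def process(self, doc):
        return {key: value for key, value in doc.items() if value not in ("", None)}


class Pipeline:
    """Run a document through each stage in order."""

    def __init__(self, stages):
        self.stages = list(stages)

    def run(self, doc):
        for stage in self.stages:
            doc = stage.process(doc)
        return doc


pipeline = Pipeline([UrlStage(), DropEmptyStage()])
result = pipeline.run({"url": "http://www.example.com/docs/a.html", "title": ""})
```

In this sketch the first stage adds elements and the second removes one, which mirrors the add, modify, or remove behavior described above. A real stage would do considerably more work.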

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric, while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed through the stages of a processing pipeline, and each stage can access document elements generated by previous stages. This allows for complex and efficient operations that build on earlier processing, such as reusing a previously generated HTML DOM tree. When a document is passed to a Solr update handler, the document is broken up into a set of individual fields. Each field can have a set of processors, known as Solr analysis filters, that can be chained together for field processing before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack the same level of access to other document elements, which makes more complex processing behaviors hard to support.
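The difference between the two models can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Solr or ESP code: the stage functions, the "parsed" element, and the filter helpers are all hypothetical names. The document-centric stages share one parse of the HTML, while the field-centric chain only ever sees a single field value.

```python
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collect the title text and link targets in one pass over the HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Document-centric: the first stage parses once and stores the result on the
# document, and later stages reuse it instead of parsing the HTML again.
def parse_html_stage(doc):
    parser = _Collector()
    parser.feed(doc["html"])
    parser.close()
    doc["parsed"] = {"title": parser.title.strip(), "links": parser.links}
    return doc


def title_stage(doc):
    doc["title"] = doc["parsed"]["title"]
    return doc


def link_count_stage(doc):
    doc["outlink_count"] = len(doc["parsed"]["links"])
    return doc


# Field-centric: each filter receives one field value and nothing else.
def strip_filter(value):
    return value.strip()


def lowercase_filter(value):
    return value.lower()


def run_field_chain(value, filters):
    for field_filter in filters:
        value = field_filter(value)
    return value


doc = {"html": "<html><head><title> Hello </title></head>"
               "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"}
for stage in (parse_html_stage, title_stage, link_count_stage):
    doc = stage(doc)

normalized = run_field_chain("  Hello World ", [strip_filter, lowercase_filter])
```

Note that lowercase_filter has no way to ask how many links the document contains, whereas link_count_stage can read anything an earlier stage produced.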

Another difference between the ESP and Solr platforms is that ESP’s document processing architecture allows it to be scaled independently from its indexing architecture. ESP’s document processing architecture is fully decoupled from its indexing architecture and is designed out of the box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing architecture is tightly coupled with its indexing architecture, making it difficult to independently scale Solr’s content processing capacity without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multi-terabyte document sets, we find content processing tends to be the biggest bottleneck, so being able to scale content processing capacity separately from indexing is mission-critical.
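A rough sketch of what decoupling buys you is shown below: the size of the processing pool is a parameter that has nothing to do with the indexer. Everything here is an assumption for illustration, including process_document, InMemoryIndexer, and feed. A thread pool stands in for the processing tier only to keep the example short and runnable; CPU-bound processing would more likely use separate processes or separate machines, and a real indexer would send documents to Solr rather than to a list.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(doc):
    """Stand-in for an expensive processing pipeline (hypothetical)."""
    processed = dict(doc)
    processed["token_count"] = len(doc["body"].split())
    return processed


class InMemoryIndexer:
    """Stand-in for the indexing tier; it only collects documents in a list."""

    def __init__(self):
        self.documents = []

    def add(self, doc):
        self.documents.append(doc)


def feed(docs, indexer, processing_workers):
    """Process documents on a pool sized independently of the indexer."""
    with ThreadPoolExecutor(max_workers=processing_workers) as pool:
        # pool.map yields results in input order, so indexing order is stable.
        for processed in pool.map(process_document, docs):
            indexer.add(processed)
    return indexer


docs = [{"id": i, "body": "word " * (i + 1)} for i in range(5)]
indexer = feed(docs, InMemoryIndexer(), processing_workers=4)
```

Raising processing_workers adds processing capacity without touching the indexing side, which is the property the paragraph above describes for ESP.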

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options? During the course of our research we discovered quite a number of content processing frameworks to choose from. Some of the options currently available include, but are not limited to, OpenPipeline, OpenPipe, Pypes, UIMA, SMILA, Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java, which gives them access to an incredibly broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework from the start, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer using Java and don’t have a large amount of currently working Python code to migrate. For our use cases, the decision of which framework to choose was incredibly simple once a Python option was available, so we chose Pypes for our migration solution.

For a full report on how we use Pypes for a Document Processing Engine including sample code, sign up for our free FAST to Lucene Solr White Paper here.

For Many Companies, Migration to a New Search Engine is Inevitable

HADLEY, MA– March 12, 2012

In the world of Enterprise Search, everything is changing. Companies that have been using Microsoft’s internal search engine, FAST Enterprise Search Platform, will be forced to make a change as Microsoft discontinues support for the search platform for companies using Linux as their operating system. Anticipating the need for a solution, local technology consulting firm TNR Global is pleased to announce the release of a White Paper for migrating off FAST ESP to a new search engine, Solr. The paper is titled Bridging the Gap: A Migration Path from Fast ESP to Apache Solr.

This effort began last October when TNR Global presented on the subject of migration from FAST to Solr at the open source conference Apache Lucene EuroCon in Barcelona, Spain. The paper contains a case study with an architecture overview, loading millions of documents into Solr indexes, evaluation and recommendation of tools to bridge the feature gap, migrating custom pipeline code, and the vastly improved ROI after implementation. “It’s basically a road map for companies looking at options for migration, and we outline Solr as a very good option,” said Karen E. Lynn, Director of Business Development.

“We have spent over 9 years working with the FAST ESP product and we understand the nuances of what customers have come to expect from the technology. We’ve identified Solr as a top choice for migrating off FAST as support for the product drops off,” said Michael McIntosh, VP of Search Technologies and lead author of the paper. “Solr is an open source technology that has matured and is certainly stable enough for commercial use,” said Chris Miles, Senior Software Engineer and contributor to the paper. “We’re excited about this migration option for our customers, and we believe over the long run, it will save them a lot of money and give them greater control over their search engine.”

This heavily anticipated paper will assist companies and organizations in planning their own FAST ESP to Apache Solr migrations and alert them to tools and techniques that can help them achieve a relatively painless process.  Several large blue chip companies have expressed interest in the paper.  “We’ve had a healthy response to the paper” said Lynn.

Internal search engines differ from public search engines like Google or Bing in that an internal search engine only searches for content inside the company’s firewall. Google cannot access internal content, so companies use search technology to make their content ‘findable.’ “Companies want to keep internal information safe and private. But they still need to find it,” explained Lynn. “That’s why they need search technology integrated into their organization’s system.”

For more information on search engines, product search, web portals and search engine migration, visit TNR’s main website.  To receive a free copy of the white paper, click here.

TNR Global (www.tnrglobal.com) is a systems design and integration company focused on enterprise search and cloud computing solutions for publishing companies, news sites, web directories, academia, enterprise, and SaaS companies. TNR’s past clients include the University of Massachusetts Amherst, Mass Art & Culture, InterNano, Innovara, and the Allegis Group. TNR Global is located at 245 Russell Street, Suite 10 in Hadley, Massachusetts. TNR Global serves clients throughout New England, nationally, and world-wide. Its offices are in Hadley and Greenfield, Massachusetts.

UK Software Company TwigKit Partners with TNR Global to Deliver Search Solutions for FAST and Solr

“TNR’s focus on implementing and servicing enterprise search solutions across a number of platforms is an excellent fit for TwigKit,” says Stefan Olafsson, TwigKit’s co-founder and chief architect.

Hadley, MA–November 28, 2011–TNR Global announced today that they have entered into a strategic partnership with London, UK software company TwigKit.

“Our companies have a number of qualities in common that allow us to combine forces and service clients with a very complete solution,” says Karen Lynn, TNR Global’s Director of Business Development. “TwigKit has a very appealing user interface for users across several platforms, and TNR’s strength is in creating a powerful back end search application. Combined, it’s a powerful solution for companies needing a strong search function with an easy to use interface.”

Representatives from the two companies have been in friendly talks for over a year now, meeting periodically at industry conferences. Both companies were in attendance at the Apache Lucene EuroCon conference in Barcelona last October, where the partnership was formalized.  Both presentations from TwigKit and TNR can be viewed here.

“TNR’s focus on implementing and servicing enterprise search solutions across a number of platforms is an excellent fit for TwigKit,” says Stefan Olafsson, TwigKit’s co-founder and chief architect. “Our software enables polished user interfaces for search-based applications, provides a rapid development framework, and works across a number of enterprise search platforms including Microsoft FAST and Apache Solr. We’re excited about working with TNR to produce search solutions that boast both a superb user experience and an outstanding technical implementation.”

TwigKit powers enterprise search applications in government and blue-chip organizations. Encapsulating search best practices into configurable components, TwigKit establishes a platform-independent standard compatible with most search technologies including Microsoft FAST, Google Search Appliance, and Apache Solr. Started in London in 2009, TwigKit’s founders organize the 350-member Enterprise Search London meetup, regularly speak at conferences, and write about search and user experience for publications including A List Apart, Boxes & Arrows, and UX Magazine.

TNR Global (TNR) is a systems design and integration company focused on enterprise search and cloud computing solutions. TNR develops scalable web-based search solutions built on the open source LAMP stack. TNR has over 10 years of hands-on experience in web systems and enterprise search implementations, both proprietary and open source search technologies, specializing in FAST and Lucene Solr search applications. Specifically TNR works with content intensive websites for companies and organizations in the following industries: News Sites, Publishing, Web Directories, Information Portals, Web Catalogs, Education, Manufacturing and Distribution, Customer Service, and Life Sciences. TNR Global has offices in western Massachusetts.

TNR Global Attends KMWorld’s Enterprise Search Summit Fall 2011

A proof of concept and rapid integration are essential for search customers–they cannot visualize what a search solution will look like without some help from the search professional.

Last week TNR Global attended the Enterprise Search Summit organized by KMWorld in Washington, DC. VP of Search Technologies Michael McIntosh and Director of Business Development Karen E. Lynn attended the three-day conference and Enterprise Solutions Showcase at the Marriott Wardman Park. Several companies were in attendance, and some common themes emerged. Among these were designing for users, dealing with unstructured content, the need for better search and content analytics to facilitate better search results, and tagging content as a workflow best practice. Also discussed was the need for search vendors to demonstrate to search customers what “right” looks like in a search solution. A proof of concept and rapid integration are essential for search customers: they cannot visualize what a search solution will look like without some help from the search professional.

An unexpected surprise came when the speaker on open source search was unable to attend at the last moment and our own Michael McIntosh was asked to step in and present on the subject. Fortunately, he was fresh from his presentation at Apache Lucene EuroCon and already had his presentation loaded on his machine. Michael discussed Solr and made general points on migrating from a commercial search engine like FAST ESP to an open source platform like Lucene Solr.

Overall it was a great conference with lots of informative talks and friendly search professionals.  We’re looking forward to the next Enterprise Search Summit in Spring, 2012.

We’re at Apache Lucene EuroCon in Barcelona 2011

“We’re certain that the urgency to migrate off FAST ESP will be ramping up significantly.”

We’re very excited to be in attendance at the Apache Lucene EuroCon in Barcelona October 17, 18, 19, and 20th, 2011.  Our own Michael McIntosh, VP of Search Technologies will be presenting a talk on October 19th, Enterprise Search: FAST ESP to Lucene Solr.  The good folks at Lucid Imagination are presenting the conference and will be video recording his talk for future broadcast.

After the conference, Michael will author a White Paper on migrating from FAST ESP to Lucene Solr, expected in November 2011. For a free copy of the White Paper, email us expressing your interest at fast2solr@tnrglobal.com. We believe that those businesses operating on a Linux system will be seeking out the power of Lucene Solr as their licenses expire and support for FAST ESP dries up. We’ve worked with FAST ESP for 7 years and understand its strengths and weaknesses. We know businesses that are used to the power of FAST ESP will need something just as powerful, and Lucene Solr is a very nice fit. “It’s a robust platform, capable of a lot that FAST ESP covers,” said Michael. “We’re certain that the urgency to migrate off FAST ESP will be ramping up significantly.”

FAST ESP to Lucene Solr Presentation: Open Call for Questions

To pre-load the discussion on Michael’s Enterprise Search: FAST ESP to Lucene Solr talk, send your questions to fast2solr@tnrglobal.com. We want to hear from you!

TNR Global is excited to be participating in the Apache Lucene EuroCon conference in Barcelona. Our own Michael McIntosh is scheduled to present “Enterprise Search: FAST ESP to Lucene Solr.” Here is your chance to pre-load the discussion. Before Michael puts the final touches on his talk, he wants to know what issues or questions you may have. In the following video, he touches on some of the highlights of his upcoming talk, and asks for your input.

Enterprise Search: FAST ESP to Lucene Solr pre-conference video (click to watch)

To participate in advance, send your questions or comments to fast2solr@tnrglobal.com. While Michael cannot promise he will include your question or commentary in his actual talk, he will work to address them in an upcoming White Paper, to be released after the conference in November 2011. We look forward to hearing from you!

Crawling Solr

“We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing an ESP to Solr migration.”

Recently there has been a lively discussion in LinkedIn’s Enterprise Search Engine Professionals Group that started with this question:


“Is it a handicap for Solr to depend on third party solutions for crawling the Web like Nutch?”


Our own Michael McIntosh felt compelled to respond. What follows is his post to this topic in its entirety.


“This topic makes me think of the saying “Write programs that do one thing and do it well.” The longer version of this philosophy, as expressed by Doug McIlroy, is this: “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” Solr stands very well on its own and, based upon my impression of the Solr community so far, more people currently use Solr for structured content than for unstructured content like web documents. I think that Solr should have some ‘out of the box’ web crawler implementation available, but it should not be the core focus. It can serve to allow new users of Solr to focus more on the Solr/Lucene side of things and not have to worry about rolling their own crawler or figuring out which is the best third-party crawling solution to use. I suspect that many people who need to do crawling can get by with a fairly basic crawler. My impression of Nutch so far is that it is more complicated than most Solr users need out of the starting gate. That said, if you have a business that deals with large amounts of crawled unstructured content, it’s very likely you will need something more robust than can reasonably be shipped and supported as part of the Solr project. For one of our clients, the size of our dataset has grown from needing just a couple of boxes to multiple clusters with many machines each. One of the newest developments is that the amount of unstructured content has grown to a size where we now need a crawler CLUSTER. When we first started on this, it never occurred to us that we might need multiple machines for the crawling side of the equation, but it has happened. But I think our case is less common. All in all, I think Solr should have a bare-bones reference implementation of a crawler that can easily be expanded upon, but it is probably not an effective use of effort for Solr developers to focus on the crawling side. Let a third party focus on the issues of crawling; it is a deceptively complicated issue.”


After his post I caught him in the office and asked where he was going with this line of thinking. “We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing an ESP to Solr migration,” he revealed. Sounds like a very promising solution to a fairly big and common problem for companies with vast amounts of metadata. And as for unstructured content? Well, it’s the proverbial elephant in the room, don’t you think?


To see the entire conversation, with contributions from experts in the field of search architecture, click here. To get in touch with Michael directly to discuss your architecture and crawling needs, contact us.

Open Source Search: Isn’t It Expensive?

You’ve heard the debate on open source search vs. proprietary search. One question that constantly comes up for prospective clients is “What’s all this going to cost me?”

In these times, it’s a good question. Because proprietary search comes in neatly packaged, practically shrink-wrapped plans, it’s much easier to discern how much you will spend on a solution. But how much will it cost? That’s an entirely different question.

I see you cocking your head sideways.

Proprietary search has hidden costs. What if the software doesn’t perform the way you need it to? Does the software understand the nuances of your business? How adaptable is it? How much will it cost to adapt that software so it performs the way your business needs it to? Questions like these need to be asked, and answered. Eventually you will ask yourself, “Why am I paying for all of this?” And your developer will ask, “Why can’t I access the source code?”

What I’m getting at is this: It is a reassuring feeling for a customer to see what a package costs, to understand what services you will get with a solution, and to anticipate what the licensing fee will cost on an annual basis. If it’s your job to research a solution and present findings to your executive team to make a decision, then proprietary search, on the surface, seems a more secure choice. But rarely, if ever, are these solutions a perfect fit for the customer. It’s like buying a Ferrari, with all the brand recognition and polish a Ferrari offers, and not ever driving it past second gear, or cutting the wheel more than 15 degrees, or getting a chance to have your trusted mechanic look under the hood. This is why open source is such a good solution for businesses who want their IT to move quickly.

We’re hearing more buzz about companies waking up to the agility of an open source solution. Most recently, with the acquisition of Autonomy by HP, the industry is telling stories of ex-Autonomy customers migrating to Solr (open source search) with only the annual licensing budget to finance the migration. Without an annual expenditure of cash for licensing, and with the freedom of not being under a licensing agreement, companies quickly recoup the initial expenditure of a migration.

What kind of car does your company drive?

If you are examining the different choices for implementing search technology in your organization, contact us.  We’re happy to talk to you about the best solution for your business.


Migration Still Looms Large on the Horizon for FAST ESP Customers

“Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation.” –Michael McIntosh, VP of Search Technologies, TNR Global, LLC

Microsoft acquired FAST all the way back in 2008 and then in early 2010 disclosed its plans to stop updating the FAST product on the Linux operating system after 2010, making FAST ESP 5.3 the latest and greatest, and very last, update Linux users will see involving any improvements to the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner rather than later.

Buzz about migration ensued–an inevitable certainty for many companies, especially ones with huge amounts of data. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the industry, believed that the move from Microsoft was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means, when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, for the FAST ESP technical team with the timing of renewing a license and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy and the shift of HP from hardware to software could signal that Goliath HP recognizes the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs,” says McIntosh. If a need for migration is imminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in-house talent, and growth expectations.

How is your company preparing for the discontinuation of support of FAST ESP?  Need guidance?  Contact us for pointers, analysis, or architecture for a full migration.