Open Source Search Engines vs. Proprietary Search Engines

“If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.” –Michael McIntosh, TNR Global

There are plenty of articles about the pros and cons of open source software vs. proprietary software. I sat down with our VP of Search Technologies, Michael McIntosh, to discuss the benefits of each in terms of search engines.

Karen: You’ve been working with proprietary search engines for some time now. Tell me your thoughts on that. What’s the upside of proprietary?

Michael: I’ve worked with proprietary search engines for several years, specifically with FAST ESP since 2003, back when it was known as FAST Data Search (FDS). Proprietary software products often have better documentation, better support, more thought-out design, and more aggressive testing. Because the product supports an entire company, it must succeed. They have nicer tools and nicer interfaces.

Karen: And the downsides?

Michael: Over the years, we’ve run into a number of difficulties with proprietary search engines. One thing that comes up is that if a problem isn’t outlined in the user troubleshooting documentation, it can become incredibly difficult to diagnose and correct, and doing that is a frustrating, time-intensive process. The black-box nature of the product is very limiting. If it’s not in the documentation, it might as well not exist at all.

Karen: There are gaps in the user manual?

Michael: Yes. But in defense of FAST ESP, the documentation has improved by leaps and bounds over the years. However, one anomaly we find is that the clean, easy-to-read PDF form of the user documentation (the original) for ESP is often not as up-to-date or helpful as the searchable online documentation, which is harder to read but usually more current and correct. Sometimes even the online documentation is wrong, which is also frustrating. But it has become something we cope with in regard to FAST.

Karen: Give me an example of the kinds of problems you run into when integrating a proprietary search engine into a client’s website.

Michael: The enterprise search platform uses a Service-Oriented Architecture (SOA): a collection of different components that are able to communicate with each other as services. With this type of architecture, it doesn’t matter which language you write something in, as long as it’s a service another language can communicate with via RESTful interfaces, SOAP, or something like XML-RPC. These services can all work together despite the fact that you don’t have a unified API, and that’s actually awesome.
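
As a rough illustration of that pattern, a service written in one language can call a service written in another over XML-RPC in just a few lines. The endpoint URL and method name below are hypothetical, not actual ESP services; this is a sketch of the protocol style, not of ESP itself:

    # Minimal sketch of one service calling another over XML-RPC.
    # The endpoint URL and method name are hypothetical examples.
    from xmlrpc.client import ServerProxy

    # Connect to a (hypothetical) document-processing service.
    proxy = ServerProxy("http://localhost:8080/RPC2")

    # Invoke a remote method as if it were local; the caller's language
    # is irrelevant as long as both sides speak the protocol.
    result = proxy.process_document({"url": "http://example.com/page.html"})
    print(result)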

Karen: Why?

Michael: Indexers, for one. You generally don’t want them written in an interpreted language for performance reasons. Indexing is both a CPU-intensive and disk-intensive operation, and CPU-bound work is a weak spot for interpreted languages such as Python or Ruby compared to languages like C/C++ or Java. On the other hand, a scripting language can be great if you’re writing a component that isn’t CPU-intensive, because code speed doesn’t matter so much; the true slowdown is something outside the script. You can optimize the script to run as fast as humanly possible, but you’re still limited because the disk can only rotate so fast. So your indexing service can be written in a low-level language like C while other services are written in languages like Python, Ruby, or Java, and you still get good performance. But if you don’t have documentation for the compiled programs that make up the search engine product, you’re going to have a terrible time trying to figure out how to fix issues when they arise.
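
To make the CPU-versus-disk distinction concrete, here is a minimal timing sketch (the input file name is hypothetical): it separates time spent waiting on the disk from time spent in processing. If the I/O time dominates, rewriting the processing in a faster language buys very little:

    # Rough sketch: measure I/O time vs. CPU time for a simple
    # read-and-tokenize loop over a (hypothetical) input file.
    import time

    io_time = cpu_time = 0.0
    total_tokens = 0

    with open("documents.txt", "rb") as f:
        while True:
            t0 = time.perf_counter()
            chunk = f.read(1024 * 1024)         # disk-bound read
            io_time += time.perf_counter() - t0
            if not chunk:
                break
            t0 = time.perf_counter()
            total_tokens += len(chunk.split())  # CPU-bound stand-in
            cpu_time += time.perf_counter() - t0

    print("tokens: %d  I/O: %.2fs  CPU: %.2fs" % (total_tokens, io_time, cpu_time))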

Karen: So basically, because you have so many different languages being used that you lack the source code for, and documentation is spotty, it becomes a needle-in-a-haystack search trying to figure out where a problem occurs.

Michael: Yes. And this is not such a problem for anyone using a search engine for really basic applications. The place where you run into problems is when you push the search engine technology to its limits or use it in ways outside its typical usage, which is almost everything we do. We are always trying to get the best possible performance out of the search engine, and we’re trying to get it to handle features we need that it doesn’t natively support.

Karen: What is it you’re pushing specifically for the search engine to do?

Michael: One thing we want FAST ESP to have is a feature for creating faceted search on arbitrary fields. The way ESP works is that it has an index profile, which is a statically defined set of fields that it indexes, and inside the index profile you can mark certain fields to have navigators. One of our customers deals with product verticals: they have a whole bunch of products that aren’t unified, all with completely different attributes. We’ve managed to work around these roadblocks in ESP to create faceted navigation on arbitrary fields, as sketched below.
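
Michael doesn’t spell out the workaround here, but one common pattern for faceting on arbitrary fields in an engine with a statically defined schema is to encode key/value pairs as tokens in a single multi-valued navigator field. The sketch below illustrates the encoding idea only; the field and attribute names are hypothetical, and this is not ESP’s actual API:

    # Hypothetical sketch: pack arbitrary product attributes into one
    # statically defined navigator field as "key:value" tokens, so a
    # fixed index profile can still facet on any attribute.
    def encode_facets(attributes):
        """Flatten an arbitrary attribute dict into navigator tokens."""
        return ["%s:%s" % (k, v) for k, v in sorted(attributes.items())]

    doc = {
        "title": "Cordless Drill",
        # 'generic_facets' stands in for the one navigator field in the
        # index profile; its contents vary per product vertical.
        "generic_facets": encode_facets({"voltage": "18V", "brand": "Acme"}),
    }
    print(doc["generic_facets"])  # ['brand:Acme', 'voltage:18V']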

Karen: So you get creative to make it better.

Michael: Yes, we constantly get creative to make it better, playing to its strengths and finding ways to work around its limitations.

Another issue we have with ESP is that we have a number of websites we need crawled, and each website has metadata associated with it. Unfortunately, the way the ESP crawler works, there are not many straightforward ways to preserve the metadata associated with the seed URL we use to crawl a website and pass that meta information along to any associated links. We can’t do this easily inside the ESP crawler. Since it’s proprietary and a black box, we can’t look at the crawler’s source code, and we can’t modify it. When it does something mysterious, we may have no idea why it is behaving in an unwanted way.

We had one instance where the crawler was temporarily blacklisting a number of our websites for some reason. When the ESP crawler automatically blacklists a site, it stops crawling the site for 30 minutes and then begins again. We learned that one thing that triggers blacklisting is a website returning HTTP 503 errors; if a site has more than 20 or so of those errors, the crawler temporarily blacklists it. The problem was that the documentation is too sparse on the details of that topic. When we ran into the problem, it was really difficult to figure out what was going on so we could properly explain the issue to the client and address it. Conversely, if we had an open source search engine, I could have just searched the source code and sped up the diagnosis.
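
One general way to approximate what the crawler won’t do natively is to keep the per-seed metadata in a lookup table and re-attach it to each crawled document during processing, keyed on the document’s hostname. This is a minimal sketch of that idea, with hypothetical hostnames and metadata, not actual ESP crawler code:

    # Hypothetical workaround sketch: map each crawled URL back to its
    # seed site and re-attach the seed's metadata during document
    # processing, since the crawler won't pass it along by itself.
    from urllib.parse import urlparse

    SEED_METADATA = {
        "www.example.com": {"industry": "manufacturing", "region": "US"},
        "www.example.org": {"industry": "chemicals", "region": "EU"},
    }

    def attach_seed_metadata(doc):
        """Copy the owning seed's metadata onto a crawled document."""
        host = urlparse(doc["url"]).netloc
        doc.update(SEED_METADATA.get(host, {}))
        return doc

    print(attach_seed_metadata({"url": "http://www.example.com/products/1"}))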

Karen: So from a business perspective, using open source allows you to invest in a technology that gives you the power to modify the code to better meet your business needs.

Michael: It certainly can. It can accelerate development time and speed the diagnosis of problems when issues pop up. And issues always do pop up. If something is not working very well, we can look at the problem with a much higher degree of granularity.

If it’s a simple problem, we never contact support. We only contact support when we’re stumped, and we’re not easily stumped. Usually, support can’t answer the question immediately, because if we’re asking for help, the problem is complex. Our ticket is escalated, and eventually we talk to someone who can help us. But it does take time. Even if the support staff is top notch, there is a cost to dealing with all that, and it costs us and our clients time. We have a highly customized ESP installation for one of our clients, and it always takes an enormous amount of time to explain, over and over, how our systems are set up and how the different parts work; it’s a big pain to go through that every time I run across a problem. If it were open source, I could simply look at the source code and solve the problem.

Karen: Let’s talk in more detail about open source search engines. Upsides?

Michael: If you choose a popular open source search engine solution like Lucene Solr, you have an active, passionate community behind that solution. There are many developers looking at the engine, working on it, and actively posting in publicly available forums. You can often get your questions answered there by top-notch experts in search technology. You can potentially talk to the original coders and creators of the product, and they are often happy to help you. I’ve seen people post a Solr question on their Twitter feed, and within seconds the creator has responded with a link to a forum post explaining the solution.

Karen: Wow, that’s amazing. Other advantages?

Michael: It’s free. That’s attractive to most companies. The downside is that the formal documentation usually isn’t as good as a proprietary product’s, and there isn’t a dedicated support team behind the product. But if you have some savvy software developers on your team, the open source community is robust and willing to share information about the product. And having access to the source code is extremely valuable.

Karen: So in your opinion, what’s the bottom line on Open Source Search Engines vs. Proprietary Search Engines?

Michael: If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.

If you’d like more information on the pros and cons of Open Source Search Engines vs. Proprietary Search Engines that are specific to your business or organization’s needs, feel free to contact us for a free consultation.

TNR Global to present at Apache Lucene Eurocon 2011 in Barcelona

We are happy to announce that TNR Global’s own Michael McIntosh will be presenting at the Apache Lucene Eurocon 2011 in Barcelona this October. Michael’s talk is titled “Enterprise Search: FAST ESP to Lucene Solr.” His presentation will discuss migration from the FAST ESP platform to a Lucene Solr search platform. There are many reasons an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene Solr; in the case of FAST, the company’s purchase by Microsoft and the discontinuation of the Linux platform have created urgency for FAST users. Illustrated through actual case studies, the presentation will cover challenges and concerns and present solutions and workarounds for migration issues.

Michael has more than 16 years of experience in large scale systems design and operation, online consumer product development, high volume transaction processing and engineering management. He has extensive experience developing, integrating and maintaining search technology solutions for companies such as FAST Search and Lycos.

We’re excited that Michael will be presenting in Barcelona this fall.  Please introduce yourself if you’re able to go!

Fast ESP Error: no doc procs registered to process a batch with priority 0

Just wanted to take this error message off of the “Hey, we’ve seen this before… now how did we resolve this?” pile. This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to 
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s). However, this won’t necessarily solve the problem. It turns out that a likely reason for this error to pop up is a bad Document Processing Pipeline (DPP) stage: the docprocs fire up, hit the bad stage (e.g., Python errors), and don’t recover.

To debug your DPP stage, take a look at the logs for the document processor(s). They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught Python exception lurking somewhere in there.
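
A quick way to surface those exceptions is to scan the procserver logs for traceback markers. Here is a minimal sketch, assuming the default log location and that FASTSEARCH is set in the environment:

    # Minimal sketch: scan procserver logs for Python tracebacks.
    # Assumes $FASTSEARCH is set and the default log location is used.
    import glob
    import os

    log_dir = os.path.join(os.environ["FASTSEARCH"], "var", "log", "procserver")
    for path in sorted(glob.glob(os.path.join(log_dir, "*"))):
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if "Traceback (most recent call last)" in line:
                    print("%s:%d: %s" % (path, lineno, line.strip()))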

Integrate custom services with the Fast ESP Node Controller

Add your own services to ESP's Node Controller

Background

Integrating our own custom services with Fast ESP’s Node Controller provides us with several benefits:

  • Administrators without in-depth ESP knowledge can easily control services (e.g. start, stop, configure parameters)
  • Services can be started at boot time with the rest of ESP
  • espdeploy can be used to install our services in a multi-node cluster

The components required for this system are:

  1. The ESP Node Controller (with config file NodeConf.xml)
  2. The 3rd-party service (like a CherryPy server, log parser, etc.)
  3. A wrapper script (see below)

Steps for Integration

  1. Define the service you would like to integrate. It can be any script or binary that can be executed on the system. For example, the service might be a Python script that takes command-line arguments and keeps running on its own, as is the case with a web server (a minimal sketch of such a service appears after these steps).
  2. Create a wrapper script that sets up the proper environment and starts and stops the service cleanly. The wrapper script should be placed in the $FASTSEARCH/bin directory (with executable permissions). Additionally, the wrapper script should pass "$@" to your actual script so that any arguments defined in $FASTSEARCH/etc/NodeConf.xml are passed along properly from the Node Controller to your service. The following is an example of a wrapper script:
    #!/bin/sh
    
    # export the proper python path
    export PYTHONPATH=":/path/to/python"
    
    # run the service in the background, passing along all arguments
    # defined in NodeConf.xml ("$@" preserves them as given)
    python $FASTSEARCH/lib/python2.6/yourmodule/yourservice.py "$@" &
    
    # remember the process id of the python script
    SCRIPT_PID="$!"
    
    # upon receiving a TERM signal from nctrl, forward it to the service
    trap "kill -TERM $SCRIPT_PID" TERM
    
    # wait for the service to exit (or for the trap to fire)
    wait
  3. Define the service in $FASTSEARCH/etc/NodeConf.xml
    Add the following to the end of the <startorder> tag:

    <proc>servicename</proc>

    Add the following to the end of the <node> tag, customizing as appropriate:

    <!-- My Custom Service -->
    <process name="servicename" description="My Custom Service">
            <start>
                    <executable>binaryname</executable>
                    <parameters>-p 16940 -v</parameters>
                    <port base="4004"/>
            </start>
            <outfile>servicename.scrap</outfile>
    </process>
  4. Reload the Node Controller configuration with the following:
    nctrl reloadcfg

And that’s it!  Now you should be able to start, stop, configure, and deploy your services using Fast ESP tools.  Enjoy!
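
For reference, here is a minimal sketch of the kind of long-running service described in step 1. It parses the arguments NodeConf.xml passes along (matching the -p 16940 -v example above) and exits cleanly when the wrapper forwards a TERM signal; the module itself is hypothetical:

    # Hypothetical service: runs until the wrapper forwards SIGTERM.
    import argparse
    import signal
    import sys
    import time

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("-p", "--port", type=int, required=True)
        parser.add_argument("-v", "--verbose", action="store_true")
        args = parser.parse_args()

        # Exit cleanly when nctrl (via the wrapper) sends SIGTERM.
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))

        if args.verbose:
            print("service listening on port %d" % args.port)
        while True:              # stand-in for the real service loop
            time.sleep(1)

    if __name__ == "__main__":
        main()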

FAST ESP Overview

We use FAST ESP to power a large industrial search engine that lists over 1 million companies, indexes over 3 million documents, and receives millions of visitors every month. I have been working with ESP since 2003 (when it was known as FDS 3.2).

FAST ESP is extremely flexible and can index many document types (HTML, PDF, Word, etc.). It has a very robust crawler for web documents, and you can use the intermediary FastXML format to load custom document formats into the system, or use the Content APIs.