Open Source Search Engines vs. Proprietary Search Engines

“If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.” –Michael McIntosh, TNR Global

There are plenty of articles about the pros and cons of open source software vs. proprietary software. I sat down with our VP of Search Technologies Michael McIntosh to discuss the benefits of each in terms of search engines.

Karen: You’ve been working with proprietary search engines for some time now. Tell me your thoughts on that. What’s the upside of proprietary?

Michael: “I’ve worked with proprietary search engines for several years, specifically with FAST ESP since 2003, back when it was known as FAST Data Search (FDS). Proprietary software products often have better documentation, better support, and a more thought-out design, and they are more aggressively tested. Because the product supports an entire company, it must succeed. They also tend to have nicer tools and nicer interfaces.”

Karen: And the downsides?

Michael: “Over the years, we’ve run into a number of difficulties with proprietary search engines. One thing that comes up is that if a problem isn’t covered in the troubleshooting documentation, it can become incredibly difficult to diagnose and correct, and doing so is frustrating and time-intensive. The black box nature of the product is very limiting. If it’s not in the documentation, it might as well not exist at all.”

Karen: There are gaps in the user manual?

Michael: “Yes. But in defense of FAST ESP, the documentation has improved by leaps and bounds over the years. However, one anomaly we find is that the clean, easy-to-read PDF form of the user documentation (the original) for ESP is often not as up to date or helpful as the searchable online documentation, which is harder to read but usually more current and correct. Sometimes even the online documentation is wrong, which is also frustrating. But it has become something we cope with when it comes to FAST.”

Karen: Give me an example of what kind of problems you run into when integrating the proprietary search engine into a client’s website.

Michael: “The enterprise search platform uses Service Oriented Architecture (SOA). That is a bunch of different components that are able to communicate with each other as services. With this type of architecture, it doesn’t matter which language you write something in, as long as it’s a service another language can communicate with via RESTful interfaces, SOAP or something like XML-RPC. These services can all work together, despite the fact that you don’t have a unified API, and that’s actually awesome.”
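To make the language-independence point concrete, here is a minimal sketch of one service exposing a JSON-over-HTTP endpoint and a client calling it. It is not taken from FAST ESP: the `StatusHandler` class, the `/status` path and the response fields are all hypothetical, and both sides happen to be Python only to keep the example short. The client depends on nothing but HTTP and JSON, so the service behind the URL could be written in any language.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class StatusHandler(BaseHTTPRequestHandler):
    """Hypothetical service: answers every GET with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"service": "indexer", "status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the demo quiet; the default implementation logs to stderr.
        pass


def query_service(url):
    """Client side: only HTTP and JSON are assumed, not the server's language."""
    opener = build_opener(ProxyHandler({}))  # ignore system proxies for localhost
    with opener.open(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return query_service(f"http://127.0.0.1:{port}/status")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```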

Karen: Why?

Michael: “Indexers, for one. You generally don’t want them written in an interpreted language, for performance reasons. Indexing can be a CPU-intensive operation, which can be a weak spot for interpreted languages such as Python or Ruby compared to languages like C/C++ or Java. Indexing is both CPU- and disk-intensive. A scripting language can be great for an application that’s not CPU-intensive, because code speed doesn’t matter so much there: the true slowdown is something outside the script. You can optimize the script to run as fast as humanly possible, but you’re still limited because the disk can only rotate so fast. So your indexing service can be written in a low-level language like C, while some of the other services are written in languages like Python, Ruby or Java, and you still get good performance. BUT if you don’t have documentation for the compiled programs that make up the search engine product, you’re going to have a terrible time trying to figure out how to fix issues when they arise.”
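The trade-off Michael describes can be put into rough numbers with an Amdahl's-law style estimate. The sketch below is illustrative only: the function name and the sample fractions are assumptions, not measurements from any real indexer. It shows why a faster language helps a CPU-bound job far more than a job that spends most of its time waiting on the disk.

```python
def overall_speedup(io_fraction, code_speedup):
    """Estimate the whole-job speedup when only the CPU-bound part gets faster.

    io_fraction  -- share of run time spent waiting on disk or network (0..1)
    code_speedup -- how many times faster the code itself becomes
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if code_speedup <= 0:
        raise ValueError("code_speedup must be positive")
    cpu_fraction = 1.0 - io_fraction
    # The waiting time is unchanged; only the CPU share shrinks.
    return 1.0 / (io_fraction + cpu_fraction / code_speedup)


if __name__ == "__main__":
    # Mostly waiting on disk: 10x faster code barely helps.
    print(round(overall_speedup(0.8, 10), 2))  # 1.22
    # Mostly CPU-bound, as indexing can be: 10x faster code pays off.
    print(round(overall_speedup(0.1, 10), 2))  # 5.26
```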

Karen: So basically, because so many different languages are being used, you lack the source code for them, and the documentation is spotty, figuring out where a problem occurs becomes like finding a needle in a haystack.

Michael: “Yes. And this is not such a problem for anyone using a search engine for really basic applications. The place where you run into problems is if you push the search engine technology to its limits or use it in ways outside its typical usage, which is almost everything we do. We are always trying to get the best possible performance out of the search engine. We’re trying to get the search engine to handle features that we need but that it doesn’t natively support.”

Karen: What specifically are you pushing the search engine to do?

Michael: “One thing we want is for FAST ESP to have a feature for creating a faceted search on arbitrary fields. The way ESP works is that it has an index profile, which is a statically defined set of fields that it indexes. Inside its index profile you can mark certain fields to have navigators. One of our customers deals with product verticals. They have a whole bunch of products that aren’t unified, all with completely different attributes. We’ve managed to work around these roadblocks in ESP to create faceted navigation on arbitrary fields.”
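For readers unfamiliar with the idea, faceting on arbitrary fields means counting values for whatever attributes each document happens to carry, rather than for a fixed list declared up front. The sketch below illustrates only that concept. It is not how ESP navigators work and not the workaround TNR Global built; the `facet_counts` function, the `skip_fields` parameter and the sample products are all hypothetical.

```python
from collections import Counter, defaultdict


def facet_counts(documents, skip_fields=("id", "name")):
    """Count values per field across documents with differing attributes."""
    counts = defaultdict(Counter)
    for doc in documents:
        for field, value in doc.items():
            if field in skip_fields:
                continue
            # Treat multi-valued fields as several single values.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for item in values:
                counts[field][item] += 1
    return {field: dict(counter) for field, counter in counts.items()}


# Products from different verticals that share almost no attributes.
products = [
    {"id": 1, "name": "Trail shoe", "size": 10, "color": "red"},
    {"id": 2, "name": "Road shoe", "size": 10, "color": "blue"},
    {"id": 3, "name": "Laptop", "ram_gb": 16, "color": "blue"},
]

if __name__ == "__main__":
    print(facet_counts(products))
    # {'size': {10: 2}, 'color': {'red': 1, 'blue': 2}, 'ram_gb': {16: 1}}
```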

Karen: So you get creative to make it better.

Michael: “Yes, we constantly get creative to make it better, to use its strengths and to find ways to work around its limitations.”

“Another issue we have with ESP is that we have a number of websites we need crawled, and each website has metadata associated with it. Unfortunately, the way the ESP crawler works, there are not many straightforward ways to preserve the metadata associated with the seed URL we use to crawl a website and pass it along to any associated links. We can’t do this easily inside the ESP crawler. Since it’s proprietary and a black box, we can’t look at the crawler’s source code, and we can’t modify it. When it does something mysterious, we can have no idea why it is behaving in an unwanted way.

“We had one instance when the crawler was temporarily blacklisting a number of websites for some reason. When the ESP crawler automatically blacklists a site, it stops crawling the site for 30 minutes and then begins again. We learned that one thing that triggers blacklisting is a website returning HTTP 503 errors. If a site has more than 20 or so of those errors, the crawler temporarily blacklists the site. The problem was that the documentation is too sparse on details for that topic. When we ran into that problem, it was really difficult to know what was going on so we could properly explain the issue to the client and address the problem. Conversely, if we had an open source search engine, I could have just searched the source code and sped up the diagnosis of the problem.”
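The blacklisting behavior described above can be modeled in a few lines. This is a sketch of the behavior as Michael recounts it, not ESP's actual code, which is closed source. The class name, the method names and the exact threshold are assumptions: the interview only says "more than 20 or so" 503 errors and a 30-minute pause, and it does not say whether the error count resets afterwards.

```python
class TemporaryBlacklist:
    """Hypothetical model: pause a site after too many HTTP 503 responses."""

    def __init__(self, error_threshold=20, cooldown_seconds=30 * 60):
        self.error_threshold = error_threshold    # assumed; "20 or so" in the interview
        self.cooldown_seconds = cooldown_seconds  # 30 minutes, per the interview
        self._error_counts = {}
        self._blocked_until = {}

    def record_response(self, site, status_code, now):
        """Register one crawl response; `now` is a timestamp in seconds."""
        if status_code != 503:
            return
        self._error_counts[site] = self._error_counts.get(site, 0) + 1
        if self._error_counts[site] > self.error_threshold:
            self._blocked_until[site] = now + self.cooldown_seconds
            self._error_counts[site] = 0  # assumption: counting starts over

    def is_blocked(self, site, now):
        """True while the site is still inside its cooldown window."""
        return now < self._blocked_until.get(site, 0.0)


if __name__ == "__main__":
    blacklist = TemporaryBlacklist()
    for _ in range(21):
        blacklist.record_response("example.com", 503, now=0.0)
    print(blacklist.is_blocked("example.com", now=60.0))    # True
    print(blacklist.is_blocked("example.com", now=1800.0))  # False
```

With source code like this in hand, a search for "503" would surface the rule in seconds, which is the kind of diagnosis Michael says the black box made difficult.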

Karen: So from a business perspective, using open source allows you to invest in a technology that gives you the power to modify the code to better meet your business needs.

Michael: “It certainly can. It can shorten development time and speed up diagnosis when issues pop up. And issues always do pop up. If something is not working very well, we can look at the problem with a much higher degree of granularity.

“If it’s a simple problem, we never contact support. We only contact support when we’re stumped, and we’re not easily stumped. Usually, they can’t answer the question immediately, because if we’re asking for help, the problem is complex. Our ticket is escalated, and eventually we talk to someone who can help us. But it does take time. Even if the support staff is top notch, dealing with that takes time, and that costs us and our clients. We have a highly customized ESP installation for one of our clients, and it always takes an enormous amount of time to explain over and over how we have our systems set up and how the different parts work. It’s a big pain to go through that every time I run across a problem. If it were open source, I could simply look at the source code and solve the problem.”

Karen: Let’s talk in more detail about open source search engines. Upsides?

Michael: “If you choose a popular open source search engine solution like Lucene/Solr, you have an active, passionate community behind that solution. There are several developers looking at that engine, working on it, and actively posting in publicly available forums. You can often get your questions answered there by top-notch experts in search technology. You can potentially talk to the original coders and creators of the product, and they are often happy to help you. I’ve seen people post a Solr question on their Twitter feed and, within 7 seconds, the creator has responded with a link to a forum explaining the solution.”

Karen: Wow, that’s amazing. Other advantages?

Michael: “It’s free. That’s attractive to most companies. The downside is that the formal documentation usually isn’t as good as it is for proprietary products, and there isn’t a dedicated support team for the product. But if you have some savvy software developers on your team, the open source community is robust and willing to share information about the product. And having access to the source code is extremely valuable.”

Karen: So in your opinion, what’s the bottom line on Open Source Search Engines vs. Proprietary Search Engines?

Michael: “If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.”

If you’d like more information on the pros and cons of Open Source Search Engines vs. Proprietary Search Engines that are specific to your business or organization’s needs, feel free to contact us for a free consultation.
