Continuous Integration for Large Search Solutions

Managing large projects takes a smart approach and some intuitive thinking. One project we are currently engaged in is with a large publisher of manufacturing parts. This has been an extraordinary project due to its scale and ever-changing scope. I spoke with our VP of Enterprise Search Technologies, Michael McIntosh, about how TNR Global handles complex projects.

Karen: This project is a big one. Tell me more about the site’s function. What is the focus?

Michael: Product search is the focus. The site contains tens of millions of documents, both structured and unstructured content. They also have a huge amount of data provided by the advertisers and the companies themselves on products that they sell. One of the advantages we have over a search engine like Google is access to a vast amount of proprietary data provided by the vendors themselves.

Karen: Tell me about how you are managing the project.  What are some of the variables you work with?

Michael: With this particular project, we are dealing with many different data feeds. There are many different intermediary metadata stages we have to generate to support the final searchable content. The client also changes their business logic frequently enough that if it takes a month or more between data builds, it’s likely something has changed. For instance, they might have changed an XML format or added an attribute to an element in the data feed that will break something else down the line. The problem is there are so many moving parts, it’s almost impossible to do it manually and always do it correctly.
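A feed change of the kind Michael describes, such as a new attribute on an element, can be caught before it breaks a later stage by checking each feed against what the build expects. The following is only a minimal sketch: the `product` element, its attributes, and its child elements are hypothetical, since the interview does not describe the real feed layouts.

```python
import xml.etree.ElementTree as ET

# Hypothetical expectations for one feed; the real feed layouts are not
# described in the interview.
EXPECTED_ATTRS = {"id", "category"}
EXPECTED_CHILDREN = {"name", "vendor"}


def check_feed(xml_text):
    """Return a list of human-readable problems found in the feed."""
    problems = []
    root = ET.fromstring(xml_text)
    for index, product in enumerate(root.iter("product")):
        attrs = set(product.attrib)
        children = {child.tag for child in product}
        for missing in sorted(EXPECTED_ATTRS - attrs):
            problems.append(f"product {index}: missing attribute '{missing}'")
        for extra in sorted(attrs - EXPECTED_ATTRS):
            problems.append(f"product {index}: unexpected attribute '{extra}'")
        for missing in sorted(EXPECTED_CHILDREN - children):
            problems.append(f"product {index}: missing element '{missing}'")
    return problems


# Made-up sample: the second product gained an attribute and lost an element.
SAMPLE = """<products>
  <product id="1" category="valves"><name>Ball valve</name><vendor>Acme</vendor></product>
  <product id="2" category="pumps" region="NE"><name>Gear pump</name></product>
</products>"""

if __name__ == "__main__":
    for problem in check_feed(SAMPLE):
        print(problem)
```

Run at the start of every data build, a check like this turns a silent format change into an explicit report.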

Karen: What other kinds of business logic changes are you dealing with on top of the massive amounts of raw data?

Michael: Most of the business logic changes are when they need to modify how something behaves based on new data that’s available, or when they need to start treating the data in a different way.  Sometimes there is a change in the way they want the overall system to behave. They sometimes have some classification rules for content they like to tweak occasionally.

Another thing we consider is the client’s relevancy scoring and query pre-processing rules. So you need to consider if you issue a query and it fails, what happens then? What kind of fallback query do you use?  All these things are part of the business logic that is independent of the raw data. In summary, we have the raw data and we can do a number of things with it. They often want us to change exactly what we’re doing with it, how we’re conditioning it, and how we’re transforming it. We either tweak what exists or take advantage of new data that they’ve started including in their data feeds. The challenge is all these elements can change frequently.
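The fallback-query question can be illustrated with a small sketch. It assumes a simple rule, which is not necessarily the client's: try a strict query first, then a looser one. The `run_query` callable and the `fake_engine` backend are hypothetical stand-ins for a real search client.

```python
def search_with_fallback(terms, run_query):
    """Try progressively looser queries until one returns results.

    `run_query` is any callable that takes a query string and returns a
    list of hits; it stands in for the real search engine client.
    """
    candidates = [
        " AND ".join(terms),  # strict: every term must match
        " OR ".join(terms),   # relaxed: any term may match
    ]
    for query in candidates:
        hits = run_query(query)
        if hits:
            return query, hits
    return None, []


def fake_engine(query):
    # Toy backend for the demo: only the OR form finds anything.
    return ["doc-42"] if " OR " in query else []


if __name__ == "__main__":
    print(search_with_fallback(["brass", "valve"], fake_engine))
```

Keeping the fallback order in one list means a business-logic change, such as adding a third and even looser query, is a one-line edit.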

Karen: This site is more of a portal than strictly an enterprise search project, isn’t it?

Michael: Yes. Enterprise search usually refers to searching for documents within an organization. This client is a public-facing search engine that allows the public to perform product search across a very large number of vendors and service providers.

Changes come from their advertisers and the data they provide. Advertisers come and go. People pay for placement within certain industrial categories. It’s not like we get a static list of sites to crawl and that’s that; the list changes weekly, sometimes daily. Things also need to be purged from the index. Say an advertiser’s contract ends and suddenly we need to stop crawling a site with thousands of documents; that data needs to be purged from the index promptly. Not only do we have to crawl new sites, we have to purge old ones as well. This project is so massive that it’s not cut and dried. A lot of software development projects focus on a clear-cut problem: come up with a plan, tackle it, release it, and then maintain it. We’re constantly getting new information and learning new things about the people hitting the site.
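The crawl-and-purge bookkeeping described here comes down to comparing the previous list of advertiser sites with the current one. A minimal sketch, with made-up site names and no assumptions about how the real index is updated:

```python
def plan_crawl_changes(previous_sites, current_sites):
    """Work out which sites to start crawling and which to purge."""
    previous, current = set(previous_sites), set(current_sites)
    return {
        "start_crawling": sorted(current - previous),
        "purge_from_index": sorted(previous - current),
    }


if __name__ == "__main__":
    last_week = ["acme.example", "bolts.example", "gears.example"]
    this_week = ["acme.example", "gears.example", "pumps.example"]
    print(plan_crawl_changes(last_week, this_week))
```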

Karen: So this sounds like this project is always in a state of ongoing development.

Michael: We are building something that’s never been built before. One of the goals is to make this site remarkable. And we’re very excited to be a part of that. The scale of the project is quite big though, which is why we started using Continuous Integration.

The way our cycles work is that we perform big data updates, but by using CI we can continuously update and integrate new data. We’re moving to a place where, by practicing CI, we can perform daily builds, which gives us the time we need to fix problems before the data absolutely needs to be live.

Karen: How do you implement CI into your day to day management of the project?

Michael: There are some pretty great open source tools that we’re using to implement CI. We use Jenkins to help us do Continuous Integration for frequent data builds, which is an intensive process for this particular client.

We field questions from the client about the status of different data builds. We hope to use Jenkins in conjunction with other tools to build data automatically and to have event-based data builds. We’re looking at a way to have a build triggered by some other event and have Jenkins automatically generate reports as the data is being built. Each time we run a build script, if the output differs from the previous build, Jenkins makes it easy to see that something is different. You can format your output in a way that Jenkins can understand. One of the cool things about Jenkins is that it has graphs that illustrate differences, which helps us identify potential problems and fix them before we need to go live with the data.
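The idea of comparing one data build against the previous one can be sketched independently of Jenkins. The example below assumes each build writes a summary of document counts per feed; the feed names, the counts, and the 10% tolerance are all hypothetical, and this is not the build script the team actually uses.

```python
def compare_builds(previous, current, tolerance=0.10):
    """Return warnings for counts that moved more than `tolerance` (a fraction)."""
    warnings = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old is None:
            warnings.append(f"{name}: new in this build ({new})")
        elif new is None:
            warnings.append(f"{name}: missing from this build (was {old})")
        else:
            # Guard against division by zero when the old count was 0.
            change = abs(new - old) / max(old, 1)
            if change > tolerance:
                warnings.append(f"{name}: {old} -> {new}")
    return warnings


if __name__ == "__main__":
    previous = {"catalog": 1000, "vendors": 200, "ads": 50}
    current = {"catalog": 1020, "vendors": 120, "reviews": 10}
    for warning in compare_builds(previous, current):
        print(warning)
```

A build step could exit with a non-zero status when the list of warnings is not empty, so that the CI server marks the build as failed and someone looks at it before the data goes live.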

Karen: Any other tools?

Michael: For multi-node search clusters, we’re using a tool called fabric3 that uses SSH to copy data and execute scripts across multiple nodes of a cluster based upon roles. We have a clever set up where we’re able to inform fabric3 what services are running on each node in our cluster and have actions or commands linked to certain tasks, like building metadata.  By linking them, they automatically know which nodes to deploy data to.

Using open source tools like Jenkins and fabric3 makes the project a lot more manageable, considering the large number of moving parts. It has allowed us to be successful in building this incredible site and making the search function relevant, accurate and up to date.
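The role-based setup Michael describes can be pictured as two lookup tables: one from roles to nodes and one from tasks to roles. The sketch below uses plain Python rather than fabric3 itself, and every role, node, and task name in it is made up for illustration.

```python
# Hypothetical cluster layout: which nodes run which service.
ROLES = {
    "indexer": ["node1", "node2"],
    "query": ["node3"],
    "metadata": ["node2", "node4"],
}

# Hypothetical link between tasks and the roles they apply to.
TASK_ROLES = {
    "build_metadata": ["metadata"],
    "deploy_index": ["indexer", "query"],
}


def hosts_for(task):
    """Return the nodes a task should run on, in order and without repeats."""
    hosts = []
    for role in TASK_ROLES[task]:
        for host in ROLES[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


if __name__ == "__main__":
    for task in TASK_ROLES:
        print(task, "->", hosts_for(task))
```

With the mapping in one place, adding a node to a role is enough for every task linked to that role to reach it.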

Migration from Microsoft FAST to Apache Lucene Solr

Is your company using Microsoft FAST ESP on a Linux platform? Unfortunately, Microsoft announced in 2010 that it will cease technical support for FAST ESP 5.3 after its 5-year life cycle for anyone using Linux as their operating system. Migration to another search platform will be a priority, and business leaders and technology professionals are looking closely at Lucene Solr as a solution.

We can assist your organization in any stage of a migration. We can perform an evaluation of your current architecture, draft a plan for migration, work with your internal team on the migration or just consult as needed. Whatever your specific needs are, we can help you achieve your goals. Read our White Paper released February 2012 that presents a Case Study on migration. The paper discusses:

  • Loading millions of documents into Solr indexes
  • Evaluation and recommendations for tools to bridge the features gap
  • Migrating custom pipeline code to Pypes with minimal changes
  • Proven ROI after a complete migration

Additionally, we have presented on the subject of FAST ESP to Lucene Solr migrations at the Lucene Revolution Conference in Boston, MA (2010 slides: Migration from FAST ESP to Lucene Solr (PDF) (pdf:4,067,091)) and at the Apache Lucene Eurocon in Barcelona (October 2011). Watch the presentation on FAST to Lucene Solr migration by Michael McIntosh, our VP of Enterprise Search Technologies, below. If you like what you see, contact us to explore a Solr migration solution.


                                                                                                                     

Slide presentation 

From Microsoft FAST to Lucene/Solr – Barcelona

TNR Global presented at the Apache Lucene Eurocon in Barcelona, Spain. Michael McIntosh, VP of Enterprise Search Technologies, spoke on the migration from Microsoft FAST ESP to Lucene/Solr open source search.

View the presentation: Migration from Microsoft FAST to Apache Lucene Solr.

Our White Paper on Microsoft FAST ESP to Lucene/Solr will be released in January, 2012.  To receive your free White Paper, email contact information to fast2solr@tnrglobal.com or subscribe here to receive the White Paper and our newsletter on FAST to Lucene Solr Migration.

Solr in the Cloud – SolrHQ

Use the power of Solr search for your website with SolrHQ, a hosted Solr search solution that is easy to set up and scales with your growth. We offer plans to meet the needs of organizations of any size. SolrHQ runs in the cloud for 24/7 uptime reliability. We offer free tech support and free setup when you sign up for SolrHQ.

This is a cloud-based service that can be used for searching an online site or a private (enterprise-level) network. The service can handle anywhere from a few thousand documents or pages into the millions. SolrHQ serves the widely used content management systems WordPress and Joomla! Our easy-to-deploy plug-in for your CMS can bring truly powerful search to your site, allowing your users to find and discover the content you’ve worked hard to create.

To get started, sign up at SolrHQ for a free account. The technology is free to anyone who opens an account. If you need assistance with setup or installation, contact us. A member of our team will schedule a time to review the service, discuss the short setup process, and schedule the SolrHQ installation for your site. Assisted setup is available for a small fee that depends on the size of your data set. Contact us for more information.

Legal: Lucene is an open source search engine project of the Apache Software Foundation. Per ASF: “Apache Lucene is a high-performance, full-featured text search engine software library written entirely in Java.” Solr is a subproject of the Lucene project that created a fully featured search engine. Per ASF: “Apache Solr is a software product that provides an enterprise search server and services based on Apache Lucene.” Lucene and Solr are trademarks of the Apache Software Foundation.

 

 

IBM OmniFind

IBM OmniFind Enterprise Edition powers secure intranets, corporate public Web sites,  and information extraction applications.  The IBM OmniFind family of products includes IBM OmniFind Analytics Edition, Discovery Edition, Enterprise Edition, Enterprise Starter Edition, and Yahoo! Edition. IBM OmniFind Yahoo! Edition is a free (no licensing fees) entry-level enterprise search solution that provides a low cost way to start implementing enterprise search in your organization.

TNR developed an enterprise search component for the open source Joomla! content management system. The TNR ESearch Component integrates IBM OmniFind Yahoo! Edition with Joomla! based websites. The component is designed to provide enterprise search for an intranet or extranet, as well as corporate information to clients visiting a public website. In addition to providing this component under the GPL license to the larger community, TNR has implemented ESearch at a number of public websites, including the website of Arbor Networks, a global provider of solutions for network security and visibility.
See an example of the IBM OmniFind Yahoo! Edition search implementation by TNR Global at Arbor Networks.

Lucene Solr

Solr is an open source, high-performance, scalable, cross-platform search engine. Solr has features which were once only found in commercial offerings.

We work with Lucene Solr, a Java-based open source search library, to power search for web sites.

  • Automatic replication for large installations with distributed search
  • Java API (SolrJ)
  • Conversion of Office documents
  • Full faceted search
  • Advanced tokenization, highlighting and stemming

 

With open source solutions, there are no licensing fees, and the technology can be customized to meet your specific needs. Solr powers search for large commercial organizations including sites at eBay, MTV Networks, Netflix, CNET, and Zappos.

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients.

Years of hands-on experience give us an advantage; we understand the subtle nuances of the data.

Lucene Solr Services

Lucene Solr is a powerful open source, scalable, cross-platform search engine. Solr has high-performance features comparable to those of proprietary search engines, such as faceted search, full text search, rich document handling and dynamic clustering. TNR Global is a regular presenter at Lucene Solr conferences worldwide and an active member of the Solr open source community. Since Solr is open source technology, the source code is free. Contact us to implement and integrate this robust search engine into your organization.

TNR Global offers Lucene Solr consulting and integration services for:

  • Software or SaaS System developers
  • IT/MIS system administrators
  • Corporate data administrators
  • Current Linux based FAST ESP users
  • Marketing departments of content intensive web sites

Our Services with Lucene Solr:

We integrate commercial-grade Lucene Solr products through our partners at Lucid Imagination. We also develop tailor-made solutions using Lucene Solr for the following:

  • Crawling web resources: pages and documents, forums, blogs
  • Content processing and conversion
  • Content enhancement and extraction
  • PDF search by page
  • Alternative search for SharePoint, email
  • Search and database integration
  • Audits and Upgrades to your current Solr installation

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients. Years of hands-on experience give us an advantage; we understand the issues and the subtle nuances of data.  Contact us for a free consultation.

Microsoft FAST ESP

TNR Global employs Microsoft FAST ESP (Enterprise Search Platform).  Our implementations support custom and standard formats such as text, HTML, XML, and PDF. We have configured, mirrored, scaled and maintained the ESP system in a rigorous production environment both for Linux and SharePoint.

At TNR Global, we implement and customize the Microsoft FAST Search solution to empower our customers to reach their business goals.

Our expertise with FAST ESP:

  • Configuring, mirroring and scaling Microsoft FAST ESP systems using various architectural layouts
  • Custom document tagging pipeline stage development for associating database content with web content based upon document URL
  • Custom dependency-based content build and feeding systems
  • Access to low-level undocumented ESP XML-RPC APIs for better integration
  • ESP benchmarking and performance tuning
  • FAST Index XML repartitioning tools for content volume scaling
  • Proven ESP content backup & recovery techniques
  • Handling of extensive or unplanned system changes without impacting service availability
  • Ajax / Web 2.0 ESP-Suggest functionality integration that uses actual ESP query logs
  • Seamless handling of hardware failure through service mirroring and failover modes
  • Expertise with low-level search engine architecture and search/relevancy algorithms

See an example of the Microsoft FAST ESP Search implementation by TNR Global and CMG at ThomasNet.

TNR has worked with the FAST ESP product since 2004, from version FAST Data Search (FDS) 3.2 up through version FAST ESP 5.3. In 2008, FAST Search and Transfer was acquired by Microsoft. It is Microsoft’s plan to use powerful FAST-style technology for their public search engine, Bing. FAST’s flexible and scalable enterprise search platform elevates the search capabilities of enterprise customers and connects people to the relevant information they seek regardless of medium. This drives revenues and reduces total cost of ownership by effectively leveraging IT infrastructure. FAST ESP is known for its scalability, relevancy, and reliability. More than 2,600 customers worldwide use FAST solutions. Contact us for a free consultation.

Google Search Appliance

Google Search Appliance (GSA) is Google’s answer to search for the enterprise. Literally an “out-of-the-box” solution, the GSA is a piece of rack-mounted hardware designed for easy installation and deployment. The GSA provides fast, relevant search for your intranet or website. It lives on-premise and provides your organization with a high-relevancy, customizable, scalable search solution.

TNR Global can provide an appliance solution comparable to GSA with similar or superior capabilities. Contact us to assess your needs and schedule a deployment with a packaged appliance solution.

Enterprise Search White Papers and Presentations

Below are links to our mini white papers, addressing questions about enterprise search and more.

  • Enterprise Search Basics (pdf:157,797)
  • Enterprise Search and Government (pdf:112,083)
  • Enterprise Search for Law Firms (pdf:110,350)
  • Enterprise Search and E-Discovery (pdf:108,855)

Please contact us for additional information.

Migration from FAST ESP to Lucene Solr, by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global

Video from the Lucene Revolution conference in Boston, MA.

Migration from FAST ESP to Lucene Solr (PDF) (pdf:4,067,091), presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global at the Lucene Revolution Conference in Boston, MA.

There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and discontinuation of the Linux platform has created an urgency for FAST users.

This presentation compares Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases are presented describing how to map the various functions between systems.