Migration Still Looms Large on the Horizon for FAST ESP Customers

“Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation.” –Michael McIntosh, VP of Search Technologies, TNR Global, LLC

Microsoft acquired FAST back in 2008 and then, in early 2010, disclosed its plans to stop updating the FAST product on the Linux operating system after 2010, making FAST ESP 5.3 the last release Linux users will ever see of the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner rather than later.

Buzz about migration ensued; for many companies, especially ones with huge amounts of data, it is inevitable. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the rest of the industry, believed that Microsoft’s move was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted, “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming,” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP, and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, for the FAST ESP technical team with the timing of renewing a license and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy and HP’s shift from hardware to software could signal recognition from Goliath HP of the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs,” says McIntosh. If the need for migration is imminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in-house talent, and growth expectations.

How is your company preparing for the discontinuation of support of FAST ESP?  Need guidance?  Contact us for pointers, analysis, or architecture for a full migration.

Open Source Search Engines vs. Proprietary Search Engines

“If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.” –Michael McIntosh, TNR Global

There are plenty of articles about the pros and cons of open source software vs. proprietary software. I sat down with our VP of Search Technologies Michael McIntosh to discuss the benefits of each in terms of search engines.

Karen: You’ve been working with proprietary search engines for some time now. Tell me your thoughts on that. What’s the upside of proprietary?

Michael: I’ve worked with proprietary search engines for several years, specifically with FAST ESP since 2003, back when it was known as FAST Data Search (FDS). Proprietary software products often have better documentation, better support, more thought-out design, and more aggressive testing, because the product supports an entire company—it must succeed. They have nicer tools and nicer interfaces.

Karen: And the downsides?

Michael: “Over the years, we’ve run into a number of difficulties with proprietary search engines. One thing that comes up is that if a problem isn’t outlined in the user troubleshooting documentation, it can become incredibly difficult to diagnose and correct, and doing so is a frustrating, time-intensive process. The black box nature of the product is very limiting. If it’s not in the documentation, it might as well not exist at all.”

Karen: There are gaps in the user manual?

Michael: “Yes. But in defense of FAST ESP, the documentation has improved by leaps and bounds over the years. However, one anomaly we find is that the clean, easy-to-read PDF form of the user documentation (the original) for ESP is often not as up-to-date or helpful as the searchable online documentation—which is harder to read, but usually more current and correct. Sometimes even the online documentation is wrong—which is also frustrating. But it has become something we have learned to cope with in regard to FAST.”

Karen: Give me an example of what kind of problems you run into when integrating the proprietary search engine into a client’s website.

Michael: “The enterprise search platform uses Service Oriented Architecture (SOA); that is, a bunch of different components that are able to communicate with each other as services. With this type of architecture, it doesn’t matter which language you write something in as long as it’s a service another language can communicate with via RESTful interfaces, SOAP, or something like XML-RPC. These services can all work together despite the fact that you don’t have a unified API—and that’s actually awesome.

Karen: Why?

Michael: “Indexers, for one—you generally don’t want them written in an interpreted language for performance reasons. Indexing can be a CPU-intensive operation, which can be a weak spot for interpreted languages such as Python or Ruby compared to languages like C/C++ or Java. It is both a CPU- and disk-intensive process. A scripting language can be great if you’re writing an application that’s not CPU intensive, because code speed doesn’t matter so much; the true slowdown is something outside the script. You can optimize the speed of the script to make it run as fast as humanly possible, but you’re limited because the disk can only rotate so fast. So your indexing service can be written in a low-level language like C, with some of the other services written in languages like Python or Ruby or Java, and you still get good performance. But if you don’t have documentation for the compiled programs that make up the search engine product, you’re going to have a terrible time trying to figure out how to fix issues when they arise.

Karen: “So basically, because you have so many different languages being used that you lack the source code for, and documentation is spotty, it becomes a needle in a haystack trying to figure out where the problem occurs.”

Michael: “Yes. And this is not such a problem for anyone using a search engine for really basic applications. The place where you run into problems is if you push the search engine technology to its limits or you are using it in ways outside its typical usage, which is almost everything we do. We are always trying to get the best possible performance out of the search engine. We’re trying to get the search engine to deal with features we need but it doesn’t natively support.

Karen: “What is it you’re pushing specifically for the search engine to do?”

Michael: “One thing we want FAST ESP to have is a feature for creating faceted search on arbitrary fields. The way ESP works is that it has an index profile, which is a statically defined set of fields that it indexes. Inside the index profile you can mark certain fields to have navigators. One of our customers deals with product verticals. They have a whole bunch of products that aren’t unified—all with completely different attributes. We’ve managed to work around these roadblocks in ESP to create faceted navigation on arbitrary fields.

Karen: “So you get creative to make it better.”

Michael: “Yes, we constantly get creative to make it better, to use its strengths, and to find ways to work around its limitations.”

“Another issue we have with ESP is that we have a number of websites we need crawled, and each website has metadata associated with it. Unfortunately, the way the ESP crawler works, there are not many straightforward ways to preserve metadata associated with the seed URL we use to crawl a website and pass that meta information along to any associated links. We can’t do this easily inside the ESP crawler. Since it’s proprietary and a black box, we can’t look at the source code to the crawler and can’t modify it. When it does something mysterious, we can have no idea why it is behaving in an unwanted way. We had one instance where the crawler was temporarily blacklisting a number of our websites for some reason. When the ESP crawler automatically blacklists a site, it stops crawling the site for 30 minutes and then begins again. We learned that one thing that triggers blacklisting is a website returning HTTP 503 errors; if a site has more than 20 or so of those errors, the crawler temporarily blacklists it. The problem was that the documentation is too sparse on details for that topic. When we ran into that problem, it was really difficult to know what was going on so we could properly explain the issue to the client and address the problem. By contrast, if we had an open source search engine, I could have just searched the source code and sped up the diagnosis of the problem.”

Karen: So from a business perspective—using open source allows you to invest in a technology that gives you the power to modify the code to better meet your business needs.

Michael: “It certainly can. It can accelerate development time and the speed of diagnosing problems when issues pop up. And issues always do pop up. If something is not working very well, we can look at the problem with a much higher degree of granularity.

If it’s a simple problem, we never contact support. We only contact support when we’re stumped, and we’re not easily stumped. Usually, they can’t answer the question immediately, because if we’re asking for help, the problem is complex. Our ticket is escalated, and eventually we talk to someone who can help us. But it does take time. Even if a support staff is top notch, there is the time it costs to deal with that, and that costs us and our clients time. We have a highly customized ESP installation for one of our clients, and it always takes an enormous amount of time to explain over and over how we have our systems set up and how the different parts work; it’s a big pain to go through that every time I run across a problem. If it were open source, I could simply look at the source code and solve the problem.”

Karen: Let’s talk in more detail about open source search engines. Upsides?

Michael: “If you choose a popular open source search engine solution like Lucene Solr, you have an active, passionate community behind that solution. There are several developers looking at that engine, working on it, and actively posting in publicly available forums. You can often get your questions answered there by top-notch experts in search technology. You can potentially talk to the original coders and creators of the product—and they are often happy to help you. I’ve seen people post a Solr question on their Twitter feed and within 7 seconds the creator has responded with a link to a forum explaining the solution.

Karen: Wow, that’s amazing. Other advantages?

Michael: “It’s free. That’s attractive to most companies. The downside is that the formal documentation isn’t usually as good as proprietary documentation, and there isn’t a dedicated support team for the product. But if you have some savvy software developers on your team, the open source community is robust and willing to share information about the product. And having access to the source code is extremely valuable.”

Karen: So in your opinion, what’s the bottom line on Open Source Search Engines vs. Proprietary Search Engines?

Michael: “If you are a cutting edge company, you will be severely limited by a proprietary search engine as a solution. The more open the technology, the more able we are to refine it to meet our client’s needs.”

If you’d like more information on the pros and cons of Open Source Search Engines vs. Proprietary Search Engines that are specific to your business or organization’s needs, feel free to contact us for a free consultation.

Setting JAVA_HOME on OS X Lion

I recently upgraded to OS X Lion (10.7) and found that my JAVA_HOME was no longer set correctly. I found this out when my command line ec2 tools failed.

If you currently have your JAVA_HOME set to something like “/Library/Java/Home”,
under OSX Lion, you’ll want to change that to $(/usr/libexec/java_home), thus:

export JAVA_HOME=$(/usr/libexec/java_home)

The script /usr/libexec/java_home outputs the true location of JAVA_HOME, so unless that script goes away, this should prove upgrade safe.

I found this at http://steveswinsburg.wordpress.com/2011/07/22/java_home-on-os-x-lion/

Continuous Integration for Large Search Solutions

Managing large projects takes a smart approach and some intuitive thinking. One project we are currently engaged in is with a large publisher of manufacturing parts. This has been an extraordinary project due to its scale and ever-changing scope. I spoke with our VP of Enterprise Search Technologies, Michael McIntosh, about how TNR Global handles complex projects.

Karen: This project is a big one. Tell me more about the site’s function. What is the focus?

Michael: Product search is the focus. The site contains tens of millions of documents, both structured and unstructured content. They also have a huge amount of data provided by the advertisers and the companies themselves on the products that they sell. One of the advantages we have over a search engine like Google is access to a vast amount of proprietary data provided by the vendors themselves.

Karen: Tell me about how you are managing the project.  What are some of the variables you work with?

Michael: With this particular project, we are dealing with many different data feeds. There are many different intermediary metadata stages we have to generate to support the final searchable content. The client also changes their business logic frequently enough that if it takes a month or more between data builds, it’s likely something has changed. For instance, they might have changed an XML format or added an attribute to an element in the data feed that will break something else down the line. The problem is there are so many moving parts that it’s almost impossible to do it all manually and always do it correctly.

Karen: What other kinds of business logic changes are you dealing with on top of the massive amounts of raw data?

Michael: Most of the business logic changes are when they need to modify how something behaves based on new data that’s available, or when they need to start treating the data in a different way.  Sometimes there is a change in the way they want the overall system to behave. They sometimes have some classification rules for content they like to tweak occasionally.

Another thing we consider is the client’s relevancy scoring and query pre-processing rules. So you need to consider if you issue a query and it fails, what happens then? What kind of fallback query do you use?  All these things are part of the business logic that is independent of the raw data. In summary, we have the raw data and we can do a number of things with it. They often want us to change exactly what we’re doing with it, how we’re conditioning it, and how we’re transforming it. We either tweak what exists or take advantage of new data that they’ve started including in their data feeds. The challenge is all these elements can change frequently.
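
As a simple illustration of what a query fallback rule can look like (a sketch only; the strictness rule and the use of solrpy here are assumptions for illustration, not the client’s actual business logic):

import solr

def search_with_fallback(conn, user_query):
    terms = user_query.split()
    # First pass: a hypothetical strict rule requiring every term to match
    response = conn.query(' AND '.join(terms))
    if len(response.results) == 0:
        # Fallback query: relax to an OR across the same terms
        response = conn.query(' OR '.join(terms))
    return response

conn = solr.SolrConnection('http://localhost:8983/solr')  # hypothetical instance
results = search_with_fallback(conn, 'stainless steel fasteners')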

Karen: This site is more of a portal than strictly an enterprise search project, isn’t it?

Michael: Yes. Enterprise search usually refers to searching for documents within an organization. This client is a public facing search engine that allows the public to perform product search across a very large number of vendors and service providers.

Changes come from their advertisers and the data they provide. Advertisers come and go. People pay for placement within certain industrial categories. It’s not like we get a static list of sites to crawl and that’s that; the list of sites we crawl changes weekly, sometimes daily. Things also need to be purged from the index. Say an advertiser’s contract ends and suddenly we need to stop crawling a site with thousands of documents; that data needs to be purged from the index promptly. So not only do we have to crawl new sites but purge old ones as well. This is a project that is so massive that it’s not cut and dried. A lot of software development projects focus on a clear-cut problem, come up with a plan, tackle it, release it, and then maintain it. We’re constantly getting new information and learning new things about the people hitting the site.

Karen: So it sounds like this project is always in a state of ongoing development.

Michael: We are building something that’s never been built before. One of the goals is to make this site remarkable. And we’re very excited to be a part of that. The scale of the project is quite big though, which is why we started using Continuous Integration.

The way our cycles work is we perform big data updates, but by using CI, we can continuously update and integrate new data. We’re moving to a place, by using the practice of CI, where we can perform daily builds, which gives us the time we need to fix problems before the data absolutely needs to be live.

Karen: How do you implement CI into your day to day management of the project?

Michael: There are some pretty great open source tools that we’re using to implement CI. We use Jenkins to help us do Continuous Integration for frequent data builds, which is an intensive process for this particular client.

We field questions from the client about the status of different data builds. We hope to use Jenkins in conjunction with other tools to automatically build data and have event-based data builds. We’re looking at a way to have builds triggered by some other event and have Jenkins automatically generate reports as the data is being built. Each time we run a build script, if the output differs from the previous build, Jenkins makes it easy to see that something is different. There is a way to format your output so that Jenkins can understand it. One of the cool things about Jenkins is that it has graphs that illustrate differences, which helps us identify issues that could pose a potential problem and fix them before we need to go live with the data.

Karen: Any other tools?

Michael: For multi-node search clusters, we’re using a tool called fabric3 that uses SSH to copy data and execute scripts across multiple nodes of a cluster based upon roles. We have a clever setup where we’re able to inform fabric3 what services are running on each node in our cluster and have actions or commands linked to certain tasks, like building metadata. By linking them, fabric3 automatically knows which nodes to deploy data to.
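
To give a feel for the approach, here is a minimal sketch of role-based deployment with fabric3 (which keeps the Fabric 1.x API). The host names, role names, paths, and script names below are hypothetical placeholders, not our actual cluster configuration:

# fabfile.py
from fabric.api import env, roles, run, put, task

# Map cluster roles to nodes (host names are hypothetical)
env.roledefs = {
    'metadata': ['meta1.example.com'],
    'search':   ['search1.example.com', 'search2.example.com'],
}

@task
@roles('metadata')
def build_metadata():
    # Run the metadata build only on the metadata node(s)
    run('/opt/build/bin/build_metadata.sh')

@task
@roles('search')
def deploy_index_data():
    # Copy freshly built index data to every search node and unpack it
    put('build/output/index-data.tar.gz', '/tmp/index-data.tar.gz')
    run('tar -xzf /tmp/index-data.tar.gz -C /data/search')

Running something like fab build_metadata deploy_index_data then executes each task only on the nodes that hold the matching role.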

Using open source tools like Jenkins and fabric3 makes it a lot more manageable considering the large number of moving parts. It’s allowed us to be successful in building this incredible site and making the search function relevant, accurate, and up to date.

Cloud Platforms: The Promise vs. The Reality

Recently our VP of Search, Michael McIntosh sat down and talked to me about his thoughts on cloud computing and what businesses should be aware of when investing in the cloud.


Karen: So, how does enterprise search and cloud computing fit together?  What’s good about it for companies?

Michael: The advent of cloud computing makes it a lot easier for companies to get into search without investing a huge sum of money up front. Some of the pay-as-you-go computing approaches make it possible to do things that in the past wouldn’t have been financially viable such as natural language processing on content.  Something that could have taken days, weeks, or even months can now take much less time by throwing more hardware at a problem for a shorter time span.

For example, you could throw 20 machines at a problem for 12 hours, do a bunch of computations in a massively parallel way, and then stop as soon as it’s done…versus the old model where you have to buy all the hardware, or rent it, and make sure it’s not underutilized so you make your investment back.
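
To put rough numbers on it, at a hypothetical on-demand rate of $0.50 per instance-hour, those 20 machines running for 12 hours cost 20 × 12 × $0.50 = $120 of compute, with nothing left idle afterward; buying the equivalent hardware outright costs far more up front and then has to be kept busy to pay for itself.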

But if you need a lot of processing power for a short amount of time, it’s really quite amazing what we can do now with an approach like this.

Karen: Is this a new technology for TNR?

Michael: TNR has been using cloud computing platforms for several years now—3 or 4 years.  Cloud computing in itself is sort of a buzz word, because distributed processing and hosting has been around for a while, but the pay-as-you-go computing model is relatively new. So we have a great deal of experience with the reality of cloud computing platforms vs. the promise of cloud computing platforms.

Karen: So, what is the difference between the “promise” and the “reality” of cloud computing platforms?

Michael: Well, a lot of people think of cloud computing as this magical thing; all their problems will be solved and it will be super dependable because there are very large businesses like Amazon running the underlying infrastructure and you don’t have to worry about it.

But as the physical infrastructure becomes easier to deploy, other critical factors come into play. You won’t have to worry about the physical logistics of getting hardware in place. But you will have to manage multiple instances, and when you provision temporary processing resources, you have to remember to retire them when they’re no longer needed. Otherwise you’ll be paying more than you need to. Since virtualization uses physical hardware you do not control or maintain, there are fewer warning signs of a potential systemic failure. Now Amazon, which is the provider we use the most, does a good job of backing up instances and making things available to you even when there are failures. But we’ve had problems where we’ve lost entire zones. Even with multiple machines configured for fault tolerance, Amazon has experienced outages that have taken entire regions offline despite every conservative effort to ensure continuous uptime. So we’ve had entire service clusters go down because of problems Amazon was having. It becomes critically important for companies to develop and maintain a disaster recovery plan. Companies need to make sure the things that are critical are backed up in multiple locations. Historically, this has been hard to do because companies typically buy enough equipment for production needs, but not enough for development and staging environments.

Karen: That sounds like a costly mistake.

Michael: It can be very costly because people often develop disaster recovery plans without ongoing testing to confirm the approach continues to work. If the approach is flawed, when you do suffer an outage, you can be offline for hours, days or weeks. Even worse, you may not be able to recover your data at all.

Karen: That sounds extremely costly.

Michael: Yes, it’s no fun at all.

There are upsides though. One plus is that cloud computing forces you to be more formal about how you manage your technical infrastructure. For example, for training purposes, with a new developer we can just give them a copy of a production system and have them go to town on it, making modifications, whatever, without risking the actual production servers. And if they make a mistake, which is human (you have to factor in human error), you can provision a brand new instance and retire the one that is fouled up, instead of spending hours and hours trying to fix the problem on the machine they were working on.

Karen: This sounds like it’s a lot more flexible and time efficient, with a layer of safety built in.

Michael: Yes. Cloud computing also comes in handy if you ever have a security breach. If a hacker gets in and the system is compromised, system administrators can go in and try to correct the problem, but hackers can often install backdoors to get back in and out. A cloud platform with good disaster contingency and backups lets system administrators bring the whole compromised instance down and stand up a patched replacement on a brand new machine, leaving the breached system behind. This is pretty easy to do with a cloud platform.

Karen: So TNR can help their clients do all these things?

Michael: Yes, we’ve worked with large customers over many years and we’ve seen a wide variety of things that can possibly go wrong, and we’ve been through several physical service outages both with Amazon Web Services and with Rackspace.

Cloud computing in itself is no panacea, but if you have the technical and organizational proficiency to effectively leverage the platform, it can be a powerful tool to accelerate your company’s rate of innovation.

If you are assessing the cloud as a solution in your business, contact us.  There are a variety of options for hosting that can save your company money and minimize outages. Let us show you the option that is the best fit for your organization.

Migration from FAST ESP to Lucene Solr

Download the presentation and see the video.

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, spoke at the Lucene Revolution conference in Boston, MA, October 7-8, 2010. Michael reviewed migration from FAST ESP to Lucene/Solr open source search. He discussed approaches to identifying core content areas of HTML documents, such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, reviewed specific use cases that we have encountered as search integration experts, and discussed available tools.

TNR Global was a sponsor of Lucene Revolution. The conference gathered over 400 professionals from the enterprise search industry. We were happy to see so much interest in Lucene/Solr open source search, and to get to know and learn from the folks who have done large scale implementations, including Twitter, LinkedIn, and eHarmony. Not surprisingly, there was a lot of interest in migration from proprietary search systems to Solr, especially from FAST ESP, due to Microsoft’s discontinuing FAST ESP support for Linux. If you would like to learn more about how a migration from FAST ESP to Lucene Solr can benefit your company, contact us for a free consultation.

Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr. You have tested it out running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem: the fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you can manage to make things “fit” with misnamed fields just for the purpose of experimenting, you also face the problem that their properties may not be what you expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives you the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data types in the default schema. Here are some of the dynamic fields found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern that is either at the beginning or end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (e.g. article_title_s, article_content_t, posted_date_dt, etc.), and Solr will dynamically create a field of the corresponding type with the name that you give it.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>
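
The same document can also be added from Python. Here is a minimal sketch using solrpy (the module used later in this post for sitemap indexing); the Solr URL and document id are assumptions:

import solr

# Connect to a hypothetical local Solr instance
conn = solr.SolrConnection('http://localhost:8983/solr')

# The field names carry the dynamic-field suffixes from schema.xml:
# *_s -> string, *_t -> text, *_dt -> date
conn.add(
    id='article-1',
    article_title_s='My Article',
    article_content_t='Lorem Ipsum...',
    posted_date_dt='1995-12-31T23:59:59Z',
)
conn.commit()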

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp
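
Dynamic fields also behave like any other field at query time, including for faceting. Here is a hedged sketch that hits the standard select handler over HTTP and facets on one of the *_s fields; the host, port, core path, and field name are assumptions:

import json
import urllib
import urllib2

params = urllib.urlencode({
    'q': '*:*',
    'facet': 'true',
    'facet.field': 'article_title_s',  # a dynamically created string field
    'wt': 'json',
})
response = urllib2.urlopen('http://localhost:8983/solr/select?' + params)
results = json.loads(response.read())

# Facet counts come back as a flat [value, count, value, count, ...] list
print results['facet_counts']['facet_fields']['article_title_s']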

Using dynamic fields is a great way to get started using Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

  • BeautifulSoup
  • solrpy (imported as solr)

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap click here: http://www.sitemaps.org/protocol.php.  You can search the web for scripts that will automatically generate sitemaps from common CMSs like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have a valid sitemap.

We will also assume that you have the default Solr schema.xml installed.

Write the following python script sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break;

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html == None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL, and if no errors are found it will add the content to the Solr index. This process can take several minutes. There may be errors parsing many of the documents; they will simply be skipped, and you may have to fine-tune the parser to fit your specific needs.
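
For example, if the pages you are indexing keep their main content inside a known wrapper element, you could index just that element rather than the entire body. A small sketch of that kind of tweak to the loop above (the div id here is a made-up assumption about your markup):

# Instead of taking the entire <body> in the loop above:
content_div = soup.find('div', {'id': 'main-content'})  # hypothetical wrapper id
if content_div is not None:
    body = unicode(content_div)  # index only the main content region
else:
    body = str(soup.html.body).decode("utf-8")  # fall back to the full body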

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.

How to get the MongoDB server version using PyMongo

If you’re using server-side features of MongoDB that have a minimum version requirement (like pushing a unique value to a list), it is a good idea to make sure you have the required version running on the server. To check the version of the MongoDB server using PyMongo, you can use something like this:

import pymongo

def check_mongodb_version():
    connection = pymongo.Connection()
    # Build numeric version tuples (e.g. (1, 4, 2)) so the comparison is
    # numeric rather than a character-by-character string comparison
    serverVersion = tuple(int(x) for x in connection.server_info()['version'].split('.'))
    requiredVersion = tuple(int(x) for x in "1.3.3".split("."))
    if serverVersion < requiredVersion:
        # handle the error
        return 1
    # ...

It’s important to note that you must connect to the admin database to determine the version number. Otherwise, you will probably run into something like this:

pymongo.errors.OperationFailure: command SON([('buildinfo', 1)]) failed: access denied

If you need to check the version of the server from the interactive prompt, run the following from the mongo prompt:

> db.version()
1.4.2

How to create a duplicate ESP collection without re-crawling!

In a production (or even stable) ESP environment, it is difficult to make a change to the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection.

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Upon determining that your changes are producing good results, you can easily migrate your front-end to the new collection while still maintaining existing stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command where origcollection is the original collection and newcollection is the new collection (that you just created):

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • the default specified above is a content feeding destination, as defined in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. Specifying default sets the destination to the current ESP install.
    • be sure to run the above command using either nohup or screen as it will not exit until all content has been fed to the new collection. For large collections this may take a while.
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler