If you are looking for a fast and easy way to populate a Solr instance using Python, read on.
The script provided here is a basic starting point for building a Solr index for any website with a sitemap, within minutes. Simply modify the script to point at your Solr instance, run it with the path to your valid XML sitemap, and it will begin populating your Solr index.
While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.
To start, you need to be running Python 2.6 and have the following modules installed:
- Solrpy – http://code.google.com/p/solrpy/
- BeautifulSoup – http://www.crummy.com/software/BeautifulSoup/
You can install these using easy_install or manually.
You will also need an Apache Solr instance. (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)
Ideally you will use this script on your own sitemap. For detailed information on how to construct a sitemap, see http://www.sitemaps.org/protocol.php. You can search the web for scripts that automatically generate sitemaps for common CMSs such as WordPress and Joomla, and there are standalone sitemap generators as well. A valid sitemap for testing is available at http://www.google.com/sitemap.xml (~4 MB). The rest of this article assumes that you have a valid sitemap.
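For reference, a minimal sitemap that this script can consume looks like the following (the URL is a placeholder; the script only reads the `loc` elements):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2011-01-01</lastmod>
  </url>
</urlset>
```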
We will also assume that you have the default Solr schema.xml installed.
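The script below relies on the `id`, `title`, and `text` fields plus the `*_s` dynamic field that the example schema ships with. Roughly, the relevant definitions look like this (check your own schema.xml, since attribute details vary between Solr versions):

```xml
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
```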
Write the following Python script as sitemap-indexer.py, replacing the value of solrUrl with the location of your own instance:
#! /usr/bin/env python26
"""Index links from a sitemap to a SOLR instance"""
import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

limit = 0  # How many iterations max? Enter 0 for no limit.
solrUrl = 'http://localhost:8080/sitemap-indexer-test'  # The URL of the solr instance
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'  # The xmlns for the sitemap schema

if len(sys.argv) != 2:
    print 'Usage: ./sitemap-indexer.py path'
    sys.exit(1)

sitemapTree = parse(sys.argv[1])
solrInstance = solr.SolrConnection(solrUrl)  # Solr Connection object
counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc' % (sitemaps_ns, sitemaps_ns)):
    counter = counter + 1  # Increment counter
    if limit > 0 and counter > limit:
        break  # For testing, you can set a limit to how many pages of the sitemap to consider

    url = urlElem.text  # Get the url text from the element

    try:
        response = urllib2.urlopen(url)  # Try to get the page at url
    except:
        print "Error: Cannot get content from URL: " + url
        continue  # Cannot get HTML. Skip.

    try:
        soup = BeautifulSoup(response.read())  # Try to parse the HTML of the page
    except:
        print "Error: Cannot parse HTML from URL: " + url
        continue  # Cannot parse HTML. Skip.

    if soup.html is None:  # Check if there is an <html> tag
        print "Error: No HTML tag found at URL: " + url
        continue  # No <html> tag. Skip.

    try:
        title = soup.html.head.title.string.decode("utf-8")  # Try to set the title
    except:
        print "Error: Could not parse title tag found at URL: " + url
        continue  # Could not parse <title> tag. Skip.

    try:
        body = str(soup.html.body).decode("utf-8")  # Try to set the body
    except:
        print "Error: Could not parse body tag found at URL: " + url
        continue  # Could not parse <body> tag. Skip.

    # Note: decode("utf-8") is used to avoid non-ascii characters in the solrInstance.add below

    # Get an md5 hash of the url for the unique id
    url_md5 = hashlib.md5(url).hexdigest()

    try:
        # Add to the Solr instance
        solrInstance.add(id=url_md5, url_s=url, text=body, title=title)
    except Exception as inst:
        print "Error adding URL: " + url
        print "\tWith Message: " + str(inst)
    else:
        print "Added Page \"" + title + "\" with URL " + url
        numAdded = numAdded + 1

try:
    solrInstance.commit()  # Commit the additions
except:
    print "Could not Commit Changes to SOLR Instance - Check SOLR logs for more info"
else:
    print "Success. " + str(numAdded) + " documents added to index"
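The unique document id is an MD5 hash of the URL, which makes re-indexing idempotent: the same URL always maps to the same id, so a re-crawl overwrites the existing document instead of duplicating it. A standalone sketch of that id scheme (written to run under Python 2 or 3; the URL is a placeholder):

```python
import hashlib

def url_to_id(url):
    """Return the 32-character hex MD5 digest used as the Solr unique id."""
    # encode() makes this work for both Python 2 and Python 3 strings
    return hashlib.md5(url.encode('utf-8')).hexdigest()

doc_id = url_to_id('http://www.example.com/')
print(doc_id)  # 32 lowercase hex characters
```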
Make the script executable and run it:
chmod +x sitemap-indexer.py
./sitemap-indexer.py /path/to/sitemap.xml
It will work through the sitemap, fetching and parsing the content at each URL and, if no errors are found, adding it to the Solr index. This process can take several minutes. Many documents may fail to parse; they are simply skipped, and you may need to fine-tune the parser to fit your specific needs.
Once finished, it will output the number of documents that were committed to the Solr index.
You should now be able to access your Solr instance and run queries. There are numerous resources on the web to help you form query strings, and the query form in the Solr web admin interface lets you set the various request parameters.
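As a sketch of what a query URL looks like (assuming the same host and core name used earlier in this article), the request can be assembled like this; it runs under Python 2 or 3:

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

solrUrl = 'http://localhost:8080/sitemap-indexer-test'  # same assumed instance as above

params = urlencode([
    ('q', 'title:solr'),        # search the title field
    ('rows', '10'),             # number of results to return
    ('fl', 'id,title,url_s'),   # fields to include in the response
])
query = solrUrl + '/select?' + params
print(query)
```

Fetching that URL (with urllib2 in Python 2, or urllib.request in Python 3) returns the matching documents.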
If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance, as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.
If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.