Archive for the ‘scripts’ Category

How to Index a Site with Python Using solrpy and a Sitemap

by Jeff Peck

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap click here: http://www.sitemaps.org/protocol.php.  You can search the web for other scripts that will automatically make sitemaps out of common CMS’s like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have have a valid sitemap.

We will also assume that you have the default Solr schema.xml installed.

Write the following python script sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break;

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html == None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=body)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and if no errors found will add it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents. They will simply be skipped, you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you need assistance with any part of the Solr implementation, feel free to contact us.  We can get you started, step in and assist, or perform and entire integration of a Solr solution.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.

Perl script documentation with Pod::Usage

by admin

One of the most important parts of maintainable and usable system administration scripts is documentation. Code comments are a key component to this, but so is usage documentation. In this post, I’ll address an easy way to add complete and useful usage documentation to your scripts using the module Pod::Usage.

Pod::Usage is a module that lets you easily convert Pod documentation to a help message or man page. To use it, you just need to include several Pod sections in your documentation. These include the NAME, SYNOPSIS, OPTIONS, and DESCRIPTION sections. Pod::Usage will use some of these sections to generate usage information and man pages. Once you’ve written all of your documentation, you can use Getopt::Long to capture options passed to the script like –help and –man. For these options, you can run the function ‘pod2usage’, with varying levels of verbosity.

For example, ‘pod2usage(-verbose => 1)’ will print out a short usage message (generated from the SYNOPSIS Pod section). To open a full man page in your default man viewer, you can use a verbosity of 2. Additionally, you can print out a string before your generated usage message, by providing the ‘-msg’ option inside the function call. For example, ‘pod2usage(-msg => “Not enough arguments”, -verbose => 1)’ will print “Not enough arguments”, followed by the usage message for your script.

You can find documentation on the Pod::Usage module as well as example code on perldoc.perl.org.

Working with missing Makefile.PL scripts in CPAN modules

by admin

Some versions of CPAN (notably, the one that ships with Red Hat / CentOS 4.6, v1.7601) will automatically attempt to create a ‘Makefile.PL’ script if a module you’re trying to install through CPAN is missing one. However, sometimes this may lead to an error during the module installation, usually something like: “Too early to specify a build action ‘Build’.  Do ‘Build Build’ instead.”. This appears to be caused by a missing argument to ExtUtils::MakeMaker’s WriteMakefile subroutine. If the ‘PL_FILES’ argument doesn’t exist, MakeMaker will incorrectly attempt to use the Build.PL file included with the module. Read the rest of this entry »

Securely specify mysql credentials in automated scripts

by admin

Often, you may want to run a script that uses a username and password to access data in a MySQL database. Securely running a script like this manually is easy – simply use the ‘-p’ option for the MySQL client, and it will prompt you for the password. However, this is not an option if you want to automate the script.

There are several ways to provide the password in a way that can be used with automated scripts, but only one that is both flexible and secure. You can specify the password on the command line itself (with ‘mysql -p ‘); however, this allows the password be seen by other users who run commands like ‘ps’. Another option is setting the environment variable “MYSQL_PWD” to the password, but this can also be seen by other users. Read the rest of this entry »

Human readable disk usage

by admin

Finding out what directories and files are using up the disk space on your server is fairly easy with du, but the output is not always easy to read.

However, it’s not too hard to pretty up the output with some perl and the module Number::Bytes::Human (available from CPAN). To convert normal du output into a more human readable form, which shows file size in the correct units (K, M, or G) and also includes the percentage of the total space each entry is using, use the follwing steps. Read the rest of this entry »