Bishop: Makes Your Web Service Shiny

The idea is to provide a relatively small library that will make your life easier and hopefully more pleasant by making it straightforward to provide a consistent web service API that obeys HTTP semantics.

Christopher Miles, one of our Senior Software Developers here at TNR, wrote this post on Bishop.  It all started when he was asked the question:

“What happens if I send it something that’s not JSON?” … “I don’t know, but I bet it logs a really big stack trace!”

The question begged an answer, and Chris gave an extremely thorough one on his own blog.  Here’s a small excerpt that gives a taste of his analysis:

After taking a closer look at HTTP and its specification, it was clear that it could do a lot more than I had thought. Looking back on past projects, it’s painfully obvious that I’ve been taking what is really an application protocol and ignoring all of the interesting bits, instead using it as little more than a pipe to push documents through. I’ve been using either the requested URL or the parameters, or maybe even neither and simply examining the body content, thus eliminating any of the real advantages of using HTTP in the first place.

And there are advantages. The protocol is already thinking about caching your data where it makes the most sense. There’s already an algorithm for taking the list of content types that the client wants and the content types the server provides and picking the best match. It can manage safe updating of resources as well as notifying the client of conflicts. And so on; by ignoring what the HTTP protocol has to offer, I was making more work for myself.

So he decided to take matters into his own hands by creating a library. More from Chris’ blog:

The idea is to provide a relatively small library that will make your life easier and hopefully more pleasant by making it straightforward to provide a consistent web service API that obeys HTTP semantics. It will make the lives of those around you easier as well: clients can expect your service to respond to the common HTTP request methods with reasonable responses. Placing caches around your service will also be much simpler, and you’ll have some level of control over how your service’s data is cached.

Since creating this library, other developers have responded positively and are watching the project.  If you would like to see our approach to solving this problem, take a look for yourself here.

If you’d like to talk to us on how we can solve some of your enterprise search, cloud, or scalability issues, contact us.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point for building a Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance, run it with a path to your valid XML sitemap, and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

BeautifulSoup
solrpy (imported in the script as solr)

You can install these using easy_install or manually.
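For example, with easy_install (these are the package names as published on PyPI):

easy_install BeautifulSoup
easy_install solrpy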

You will also require an Apache Solr instance.  (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap, see http://www.sitemaps.org/protocol.php.  You can search the web for scripts that will automatically generate sitemaps for common CMSs like WordPress and Joomla, and there are also standalone sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have a valid sitemap.
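For reference, a minimal sitemap following the protocol above looks like this (the URL is a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>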

We will also assume that you have the default Solr schema.xml installed.

Write the following Python script as sitemap-indexer.py, replacing the value of solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html is None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Note: decode("utf-8") above avoids non-ascii characters breaking
	# the solrInstance.add call below

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5, url_s=url, text=body, title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and, if no errors are found, adding each page to the Solr index. This process can take several minutes. There may be errors parsing some of the documents; those are simply skipped, and you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in the Solr web admin interface that allows setting the various request parameters.
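For example, a simple query against the fields the script populates might look like this (hypothetical host and path, matching the solrUrl above):

http://localhost:8080/sitemap-indexer-test/select?q=text:example&fl=id,title,url_s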

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.

Perl script documentation with Pod::Usage

One of the most important parts of maintainable and usable system administration scripts is documentation. Code comments are a key component to this, but so is usage documentation. In this post, I’ll address an easy way to add complete and useful usage documentation to your scripts using the module Pod::Usage.

Pod::Usage is a module that lets you easily convert Pod documentation to a help message or man page. To use it, you just need to include several Pod sections in your documentation, including NAME, SYNOPSIS, OPTIONS, and DESCRIPTION. Pod::Usage will use some of these sections to generate usage information and man pages. Once you’ve written all of your documentation, you can use Getopt::Long to capture options passed to the script like --help and --man. For these options, you can run the function ‘pod2usage’ with varying levels of verbosity.

For example, ‘pod2usage(-verbose => 1)’ will print out a short usage message (generated from the SYNOPSIS Pod section). To open a full man page in your default man viewer, you can use a verbosity of 2. Additionally, you can print out a string before your generated usage message by providing the ‘-msg’ option inside the function call. For example, ‘pod2usage(-msg => “Not enough arguments”, -verbose => 1)’ will print “Not enough arguments”, followed by the usage message for your script.
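Putting those pieces together, here is a minimal sketch (the script name and Pod text are hypothetical, and the Pod sections are trimmed to the essentials):

#!/usr/bin/env perl
# example.pl - hypothetical script showing Pod::Usage with Getopt::Long
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

my ($help, $man);
GetOptions('help' => \$help, 'man' => \$man) or pod2usage(-verbose => 0);
pod2usage(-verbose => 1) if $help;   # brief usage message
pod2usage(-verbose => 2) if $man;    # full man page in the default viewer

print "Doing the real work here...\n";

__END__

=head1 NAME

example.pl - demonstrate Pod::Usage documentation

=head1 SYNOPSIS

example.pl [--help] [--man]

=head1 OPTIONS

=over 8

=item B<--help>

Print a brief usage message and exit.

=item B<--man>

Display the full documentation as a man page.

=back

=head1 DESCRIPTION

This section holds the longer description that appears in the man page.

=cut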

You can find documentation on the Pod::Usage module as well as example code on perldoc.perl.org.

Working with missing Makefile.PL scripts in CPAN modules

Some versions of CPAN (notably the one that ships with Red Hat / CentOS 4.6, v1.7601) will automatically attempt to create a ‘Makefile.PL’ script if a module you’re trying to install through CPAN is missing one. However, sometimes this may lead to an error during the module installation, usually something like: “Too early to specify a build action ‘Build’.  Do ‘Build Build’ instead.” This appears to be caused by a missing argument to ExtUtils::MakeMaker’s WriteMakefile subroutine: if the ‘PL_FILES’ argument doesn’t exist, MakeMaker will incorrectly attempt to use the Build.PL file included with the module.
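A Makefile.PL that sidesteps the problem would look something like this minimal sketch (module name and version are hypothetical); the explicitly empty PL_FILES is what keeps MakeMaker from trying to run Build.PL as a build script:

use ExtUtils::MakeMaker;

WriteMakefile(
    NAME     => 'My::Module',   # hypothetical module name
    VERSION  => '0.01',
    PL_FILES => {},             # empty: don't treat Build.PL (or other *.PL files) as build scripts
);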

Securely specify mysql credentials in automated scripts

Often, you may want to run a script that uses a username and password to access data in a MySQL database. Securely running a script like this manually is easy – simply use the ‘-p’ option for the MySQL client, and it will prompt you for the password. However, this is not an option if you want to automate the script.

There are several ways to provide the password in a way that can be used with automated scripts, but only one that is both flexible and secure. You can specify the password on the command line itself (with ‘mysql -p<password>’); however, this allows the password to be seen by other users who run commands like ‘ps’. Another option is setting the environment variable “MYSQL_PWD” to the password, but this can also be seen by other users.
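The full post is only excerpted here, but one widely used technique that meets both criteria is a MySQL option file that only the owner can read (the file contents below are a hypothetical sketch):

# ~/.my.cnf (lock it down with: chmod 600 ~/.my.cnf)
[client]
user     = scriptuser
password = s3cret

The mysql client reads ~/.my.cnf automatically, and a script can point at a specific file with the --defaults-extra-file option, e.g. ‘mysql --defaults-extra-file=/path/to/credentials.cnf -e "SELECT 1"’.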

Human readable disk usage

Finding out what directories and files are using up the disk space on your server is fairly easy with du, but the output is not always easy to read.

However, it’s not too hard to pretty up the output with some perl and the module Number::Bytes::Human (available from CPAN). To convert normal du output into a more human readable form, which shows file size in the correct units (K, M, or G) and also includes the percentage of the total space each entry is using, use the following steps.
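As an illustration of the idea, here is a minimal sketch (hypothetical script name; it assumes byte counts on stdin from GNU du, e.g. ‘du -sb * | ./du-human.pl’):

#!/usr/bin/env perl
# du-human.pl - reformat du byte counts with human units and percentages
use strict;
use warnings;
use Number::Bytes::Human qw(format_bytes);

my @entries;
my $total = 0;
while (<STDIN>) {
    chomp;
    my ($bytes, $name) = split /\s+/, $_, 2;   # du prints "SIZE<whitespace>NAME"
    push @entries, [ $bytes, $name ];
    $total += $bytes;
}

# Largest entries first, with sizes in K/M/G and percent of the total
for my $entry (sort { $b->[0] <=> $a->[0] } @entries) {
    printf "%-8s %5.1f%%  %s\n",
        format_bytes($entry->[0]),
        $total ? 100 * $entry->[0] / $total : 0,
        $entry->[1];
}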
