Dynamic Fields in Apache Solr

July 1st, 2010 by Jeff Peck

So, you’ve installed a fresh copy of Apache Solr. You have tested it out running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem. The fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for the purpose of being used with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you can manage to make things “fit” with misnamed fields even just for the purpose of experimenting, you also face the problem that their set properties may not be what you would expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are the some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern that is either at the beginning or end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (i.e. article_title_s, article_content_t, posted_date_dt, etc.) and Solr will dynamically create any dynamic field of the particular type with the name that you give it.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started at using Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

July 1st, 2010 by Jeff Peck

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap click here: http://www.sitemaps.org/protocol.php.  You can search the web for other scripts that will automatically make sitemaps out of common CMS’s like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have have a valid sitemap.

We will also assume that you have the default Solr schema.xml installed.

Write the following python script sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break;

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html == None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=body)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and if no errors found will add it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents. They will simply be skipped, you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.

How to get the MongoDB server version using PyMongo

May 20th, 2010 by Ross Farinella

If you’re using server-side features of MongoDB that have a minimum version requirement (like pushing a unique value to a list), it is a good idea to make sure you have the required version running on the server. To check the version of the MongoDB server using PyMongo, you can use something like this:

import pymongo

connection = pymongo.Connection()
serverVersion = tuple(connection.server_info()['version'].split('.'))
requiredVersion = tuple("1.3.3".split("."))
if serverVersion < requiredVersion:
    # handle the error
    return 1
...

It’s important to note that you must connect to the admin database to determine the version number. Otherwise, you will probably run into something like this:

pymongo.errors.OperationFailure: command SON([('buildinfo', 1)]) failed: access denied

If you need to check the version of the server from the interactive prompt, run the following from the mongo prompt:

> db.version()
1.4.2

How to create a duplicate ESP collection without re-crawling!

March 29th, 2010 by Ross Farinella

In a production (or even stable) ESP environment, it is difficult to make a change to the Document Processing Pipeline and test it without wiping out the existing collection (not to mention the time it takes to perform a full re-crawl if the collection is even moderately large). In this case, the best option is to use postprocess to feed existing documents to a new (empty) collection.

Making a duplicate collection provides several benefits:

  • No re-crawling is required
  • The original collection is not affected by pipeline changes
  • You can test your new collection without touching the stable data
  • Upon determining that your changes are producing good results, you can easily migrate your front-end to the new collection while still maintaining existing stable data in the original collection (in case you want to revert your changes)

Steps to make a duplicate collection

  1. Using the ESP Admin GUI, create a new collection with the pipeline you would like to use (or test, as the case may be)
  2. Do not specify any data sources when configuring the new collection
  3. Stop the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl stop crawler

  4. Run the following command where origcollection is the original collection and newcollection is the new collection (that you just created):

    $FASTSEARCH/bin/postprocess -R origcollection -k default:newcollection

    Notes about this command:

    • the default specified above is a content feeding destination, as specified in the destinations section of $FASTSEARCH/etc/CrawlerGlobalDefaults.xml. Specifying default will specify the destination as the current ESP install.
    • be sure to run the above command using either nohup or screen as it will not exit until all content has been fed to the new collection. For large collections this may take a while.
  5. Restart the Enterprise Crawler:

    $FASTSEARCH/bin/nctrl start crawler

Fast ESP Error: no doc procs registered to process a batch with priority 0

March 2nd, 2010 by Ross Farinella

Just wanted to take this error message off of the, “Hey, we’ve seen this before… now how did we resolve this..?” pile.  This is the full text of the error:

WARNING    Could not send batch to ESP content distributor, will retry automatically.
Reason given: process() failed: exception (no_resources) no doc procs registered to
process a batch with priority 0

At first glance, it looks pretty clear that you just need to [re]start your document processor(s).  However, this won’t necessarily solve the problem.  Turns out that the a likely reason for this to pop up is a bad Document Processing Pipeline (DPP) Stage.  The docprocs fire up, hit the bad stage (e.g. python errors etc.) and don’t recover.

To debug your DPP Stage, take a look at the logs for the document processor(s).  They’re usually located in $FASTSEARCH/var/log/procserver and, in our experience, there’s probably an uncaught python exception lurking somewhere in there.

Postfix SMTP AUTH w/TLS

January 25th, 2010 by Michael Klatsky

One of the widely discussed issues with Amazon EC2 instances is the inability to reliably send email from the instances. In all too many cases, email from EC2  instances is automatically categorized as spam by the various relay databases, and by many ISP’s and carriers. There are several solutions, with the most common being a smarthost setup using either an external smarthost smtp service, such as http://authsmtp.com, or using an existing smtp server within our infrastructure. Read the rest of this entry »

Integrate custom services with the Fast ESP Node Controller

January 12th, 2010 by Ross Farinella
Add your own services to ESP's Node Controller

Add your own services to ESP's Node Controller

Background

Integrating our own custom services with Fast ESP’s Node Controller provides us with several benefits:

  • Administrators without in-depth ESP knowledge can easily control services (e.g. start, stop, configure parameters)
  • Services can be started at boot time with the rest of ESP
  • espdeploy can be used to install our services in a multi-node cluster

The components required for this system are:

  1. The ESP Node Controller (with config file NodeConf.xml)
  2. The 3rd-party service (like a CherryPy server, log parser, etc.)
  3. A wrapper script (see below)

Steps for Integration

  1. Define the service you would like to integrate. It can be any script or binary that can be executed on the system. For example, the service might be a python script that takes command-line arguments and continues running itself (as is the case with a webserver).
  2. Create the wrapper script that sets up the proper environment and runs/stops the service properly. The wrapper script should be put in the $FASTSEARCH/bin directory (with executable permissions). Additionally, the wrapper script should pass $@ to your actual script so any/all arguments defined in $FASTSEARCH/etc/NodeConf.xml will be passed along properly from the Node Controller to your service. The following is an example of a wrapper script:
    #!/bin/sh
    
    # export the proper python path
    export PYTHONPATH=":/path/to/python"
    
    # run the script (backgrounded)
    python $FASTSEARCH/lib/python2.6/yourmodule/yourservice.py $@ &
    
    # determine the process id of the python script
    SCRIPT_PID="$!"
    
    # upon receiving a SIGTERM, forward it to the process
    trap "kill -TERM $SCRIPT_PID" SIGTERM
    
    # wait for SIGTERM from nctrl
    wait
  3. Define the service in $FASTSEARCH/etc/NodeConf.xml
    Add the following to the end of the <startorder> tag:
    <proc>servicename</proc>

    Add the following to the end of the <node> tag, customizing as appropriate:

    <!-- My Custom Service -->
    <process name="servicename" description="My Custom Service">
            <start>
                    <executable>binaryname</executable>
                    <parameters>-p 16940 -v</parameters>
                    <port base="4004"/>
            </start>
            <outfile>servicename.scrap</outfile>
    </process>
  4. Reload the Node Controller configuration with the following:
    nctrl reloadcfg

And that’s it!  Now you should be able to start, stop, configure, and deploy your services using Fast ESP tools.  Enjoy!

Cloud Enabled Personalized Genomics

November 13th, 2009 by Cheryl Handsaker

Personalized medicine is a goal of the Department of Health and Human Services. It is a driver of genomic research. It is one version of the future of medicine, using our unique genetic code toward the prevention of disease and the use of more effective or safer tailored drug therapies. Cloud computing enables access to the computational resources needed, on demand, for the data analysis needed to lay the groundwork for revolution in health care. Read the rest of this entry »

Will Amazon AWS save me money?

November 10th, 2009 by Michael Klatsky

One of this first questions we asked when deciding to sue Amazon’s AWS services was: Will we save money?

At first, your EC2 servers may be simply architected, perhaps small instances. Once a more serious commitment is made for a more robust architecture at Amazon, there will be additional costs introduced. For example, a small instance, running a 2 disk 100GB EBS volume w/RAID0 plus monitoring and a static IP, and the cost goes from the current ~$68 monthly per system to ~$88 monthly per system. Bump the instances to large instances (likely needed for any server running mysql, or any other equally intensive application) and that cost goes to ~$265 per instance. Add in the costs of the additional services like bandwidth, Static IP, Cloudwatch etc and the costs can quickly escalate. Of course, upfront payments for Reserved Instances can drastically reduce the costs further.

However, the savings in development and deployment costs I think far outweighs a narrower gap in the savings between physical and AWS servers, and the real MRC on the AWS servers will likely be lower for a given amount of computing resources.

So- can you save money? Yes. In some cases, it will be a direct apples to oranges savings of hard dollars. In other cases, the agility gained will provide the greatest savings. In most cases- a combination of both will drive your cost savings.

A better way to add or update MySQL rows

October 22nd, 2009 by Ross Farinella

Recently, we needed to iterate over a fairly large data set (on the order of millions) and do the ever-common If it’s not in the database, put it in.  If it’s already there, just update some fields. It’s a pattern that is very common for things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious way of doing a SELECT, followed by either an UPDATE or an INSERT is too slow for even moderately-large datasets.  The better way to accomplish this is to use MySQL’s ON DUPLICATE KEY UPDATE directive.  By simply creating a unique key on the fields that should be different per-row, this syntax provides two specific benefits:

  • Allows batch (read: transaction) queries for large data
  • Increases performance overall versus making two separate queries

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback to this method, however, is that it may put additional load on your database server.  Like anything else, it’s worth testing out your individual situation but, for us, ON DUPLICATE KEY UPDATE was the way to go.