Michael McIntosh, TNR Global, speaking at Lucene Revolution – The First Conference Dedicated to Open Source Enterprise Search

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, will be presenting at the Lucene Revolution conference in Boston, MA October 7-8, 2010.

Michael will review the migration from commercial search platforms (focusing on Fast ESP) to Lucene/Solr open source search. He will discuss approaches to identifying core content areas of HTML documents such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, and will review specific use cases that we have encountered as search integration experts and discuss available tools.

For more information see:
http://lucenerevolution.com/speakers-bios#McIntosh

New Website for Innovara, Inc.

TNR Global Joomla! Services launched the new Innovara, Inc. site this summer. Innovara, with headquarters in Hadley, MA and with a global reach, provides highly customized training, medical thought leadership, and business development services for the healthcare, pharmaceutical, biotechnology, and medical device industries.
The new Innovara website was created with the Joomla! Content Management System. Highlights include a custom design provided by Meg McCarthy Design, document management, multilingual home pages, and the implementation of advanced access control list capabilities, to create custom groups of staff and clients who can access and add content to specific areas of the website.

Intermediate Joomla! Workshop held August 5th, 2010 in Hadley, MA

Tamar Schanfeld of TNR Global led an all-day Intermediate Joomla! Workshop on August 5th, 2010 in Hadley, MA. Topics included backups, security, search engine optimization, social networking, blogs, and more.

Participant feedback was very positive! “Lots of info – very pleased,” “Exceeded expectations,” “A very patient instructor.”

Interested in taking a Joomla! workshop? Email us at info@tnrglobal.com and we’ll notify you of upcoming Joomla! classes and events. TNR Global also provides online training and small group training for businesses. Contact us today!

For those who were not able to attend – TNR will be posting a Joomla! tip of the day on our TNR Joomla! Blog.

TNR Global to present at Lucene Revolution Conference

TNR Global has been selected to present at the Lucene Revolution conference in Boston, MA, October 7-8, 2010. Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, will speak on Friday, October 8th, about migrating from commercial search platforms (focusing on Fast ESP) to Lucene/Solr open source search. He will discuss approaches to identifying core content areas of HTML documents, such as Text-To-Tag Ratio Heuristics and Page Stereotype / Site Template Analysis. He will also review specific use cases that we have encountered as search integration experts and discuss available tools.

Lucene Revolution is the first conference dedicated to open source search in North America. The two-day conference is packed with technical sessions, developer content, user case studies, panels, and networking opportunities. Attendees will learn new ways to develop, deploy, and enhance search applications using Lucene/Solr. For more information, and to register, visit http://www.lucenerevolution.com.

Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr. You have tested it out by running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem: the fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your dataset. Even if you manage to make things “fit” with misnamed fields just for the purpose of experimenting, their properties (type, indexed, stored) may not be what you expect.

Of course, you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives you the option to define dynamic fields. Further, the default schema already includes pre-defined dynamic fields for many of the common data types. Here are some of the dynamic fields found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

Each name attribute is a glob-like pattern with a wildcard (*) at either the beginning or the end. With the above dynamic fields, you can index data under any field name that ends in one of the listed suffixes (e.g. article_title_s, article_content_t, posted_date_dt), and Solr will treat it as a field of the corresponding type without any explicit definition in the schema.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>
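
To make the suffix matching concrete, here is a small Python sketch. It is only an illustration built on the standard fnmatch module, not how Solr itself resolves field names. The DYNAMIC_FIELDS table and the resolve_type function are hypothetical names, the patterns are copied from the schema excerpt above, and the code is written for Python 3.

```python
from fnmatch import fnmatchcase

# Patterns and types copied from the default schema.xml excerpt above.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_l", "long"),
    ("*_t", "text"),
    ("*_b", "boolean"),
    ("*_f", "float"),
    ("*_d", "double"),
    ("*_dt", "date"),
]


def resolve_type(field_name):
    """Return the type of the first pattern that matches field_name, or None."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


for name in ("article_title_s", "article_content_t", "posted_date_dt", "price"):
    print(name, "->", resolve_type(name))
```

A name such as price, which carries no recognized suffix, matches none of these patterns, so it would need an explicit field definition in the schema.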

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started at using Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point for building the Solr index for any website with a sitemap, within minutes. Simply modify the script to use your Solr instance, run it with the path to your valid XML sitemap, and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

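To start, you need to be running Python 2.6 and have the following modules installed (both are imported by the script below):

solrpy (imported as solr)
BeautifulSoup (the 3.x series, imported as BeautifulSoup)

You can install these using easy_install or manually.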

You will also require an Apache Solr instance. (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap. For detailed information on how to construct a sitemap, see http://www.sitemaps.org/protocol.php. You can search the web for scripts that automatically generate sitemaps for common CMSs such as WordPress and Joomla!, and standalone sitemap generators are also available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4 MB). We will assume that you have a valid sitemap.
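
For reference, the part of a sitemap that matters here is the list of url elements, each with a loc child, all in the sitemaps.org namespace. The sketch below parses a tiny hand-written sitemap with Python's standard ElementTree module. The example.com URLs and the extract_urls name are made up for illustration, and the code targets Python 3 rather than the Python 2.6 used by the indexer script.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# A two-entry sitemap with placeholder URLs (example.com is illustrative).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2010-06-01</lastmod>
  </url>
</urlset>
"""


def extract_urls(sitemap_xml):
    """Return the text of every <url>/<loc> element in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    # Element names must be qualified with the namespace in {braces}.
    path = "{%s}url/{%s}loc" % (SITEMAP_NS, SITEMAP_NS)
    return [loc.text.strip() for loc in root.findall(path)]


for url in extract_urls(SAMPLE_SITEMAP):
    print(url)
```

If findall returns nothing for your own sitemap, a missing or different xmlns on the urlset element is a likely cause, since the namespace-qualified path will then not match.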

We will also assume that you have the default Solr schema.xml installed.

Save the following Python script as sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html is None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and, if no errors are found, adding it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents. These documents are simply skipped, so you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.
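
One detail worth knowing before you re-run the script: the id of each document is the MD5 hash of its URL, so the same URL always produces the same id. Assuming the default schema, where id is the unique key, indexing a page again replaces its earlier document instead of adding a duplicate. The sketch below shows the id derivation on its own. The url_to_id name and the example.com URL are illustrative, and the code is written for Python 3, where the URL must be encoded to bytes before hashing.

```python
import hashlib


def url_to_id(url):
    """Return a stable 32-character hexadecimal id for a URL."""
    # Python 3 hashes bytes, so the URL is encoded first.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


first = url_to_id("http://www.example.com/about.html")
second = url_to_id("http://www.example.com/about.html")
other = url_to_id("http://www.example.com/contact.html")

print(first == second)  # the same URL maps to the same id
print(first == other)   # different URLs map to different ids
```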

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.
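
As a starting point for forming query strings, here is a sketch that builds a URL for Solr's standard select handler. It only constructs the URL and does not contact a server. The build_select_url name, the default field list, and the title:solr query are assumptions chosen for illustration; the base URL is the solrUrl value from the script. The code is written for Python 3.

```python
from urllib.parse import urlencode


def build_select_url(solr_url, query, rows=10, fields="id,url_s,title"):
    """Build a select URL asking for `rows` results in JSON format."""
    params = {
        "q": query,     # the query string, e.g. a field:value pair
        "rows": rows,   # maximum number of documents to return
        "fl": fields,   # comma-separated list of fields to return
        "wt": "json",   # response format
    }
    return solr_url.rstrip("/") + "/select?" + urlencode(params)


print(build_select_url("http://localhost:8080/sitemap-indexer-test", "title:solr"))
```

Pasting the printed URL into a browser is a quick way to confirm that the documents added by the script are searchable.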

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.
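
To give an idea of what more selective parsing could look like, the sketch below extracts the page title and the visible text, skipping script and style content, instead of storing the whole body markup. Note that it deliberately uses the html.parser module from the Python 3 standard library rather than BeautifulSoup, so that it runs without extra modules. The TitleAndTextExtractor class and the extract function are hypothetical names, and the sample page is made up.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collect the <title> text and the visible text of a page."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (title, visible_text) for an HTML string."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip(), " ".join(parser.text_parts)


sample = """<html><head><title>My Article</title>
<style>p { color: red; }</style></head>
<body><h1>Heading</h1><p>Lorem ipsum.</p>
<script>var x = 1;</script></body></html>"""

title, text = extract(sample)
print(title)
print(text)
```

Indexing plain text like this, rather than raw markup, keeps tag names and attributes out of the search index.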

Joomla! Workshop at MSBDC – Springfield, MA

On May 26, 2010, TNR Global Joomla! Services held a workshop, “Create an Interactive Website for Your Small Business – A One Day Bootcamp,” for the Massachusetts Small Business Development Center in Springfield, MA. The workshop was taught by Tamar Schanfeld, Joomla! Project Manager, and Natasha Goncharova, Managing Director at TNR Global.

The workshop covered the fundamentals of the Joomla! Open Source Content Management System, and provided participants with the foundations for creating a small business website of their own. Participants included Co-op Power Inc., Good Time Stove Company, Zaccheo Properties, and more.  See some of our students’ sites:

Three Sisters Sanctuary: http://www.threesisterssanctuary.com
Christine’s Cuisine: http://christinescuisine.net
Zaccheo Properties: http://zaccheoproperties.com/

How to get the MongoDB server version using PyMongo

If you’re using server-side features of MongoDB that have a minimum version requirement (like pushing a unique value to a list), it is a good idea to make sure you have the required version running on the server. To check the version of the MongoDB server using PyMongo, you can use something like this:

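import re
import pymongo

def versionTuple(version):
    # Compare numbers, not strings: as strings, "10" sorts before "3"
    return tuple(int(part) for part in re.findall(r'\d+', version)[:3])

# Later PyMongo releases replaced Connection with MongoClient
connection = pymongo.Connection()
serverVersion = versionTuple(connection.server_info()['version'])
requiredVersion = versionTuple("1.3.3")
if serverVersion < requiredVersion:
    # handle the error
    return 1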
...
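
One caveat when comparing version numbers: the pieces need to be compared as integers. Compared as strings, "10" sorts before "3", so a 1.10.0 server would look older than 1.3.3. The sketch below shows a small parsing helper that can be tried without a MongoDB server. The parse_version name is hypothetical, and the version strings are examples rather than output from a real server. It is written for Python 3.

```python
import re


def parse_version(version_string):
    """Turn a string such as '1.4.2' into a tuple of ints: (1, 4, 2)."""
    # Keep only the leading dotted numeric part, so a suffix like "-rc0" is ignored.
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError("No version number in %r" % version_string)
    return tuple(int(part) for part in match.group(0).split("."))


required = parse_version("1.3.3")
print(parse_version("1.4.2") >= required)
print(parse_version("1.10.0") >= required)

# Comparing the raw string pieces gets the 1.10.0 case wrong:
print(tuple("1.10.0".split(".")) >= tuple("1.3.3".split(".")))
```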

It’s important to note that you must connect to the admin database to determine the version number. Otherwise, you will probably run into something like this:

pymongo.errors.OperationFailure: command SON([('buildinfo', 1)]) failed: access denied

If you need to check the version of the server from the interactive prompt, run the following from the mongo prompt:

> db.version()
1.4.2

TNR Global and STCC organized CloudCamp Western Massachusetts


TNR Global was the co-organizer and a sponsor of CloudCamp Western Massachusetts that took place on April 20, 2010, 2:30pm-7pm, at the National Science Foundation funded National Center for Information and Communications Technologies (ICT Center) at Springfield Technical Community College. This event was co-organized by CloudCamp co-founder Dave Nielsen and the ICT Center.

Developers, decision makers, end users, and vendors from MA, CT, VT, and surrounding states participated and presented at the event. CloudCamp Western Massachusetts provided a central point for bringing together local academia and businesses. The ICT Center streamed live video of the event to other technology community colleges around the nation. http://www.cloudcamp.org/westernmass

Presentations can be seen here.

Pictures can be seen here.

The speakers included:
David Irwin, UMass Amherst CS Department
Alex Barnett, Intuit Partner Program
Rich Roth, CEO, TNR Global
Chris Bowen, Microsoft Azure
Jim Kurose, Mass Green High Performance Computing Center

Upcoming Joomla! Workshop by TNR Global — May 26, 2010

TNR Global will hold the second Joomla! workshop in Springfield, MA on May 26, 2010.

‘Create an Interactive Website for Your Small Business – A One Day Bootcamp’

Learn the basics of Joomla! – the popular open source content management system that lets you build complex websites without a programming or design background. Create a new website including an event calendar, photo gallery, contact form, and more. Learn to plan your site, enter and edit content and menus, and install extensions.

Minimum technical skills required:  comfort with Microsoft Word and your internet browser. Note: this workshop will not include e-commerce or shopping cart features.

Date: Wednesday, May 26, 2010
Time: 9:00 a.m. – 4:30 p.m.
Location: Scibelli Enterprise Center, 1 Federal Street, Springfield (directions)
Cost: $75
Contact: Western Regional Office SBDC at 413-737-6712 or msbdc@msbdc.umass.edu.
(The MSBDC only accepts payment by check, but you may submit your registration online and mail a check.)

Registration is through the MSBDC
http://www.msbdc.org/wmass/training_reg.html