iPod Giveaway at the UMASS Career Fair September 28

TNR Global will be attending the UMASS Career Fair on Wednesday, September 28, 2011. Drop off a physical copy of your resume for a chance to win an iPod Shuffle. We will be at table #29 from 10:00AM to 3:00PM.

Email us a cover letter and resume and you will have a chance to win a second iPod Shuffle.

Stop by and chat with alumni Managing Director Natasha Goncharova, ’03, or Director of Business Development Karen Lynn, ’92, about internships and career opportunities in Enterprise Search Technology at TNR Global, LLC.

“The fine print….”

Send your resume and cover letter to jobs@tnrglobal.com with the subject heading “UMASS Career Fair.” All entries will be entered into the raffle, and one will be chosen at random to win a 2GB iPod Shuffle. Please email resumes as simple text email or as web URLs.

ELIGIBILITY:  VOID WHERE PROHIBITED OR RESTRICTED BY LAW. NO PURCHASE NECESSARY. The Raffle is open to everyone worldwide who is at least 18 years of age. Employees, officers, directors, representatives, and immediate family members of TNR Global, LLC and their respective parent companies, affiliates and subsidiaries are not eligible to participate in this Raffle.


TNR Global to Attend UMASS Career Fair in September 2011

TNR Global has reserved space at UMASS/Amherst’s Career Fair for Engineering, Natural Sciences & Technology students. “The University of Massachusetts offers a comprehensive Computer Science program where students emerge as strong candidates for the kind of technical work required of TNR software developers,” said Michael McIntosh, VP of Search Technologies for TNR. UMASS/Amherst’s Computer Science program is ranked in the top 20 for Computer Science by US News & World Report. The fair will take place on September 28th, 2011 from 10:00AM to 3:00PM in the Campus Center Auditorium. Alumni Karen Lynn and Natasha Goncharova will be representing TNR Global. Stop by and say hello!

TNR Global to present at Apache Lucene Eurocon 2011 in Barcelona

We are happy to announce that TNR Global’s own Michael McIntosh will be presenting at the Apache Lucene Eurocon 2011 in Barcelona this October. Michael’s talk is titled “Enterprise Search: FAST ESP to Lucene Solr.” His presentation will discuss migration from the FAST ESP platform to a Lucene Solr search platform. There are many reasons an IT department with a large scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and the discontinuation of the Linux platform have created an urgency for FAST users. Illustrated through actual case studies, the presentation will cover challenges and concerns and present solutions and workarounds to overcome migration issues.

Michael has more than 16 years of experience in large scale systems design and operation, online consumer product development, high volume transaction processing and engineering management. He has extensive experience developing, integrating and maintaining search technology solutions for companies such as FAST Search and Lycos.

We’re excited that Michael will be presenting in Barcelona this fall.  Please introduce yourself if you’re able to go!

Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr (http://lucene.apache.org/solr/). You have tested it out by running the examples from the Solr tutorial (http://lucene.apache.org/solr/tutorial.html). And now you are ready to start indexing some of your own data. Just one problem: the fields for your own data are not recognized by your Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are all defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your search project, and even if you are willing to put up with misnamed fields while experimenting with your instance, you also face the problem that their properties may not be what you would expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives you the option to define dynamic fields: fields that are defined in the schema with a glob-like pattern at either the beginning or the end of the name. Further, the default schema includes pre-defined dynamic fields for most of the common data types that you may use. Here are some of the dynamic fields that are defined in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (e.g. article_title_s, article_content_t, posted_date_dt), and Solr will dynamically create a field of the corresponding type with the name that you give it. After you’ve indexed some data, you can view the dynamically created fields in the schema viewer for your instance, located at http://YOUR-INSTANCE/admin/schema.jsp. For example, the following update message uses all three field names:

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>
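
To make the suffix matching concrete, here is a minimal Python 3 sketch of how a field name could be mapped to a type using the patterns listed above. It is an illustration only, not Solr's implementation: the function name resolve_type is hypothetical, and trying longer patterns first is an assumption of this sketch.

```python
# Hypothetical illustration only; this is not Solr's code.
# Patterns and types are copied from the default schema.xml excerpt above.
DYNAMIC_FIELDS = {
    "*_i": "int",
    "*_s": "string",
    "*_l": "long",
    "*_t": "text",
    "*_b": "boolean",
    "*_f": "float",
    "*_d": "double",
    "*_dt": "date",
}


def resolve_type(field_name, dynamic_fields=DYNAMIC_FIELDS):
    """Return the type of the first matching pattern, or None if none match."""
    # Assumption of this sketch: longer (more specific) patterns are tried first.
    for pattern in sorted(dynamic_fields, key=len, reverse=True):
        if pattern.startswith("*"):
            matched = field_name.endswith(pattern[1:])    # glob at the beginning
        elif pattern.endswith("*"):
            matched = field_name.startswith(pattern[:-1])  # glob at the end
        else:
            matched = field_name == pattern                # no glob at all
        if matched:
            return dynamic_fields[pattern]
    return None


for name in ("article_title_s", "article_content_t", "posted_date_dt", "summary"):
    print(name, "->", resolve_type(name))
```

A name such as summary matches none of the patterns, which corresponds to the unrecognized-field problem described at the start of this post.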

TNR Global to Attend Enterprise 2.0 Conference in Boston

We’re excited to announce that we’ll be in attendance at the Enterprise 2.0 Conference in Boston June 21-23, 2011.  Managing Director Natasha Goncharova and Director of Business Development Karen Lynn will be attending.  If you see us, be sure to say hello!

Webinar: Solr To The Rescue: Successful Migration From FAST ESP to Open Source Search Based on Apache Solr.

On Nov 18, 2010, 14:00 EST (19:00 GMT),  join us for the Webinar, organized by Lucid Imagination:  Solr To The Rescue: Successful Migration From FAST ESP to Open Source Search Based on Apache Solr.

Michael McIntosh, VP Search Solutions, TNR Global will be on the webinar panel along with Eric Gaumer, Chief Architect, ESR Technology, and Helge Legernes, Founding Partner & CTO, Findwise discussing approaches for migrations from FAST ESP to Open Source Search based on Apache Solr.

Webinar description:

Users of FAST ESP have become increasingly concerned since the purchase of Fast Search and Transfer by Microsoft in 2008. The discontinuation of the Linux platform and the subordination of FAST features into Sharepoint have created a greater sense of urgency to seek out alternatives. Many FAST ESP users are considering open source enterprise search based on Apache Solr as an option. This roundtable discussion will provide valuable insights for search users looking to make the change to Solr.

The panel covers factors driving the need for FAST ESP alternatives, differences between FAST and Solr, typical migration project lifecycle & methodology, complementary open source tools, migration pros & cons, best practices and customer examples, and recommended next steps.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

  • solrpy
  • BeautifulSoup

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap, see http://www.sitemaps.org/protocol.php.  You can search the web for other scripts that will automatically make sitemaps out of common CMSs like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have a valid sitemap.
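
For orientation, the sketch below shows the shape of a sitemap and how its URLs can be pulled out with the standard library. It is written for Python 3, unlike the Python 2.6 script in this post, and the two example.com entries and the helper name sitemap_urls are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The xmlns for the sitemap schema
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# A made-up two-entry sitemap, used only for this illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url>
    <loc>http://www.example.com/about.html</loc>
    <lastmod>2011-01-01</lastmod>
  </url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the text of every <url>/<loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    # Element names must be qualified with the sitemap namespace.
    path = "{%s}url/{%s}loc" % (SITEMAP_NS, SITEMAP_NS)
    return [loc.text.strip() for loc in root.findall(path)]


for url in sitemap_urls(SAMPLE_SITEMAP):
    print(url)
```

Without the namespace prefix on url and loc, findall returns nothing, which is an easy mistake to make when adapting a sitemap parser.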

We will also assume that you have the default Solr schema.xml installed.

Write the following Python script, sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python26
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

limit = 0 # How many iterations max?  Enter 0 for no limit.
solrUrl = 'http://localhost:8080/sitemap-indexer-test' # The URL of the solr instance
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9' # The xmlns for the sitemap schema

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		break # For testing, you can set a limit to how many pages of the sitemap to consider

	url = urlElem.text # Get the url text from the element

	try:
		response = urllib2.urlopen(url) # Try to get the page at url
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try:
		soup = BeautifulSoup(response.read()) # Try to parse the HTML of the page
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html is None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try:
		title = soup.html.head.title.string.decode("utf-8") # Try to set the title
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try:
		body = str(soup.html.body).decode("utf-8") # Try to set the body
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Note, decode("utf-8") is used to avoid non-ascii characters in the solrInstance.add below

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try:
		# Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try:
	solrInstance.commit() # Commit the additions
except:
	print "Could not Commit Changes to SOLR Instance - Check SOLR logs for more info"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and, if no errors are found, adding it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents; these are simply skipped, and you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.
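
Because the script derives each document id from the URL alone, running it again on the same sitemap produces the same ids. With the default schema, where id is the unique key, re-added pages should replace the earlier copies rather than create duplicates. The Python 3 sketch below shows the idea; build_doc is a hypothetical helper, not part of solrpy, and the example.com URL is made up.

```python
import hashlib


def build_doc(url, title, body):
    """Build the fields for one page, using the same field names as the script."""
    return {
        # md5 of the URL gives a stable 32-character hexadecimal id
        "id": hashlib.md5(url.encode("utf-8")).hexdigest(),
        "url_s": url,   # stored through the *_s dynamic field
        "title": title,
        "text": body,
    }


first = build_doc("http://www.example.com/", "Home", "<body>v1</body>")
second = build_doc("http://www.example.com/", "Home", "<body>v2</body>")
print(first["id"] == second["id"])
```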

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.
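
As a starting point, a query URL can be assembled as in the Python 3 sketch below. It assumes the standard /select handler and the common q, rows, fl and wt request parameters; the base URL is the placeholder from the script, and select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder instance URL taken from the indexing script; replace with your own.
SOLR_URL = "http://localhost:8080/sitemap-indexer-test"


def select_url(query, rows=10, fields="id,url_s,title", solr_url=SOLR_URL):
    """Return a URL for the /select handler with URL-encoded parameters."""
    params = [
        ("q", query),     # the query string
        ("rows", rows),   # maximum number of results to return
        ("fl", fields),   # fields to include in each result
        ("wt", "json"),   # response format
    ]
    return solr_url + "/select?" + urlencode(params)


print(select_url("title:search"))
```

Pasting the printed URL into a browser is a quick way to confirm that documents were indexed before moving on to the admin query form.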

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.
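
To illustrate what extracting more specific tags might look like, here is a small Python 3 sketch that collects only heading text. Note that it uses the standard library's html.parser rather than BeautifulSoup, so it is a stand-in for the approach and not the API described in the documentation above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text found inside <h1> and <h2> elements."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._in_heading = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data  # text of nested tags is included too


def extract_headings(html):
    """Return the stripped text of every <h1>/<h2> in the given HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


sample = "<html><body><h1>Title</h1><p>Intro</p><h2>Sub <em>part</em></h2></body></html>"
print(extract_headings(sample))
```

The extracted headings could then be sent to a separate field instead of, or in addition to, the full body text.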

Enterprise Search Usability

Search needs to be easy to use if you expect your end users to benefit from it. 

A few examples of usability:

  • A query/search field needs to be clearly visible on the page where users are prompted to search for information.
  • If a query returns too many results to be useful, the user can choose to ‘drill down’ to the essentials. They can limit the search to English language documents, or documents created in the past 30 days, or documents only from the finance department. Custom qualifiers can be defined.
  • When a search for ‘doors’ returns thousands of results, a navigation tool on the side will provide helpful links to the different categories – aluminum doors, wooden doors, folding doors, garage doors, etc. When you choose aluminum doors from the list, the next menu allows you to limit results by size of door, by manufacturer, by location, etc.
  • Pre-defined searches can automatically check new data and send an email to the user, or an alert to a mobile phone.

With so many features available, it is vital to implement a customized enterprise search solution based on an in-depth study of your needs. TNR Global engineers can help you at any stage of development: selecting between proprietary vendors and open source solutions, hardware options and level of upgrades.