Posts Tagged ‘Solr’

The Future of Search Doesn’t Come in a Box: The Google Mini Says Goodbye

by Karen Lynn

The future of search doesn’t come in a box.

Last week while many were on vacation, Google abandoned the smallest member of its’ Search Appliance family, the Google Mini. The small blue piece of external hardware was used for smaller data sets with a stable, some might say stagnant, data with slow and steady query rates. If you were a smaller business with search demands that weren’t, well–too demanding, then this piece of hardware could help you for a reasonable price tag.

Search evolves like all technologies do. Developers incorporate emerging technologies into their skill sets, and open source technologies like Lucene Solr have matured into a competitive option for companies of all sizes. IT managers are finally ready to move away from the confines of a Search Appliance in a box and move to a more agile solution that can offer room for growth, a lightweight application, and a healthy and growing community. Without the hefty annual licensing fees of a commercial product, Solr can save small to mid sized companies and startups valuable cash resources to invest in other areas of their respective businesses.

Open source technologies aside, many are speculating if Google will retire some of its other pieces of hardware like the well know GSA (Google Search Appliance). Although Google has a newly released version 6.14 with an updated website to easily explain features. Google continues evolving its enterprise search offerings to include a hosted search solution for e-tailers called Google Commerce Search, along with their standard Google Site Search. Neither of these products come in a physical blue or yellow box, and I wouldn’t expect Google’s next innovation to either.

There’s plenty lively discussion about this in the Enterprise Search Professionals discussion board on LinkedIn.

Fast to Lucene Solr: Choosing a Document Processing Pipeline for Solr

by Karen Lynn

One of the most powerful features of FAST ESP is its flexible document processing engine. The engine that ships with FAST ESP supports multiple document processing pipelines that comprise of multiple document processing stages. A document processing stage performs a document processing task and can add, modify or remove elements from a document before it is passed to the next stage in the pipeline. A simple example of processing stage would be one that processes a document’s URL element, ESP ships with many processing stages and several processing pipelines out of the box for handling both structured and unstructured documents. FAST ESP document processing engine also provides a Python plugin API to allow customers to create custom processing stages of their own, which is a feature we use heavily for our customer ESP installations.

Unfortunately, Solr does not offer the same robust support for document processing pipelines that ESP does. The ESP processing pipeline is document-centric while the Lucene Solr platform is field-centric. When a document is fed to ESP for processing, it is routed to processing stages in a processing pipeline that can access document elements generated by previous processing stages. This allows for complex and optimal operations that can leverage previous processing, such as reuse of a previously generated HTML DOM tree structures. When a document is passed to a Solr update handler, the document is broken up into a set of individual fields. Each field can have a set of processors known as Solr Analysis Filter that can be chained together for field processing before indexing occurs. While this is fine for content that has been heavily processed before being sent to Solr, individual filters lack the same level of access to other documents elements to easily support more complex processing behaviors.

Another difference between ESP and Solr platforms is that ESP’s document processing architecture allows it to be scaled independently from its indexing architecture. ESP’s document processing architecture is fully decoupled from its indexing architecture and is designed out-of-the-box to take advantage of multiple processor cores per machine and multiple document processor machines per cluster. Solr’s out-of-the-box document processing architecture is tightly coupled with its indexing architecture, making it difficult to independently scale Solr’s content processing capacity without adding the complexity and overhead of additional Solr services and Lucene indexes. When we work with multiple terabyte document sets, we find content processing tends to be the biggest bottleneck, so being able to scale content processing ability separately from indexing is mission critical.

If we want to leverage the power that Solr offers, but we need support for a more robust document processing framework, what are our options? There are quite a number of content processing frameworks we can chose from that we discovered during the course of our research. Some of the options currently available include, but are not limited to OpenPipeline OpenPipe, Pypes, UIMA, SMILA , Apache Commons Pipeline, Piped, Behemoth, and Cascading.

Most of these frameworks are written in Java which gives them access to an incredibly broad and diverse spectrum of Java libraries. Since Solr and Lucene are also written in Java, it might make a lot of sense to favor a Java processing framework from scratch, especially if you are more comfortable with Java as a programming language.

Since our clients tend to have highly customized document processing pipelines with many custom FAST ESP Python processing stages, we are heavily biased towards choosing a framework that minimizes the amount of code that would need to be migrated. Many of the available processing frameworks are written in Java, which would be fine if you prefer using Java and don’t have a large amount of currently working Python code to migrate. For our use cases, the decision of which framework to chose was incredibly simple given the option, so we chose Pypes for our migration solution.

For a full report on how we use Pypes for a Document Processing Engine including sample code, sign up for our free FAST to Lucene Solr White Paper here.

For Many Companies, Migration to a New Search Engine is Inevitable

by Karen Lynn

HADLEY, MA– March 12, 2012

In the world of Enterprise Search, everything is changing.  Companies who have been using Microsoft’s internal search engine, FAST Enterprise Search Platform, will be forced to make a change as Microsoft discontinues support for the search platform for companies using Linux as their operating system.  Anticipating the need for a solution, local technology consultants TNR Global is pleased to announce the release of a White Paper for migrating off FAST ESP to a new search engine, Solr.  The paper is titled Bridging the Gap: A Migration Path from Fast ESP to Apache Solr.

This effort began last October when TNR Global presented on the subject of migration from FAST to Solr at the open source conference, Apache Lucene Eurocon in Barcelona, Spain. The paper contains a case study with architecture overview, loading millions of documents into Solr indexes, evaluation and recommendation of tools to bridge the feature gap, migrating custom pipeline code, and the vastly improved ROI after implementation.  “It’s basically a road map for companies looking at options for migration, and we outline Solr as a very good option” said Karen E. Lynn, Director of Business Development.

“We have spent over 9 years working with the FAST ESP product and we understand the nuances of what customers have come to expect from the technology. We’ve identified Solr as a top choice for migrating off FAST as support for the product drops off” said Michael McIntosh, VP of Search Technologies and lead author of the paper. “Solr is an open source technology that has matured and is certainly stable enough for commercial use” said Chris Miles, Senior Software Engineer and contributor to the paper. “We’re excited about this migration option for our customers, and we believe over the long run, it will save them a lot of money and give them greater control over their search engine.”

This heavily anticipated paper will assist companies and organizations in planning their own FAST ESP to Apache Solr migrations and alert them to tools and techniques that can help them achieve a relatively painless process.  Several large blue chip companies have expressed interest in the paper.  “We’ve had a healthy response to the paper” said Lynn.

Internal search engines differ from public search engines like Google or Bing, in that an internal search engine only searches for content inside the company’s firewall.  Google cannot access internal content, therefore companies use search technology to make their content ‘findable.’ “Companies want to keep internal information safe and private.  But they still need to find it” explained Lynn.  “That’s why they need search technology integrated into their organization’s system.”

For more information on search engines, product search, web portals and search engine migration, visit TNR’s main website.  To receive a free copy of the white paper, click here.

TNR Global www.tnrglobal.com, is a systems design and integration company focused on enterprise search and cloud computing solutions for publishing companies, news sites, web directories, academia, enterprise, and SaaS companies. TNR’s past clients include the University of Massachusetts Amherst, Mass Art & Culture, InterNano, Innovara, and the Allegis Group. TNR Global is located at 245 Russell Street, Suite 10 in Hadley, Massachusetts. TNR Global serves clients throughout New England, nationally, and world-wide. Its offices are in Hadley and Greenfield, Massachusetts.

FAST ESP to Lucene Solr Presentation: Open Call for Questions

by Karen Lynn

TNR Global is excited to be participating in the Apache Lucene EuroCon conference in Barcelona.  Our own Michael McIntosh is scheduled to present:  “Enterprise Search: FAST ESP to Lucene Solr” Here is your chance to pre-load the discussion. Before Michael puts the final touches on his talk, he wants to know what issues or questions you may be have.  In the following video, he touches on some of the highlights of his upcoming talk, and asks for your input.

Enterprise Search: FAST ESP to Lucene Solr pre-conferece video - Click to Watch

Enterprise Search: FAST ESP to Lucene Solr pre-conf video

To participate in advance, send you questions or comments to:  fast2solr@tnrglobal.com.  While Michael cannot promise he will include your question or commentary in his actual talk, he will work to address them in an upcoming White Paper, to be released after the conference in November 2011. We look forward to hearing from you!

Migration Still Looms Large on the Horizon for FAST ESP Customers

by Karen Lynn

Microsoft acquired FAST all the way back in 2008 and then in early 2010 disclosed it’s plans to stop updating the FAST product on a Linux operating system after 2010, making FAST ESP 5.3 the latest and greatest, and very last update Linux users will see involving any improvements to the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner than later.

Buzz about migration ensued–an inevitable certainty for many companies, especially ones with huge amounts of data. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the industry, believed that the move from Microsoft was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means, when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, for the FAST ESP technical team with the timing of renewing a license and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy and the shift of HP from hardware to software could signal a recognition from Goliath HP the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs.” says McIntosh. If a need for migration is eminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in house talent, and growth expectations.

How is your company preparing for the discontinuation of support of FAST ESP?  Need guidance?  Contact us for pointers, analysis, or architecture for a full migration.

Migration from FAST ESP to Lucene Solr

by Tamar Schanfeld

Download the presentation and see the video.

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, spoke at the Lucene Revolution conference in Boston, MA October 7-8, 2010. Michael reviewed the migration from Fast ESP to Lucene/Solr open source search. He discussed approaches to identifying core content areas of HTML documents such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, and reviewed specific use cases that we have encountered as search integration experts and discuss available tools.

TNR Global was a sponsor of Lucene Revolution. The conference gathered over 400 professionals from the enterprise search industry. We were happy to see so much interest in Lucene/Solr open source search, and get to know and learn from the folks who have done large scale implementations, including Twitter, LinkedIn, and eHarmony.  Not surprisingly, there was a lot of interest about migration from proprietory search systems to Solr, especially from FAST ESP due to Microsoft’s discontinuing FAST ESP support for Linux.  If you would like to learn more about how a migration from FAST ESP to Lucene Solr can benefit your company, contact us for a free consultation.

Dynamic Fields in Apache Solr

by Jeff Peck

So, you’ve installed a fresh copy of Apache Solr. You have tested it out running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem. The fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for the purpose of being used with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you can manage to make things “fit” with misnamed fields even just for the purpose of experimenting, you also face the problem that their set properties may not be what you would expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are the some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern that is either at the beginning or end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (i.e. article_title_s, article_content_t, posted_date_dt, etc.) and Solr will dynamically create any dynamic field of the particular type with the name that you give it.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started at using Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

by Jeff Peck

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap click here: http://www.sitemaps.org/protocol.php.  You can search the web for other scripts that will automatically make sitemaps out of common CMS’s like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have have a valid sitemap.

We will also assume that you have the default Solr schema.xml installed.

Write the following python script sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break;

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html == None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=body)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and if no errors found will add it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents. They will simply be skipped, you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.