Posts Tagged ‘Solr’

Search System Audit

by Karen Lynn

Selecting a search platform is a big decision. You may already have a solution in place that is ineffective and you need a change. But what’s not working, and why? Which search engine is best for my needs? We perform an audit and full evaluation of your current search architecture and report our recommendations for optimizing what you already have, and list attributes of alternative technologies like Solr, LucidWorks Enterprise, ElasticSearch, SearchBlox or other solutions. Our full evaluation includes:

  1. Audit of existing enterprise search solution and evaluation of available options for improvement.
  2. Report on current performance, areas of targeted improvement, and recommendations
  3. Migration Plan. At the conclusion of the audit engagement, we provide a step by step plan on what needs to be done to migrate to a search solutions that supports your business goals and needs

Our audits are conducted by one of our senior software developers and takes place over a 3 day period. We will interviewing relevant staff members to gain valuable insights to pain points on performance. Our developer will access your back-end systems and review the following areas:

  • Review all source code, if available and applicable
  • Crawling web resources: pages and documents, forums, blogs Content processing and conversion
  • Content enhancement and extraction
  • PDF search by page
  • Search and database integration architecture design, scaling, performance, tuning, hardware requirements, and related expertise
  • Related searches, database indexing, vertical search engines, custom scoring/ranking, etc.
  • Performance troubleshooting (slow queries, high memory usage)
  • Database and file-system indexing, web crawling and searching
  • Document parsing and information extraction
  • Large-scale indexing and searching, distributed search
  • Crawling web resources: CMS based sites, blogs, forums, static html, PDFs, docs
  • Content processing and conversion
  • Content enhancement and extraction
  • PDF search results by page, mutli-page indexing, and multi-page image caching
  • Search and database integration

After this extensive review, we can better advise on which search platform to move your systems. Your business needs are weighed heavily in this evaluation, because after all, your search system is a core technology that enables knowledge management and data extraction for business decision making. We conduct search architecture audits for web portals, online directories, online databases, for the enterprise, and for virtually any web based application. Audits are the first step in making the right decision about a search platform. For more information on how a search audit can assist you in making a smarter technology decision for your organization, contact us.

FAST ESP to Lucene Solr Presentation: Open Call for Questions

by Karen Lynn

TNR Global is excited to be participating in the Apache Lucene EuroCon conference in Barcelona.  Our own Michael McIntosh is scheduled to present:  “Enterprise Search: FAST ESP to Lucene Solr” Here is your chance to pre-load the discussion. Before Michael puts the final touches on his talk, he wants to know what issues or questions you may be have.  In the following video, he touches on some of the highlights of his upcoming talk, and asks for your input.

Enterprise Search: FAST ESP to Lucene Solr pre-conferece video - Click to Watch

Enterprise Search: FAST ESP to Lucene Solr pre-conf video

To participate in advance, send you questions or comments to:  fast2solr@tnrglobal.com.  While Michael cannot promise he will include your question or commentary in his actual talk, he will work to address them in an upcoming White Paper, to be released after the conference in November 2011. We look forward to hearing from you!

Migration Still Looms Large on the Horizon for FAST ESP Customers

by Karen Lynn

Microsoft acquired FAST all the way back in 2008 and then in early 2010 disclosed it’s plans to stop updating the FAST product on a Linux operating system after 2010, making FAST ESP 5.3 the latest and greatest, and very last update Linux users will see involving any improvements to the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner than later.

Buzz about migration ensued–an inevitable certainty for many companies, especially ones with huge amounts of data. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the industry, believed that the move from Microsoft was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means, when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, for the FAST ESP technical team with the timing of renewing a license and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy and the shift of HP from hardware to software could signal a recognition from Goliath HP the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one to one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs.” says McIntosh. If a need for migration is eminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in house talent, and growth expectations.

How is your company preparing for the discontinuation of support of FAST ESP?  Need guidance?  Contact us for pointers, analysis, or architecture for a full migration.

Migration from FAST ESP to Lucene Solr

by Tamar Schanfeld

Download the presentation and see the video.

Michael McIntosh, Vice President of Enterprise Search Technologies at TNR, spoke at the Lucene Revolution conference in Boston, MA October 7-8, 2010. Michael reviewed the migration from Fast ESP to Lucene/Solr open source search. He discussed approaches to identifying core content areas of HTML documents such as Text-To-Tag Ratio Heuristics and Page Stereotype/Site Template Analysis, and reviewed specific use cases that we have encountered as search integration experts and discuss available tools.

TNR Global was a sponsor of Lucene Revolution. The conference gathered over 400 professionals from the enterprise search industry. We were happy to see so much interest in Lucene/Solr open source search, and get to know and learn from the folks who have done large scale implementations, including Twitter, LinkedIn, and eHarmony.  Not surprisingly, there was a lot of interest about migration from proprietory search systems to Solr, especially from FAST ESP due to Microsoft’s discontinuing FAST ESP support for Linux.  If you would like to learn more about how a migration from FAST ESP to Lucene Solr can benefit your company, contact us for a free consultation.

Dynamic Fields in Apache Solr

by Jeff Peck

So, you’ve installed a fresh copy of Apache Solr. You have tested it out running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem. The fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for the purpose of being used with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you can manage to make things “fit” with misnamed fields even just for the purpose of experimenting, you also face the problem that their set properties may not be what you would expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are the some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern that is either at the beginning or end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (i.e. article_title_s, article_content_t, posted_date_dt, etc.) and Solr will dynamically create any dynamic field of the particular type with the name that you give it.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started at using Apache Solr with minimal setup.  If you need assistance with setting up or integrating Solr, contact us.  We can step in at any point in a project to offer assistance or perform and entire upgrade to Solr.

How to Index a Site with Python Using solrpy and a Sitemap

by Jeff Peck

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point to building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance and run with a path to your valid XML sitemap and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that this script satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

You can install these using easy_install or manually.

You will also require an Apache Solr instance.  (If you are looking for fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap click here: http://www.sitemaps.org/protocol.php.  You can search the web for other scripts that will automatically make sitemaps out of common CMS’s like WordPress and Joomla.  There are also sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4Mb). We will assume that you have have a valid sitemap.

We will also assume that you have the default Solr schema.xml installed.

Write the following python script sitemap-indexer.py, replacing the value for solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break;

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html == None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue #Could not parse <body> tag.  Skip.

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add to the Solr instance
		solrInstance.add(id=url_md5,url_s=url,text=body,title=body)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and if no errors found will add it to the Solr index. This process can take several minutes. There may be errors parsing many of the documents. They will simply be skipped, you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you need assistance with any part of the Solr implementation, feel free to contact us.  We can get you started, step in and assist, or perform and entire integration of a Solr solution.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.