Solr in the Cloud – SolrHQ


Use the power of Solr search for your website with SolrHQ, a hosted Solr search solution that is easy to set up and scales with your growth.  We offer plans to meet the needs of organizations of any size.  SolrHQ runs in the cloud for reliable 24/7 uptime.  We offer free tech support and free setup when you sign up for SolrHQ.

This is a cloud-based service that can be used for searching an online site or a private (enterprise-level) network.  The service can handle anywhere from a few thousand documents or pages into the millions. SolrHQ integrates with the widely used content management systems WordPress and Joomla!.  Our easy-to-deploy plug-in for your CMS can bring truly powerful search to your site, allowing your users to find and discover the content you’ve worked hard to create.

To get started, sign up at SolrHQ for a free account.  The technology is free to anyone who opens an account.  If you need assistance with setup or installation, contact us. A member of our team will schedule a time to review the service, discuss the short setup process, and schedule the SolrHQ installation for your site. Setup is available for a small fee depending upon the size of your data set.  Contact us for more information.

Legal: Lucene is an open source search engine project of the Apache Software Foundation. Per ASF: “Apache Lucene is a high-performance, full-featured text search engine software library written entirely in Java.” Solr is a subproject of the Lucene project that created a fully featured search engine. Per ASF: “Apache Solr is a software product that provides an enterprise search server and services based on Apache Lucene.” Lucene and Solr are trademarks of the Apache Software Foundation.


Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr. You have tested it out by running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem: the fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you manage to make things “fit” with misnamed fields just for the purpose of experimenting, you also face the problem that the properties defined for those fields may not be what you expect.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but not want to explicitly add them to the schema.
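
For reference, an explicit definition in schema.xml looks like the following sketch (the field name is just an illustration):

<field name="article_title" type="string" indexed="true" stored="true"/>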

Fortunately, Solr gives you the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern, with the wildcard at either the beginning or the end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (e.g. article_title_s, article_content_t, posted_date_dt), and Solr will dynamically create a field of the corresponding type with the name you give it.
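
The wildcard can also go at the end of the pattern. A definition like the following sketch (the stock example schema ships a similar attr_* rule, but check your own schema.xml) would match any field name that starts with attr_:

<dynamicField name="attr_*" type="text" indexed="true" stored="true" multiValued="true"/>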

<add>
  <doc>
    <field name="article_title_s">My Article</field>
    <field name="article_content_t">Lorem Ipsum...</field>
    <field name="posted_date_dt">1995-12-31T23:59:59Z</field>
  </doc>
</add>
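
You can also add the same document programmatically. Here is a minimal sketch using the solrpy client (the same library used in the indexing script below); the instance URL is an assumption, so substitute your own:

import solr

# Connect to the Solr instance (URL is an assumption; use your own)
solrInstance = solr.SolrConnection('http://localhost:8983/solr')

# The dynamic-field suffixes mean no schema changes are needed
solrInstance.add(id='article-1',
    article_title_s='My Article',
    article_content_t='Lorem Ipsum...',
    posted_date_dt='1995-12-31T23:59:59Z')
solrInstance.commit()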

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started with Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point for building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance, run it with a path to your valid XML sitemap, and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that it satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

  • solrpy (a Python client for Solr, imported as solr)
  • BeautifulSoup (used for HTML parsing)

You can install these using easy_install or manually.
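
For example, with easy_install (package names as published on PyPI):

easy_install solrpy
easy_install BeautifulSoup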

You will also require an Apache Solr instance.  (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap, see http://www.sitemaps.org/protocol.php.  You can search the web for scripts that automatically generate sitemaps for common CMSs like WordPress and Joomla, and there are also standalone sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4 MB). We will assume that you have a valid sitemap.
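
A minimal valid sitemap looks like the following sketch (the URL is a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>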

We will also assume that you have the default Solr schema.xml installed.

Write the following Python script, sitemap-indexer.py, replacing the value of solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html is None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue # Could not parse <body> tag.  Skip.

	# Note: decode("utf-8") is used above to avoid non-ASCII issues in solrInstance.add below

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add the document to the Solr instance
		solrInstance.add(id=url_md5, url_s=url, text=body, title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
chmod +x sitemap-indexer.py
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and, if no errors are found, adding the page to the Solr index. This process can take several minutes. There may be errors parsing some of the documents; those are simply skipped, and you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.
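
For example, a simple title query against the instance used in this script would look like the following (the select handler and the q/rows parameters are standard Solr; the path comes from solrUrl above):

http://localhost:8080/sitemap-indexer-test/select?q=title:example&rows=10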

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.


Lucene Solr Resources

Solr: Solr is a high-performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

Lucene: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java and embedded in Solr. It is suitable for nearly any application that requires full-text search and is designed to be embedded into projects in order to provide full-text search capabilities. Solr has many more features and administration capabilities, including searching structured data without requiring custom code, loading data from CSV files, tolerant parsing of user input, faceted searching, highlighting matched text in results, and retrieving search results in a variety of formats (XML, JSON, …).

Solr Resources:

Lucene Resources:


Lucene Solr

Solr is an open source, high-performance, scalable, cross-platform search engine. Solr has features which were once found only in commercial offerings.


We work with Lucene Solr, a Java-based open source search platform, to power search for web sites.

  • Automatic replication for large installations with distributed search
  • Java API (SolrJ)
  • Conversion of Office documents
  • Full faceted search
  • Advanced tokenization, highlighting, and stemming


With open source solutions, there are no licensing fees, and the technology can be customized to meet your specific needs. Solr powers search for large commercial organizations including eBay, MTV Networks, Netflix, CNET, and Zappos.

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients.

Years of hands-on experience give us an advantage; we understand the subtle nuances of the data.

Lucene Solr Services


Solr is a powerful open source, scalable, cross-platform search engine with high-performance features comparable to proprietary search engines, such as faceted search, full-text search, rich document handling, and dynamic clustering. TNR Global is a regular presenter at Lucene Solr conferences worldwide and an active member of the Solr open source community. Since Solr is open source technology, the source code is free. Contact us to implement and integrate this robust search engine into your organization.

TNR Global offers Lucene Solr consulting and integration services for:

  • Software or SaaS System developers
  • IT/MIS system administrators
  • Corporate data administrators
  • Current Linux-based FAST ESP users
  • Marketing departments of content intensive web sites

Our Services with Lucene Solr:

We integrate commercial-grade Lucene Solr products through our partners at Lucid Imagination.  We also develop tailor-made solutions using Lucene Solr for the following:

  • Crawling web resources: pages and documents, forums, blogs
  • Content processing and conversion
  • Content enhancement and extraction
  • PDF search by page
  • Alternative search for SharePoint, email
  • Search and database integration
  • Audits and upgrades to your current Solr installation

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients. Years of hands-on experience give us an advantage; we understand the issues and the subtle nuances of data. Contact us for a free consultation.

Enterprise Search White Papers and Presentations

Below are links to our mini white papers, addressing questions about enterprise search and more.

  • Enterprise Search Basics (PDF, 158 KB)
  • Enterprise Search and Government (PDF, 112 KB)
  • Enterprise Search for Law Firms (PDF, 110 KB)
  • Enterprise Search and E-Discovery (PDF, 109 KB)

Please contact us for additional information.

Migration from FAST ESP to Lucene Solr, by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global

Video from the Lucene Revolution conference in Boston, MA.


Migration from FAST ESP to Lucene Solr (PDF, 4.1 MB)
presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global at the Lucene Revolution Conference in Boston, MA.
There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and the discontinuation of the Linux platform have created urgency for FAST users.
This presentation compares Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases are presented describing how to map the various functions between systems.

Open Source Search Solutions

TNR Global provides enterprise search implementation services throughout the entire implementation cycle.
We help evaluate different vendor options, audit existing solutions, implement new solutions, upgrade existing solutions, and provide ongoing support for implemented solutions.

We specialize in Lucene Solr development and implementations. We also have experience with other open source search systems: ElasticSearch for Big Data, SearchBlox, Sphinx, Hadoop, HBase, Lemur/Indri, Nutch, SWISH-E, and OpenFTS. Contact us for a free consultation.
