Solr in the Cloud – SolrHQ


Use the power of Solr search for your website with SolrHQ, a hosted Solr search solution that is easy to set up and scales with your growth.  We offer plans to meet the needs of organizations of any size.  SolrHQ runs in the cloud for reliable 24/7 uptime.  We offer free tech support and free setup when you sign up for SolrHQ.

This is a cloud-based service that can be used for searching an online site or a private (enterprise-level) network.  The service can handle anywhere from a few thousand documents or pages into the millions. SolrHQ integrates with the widely used content management systems WordPress and Joomla!.  Our easy-to-deploy plug-in for your CMS can bring truly powerful search to your site, allowing your users to find and discover the content you’ve worked hard to create.

To get started, sign up at SolrHQ for a free account.  The technology is free to anyone who opens an account.  If you need assistance with setup or installation, contact us. A member of our team will schedule a time to review the service, discuss the short setup process, and schedule the SolrHQ installation for your site. Setup is available for a small fee depending upon the size of your data set.  Contact us for more information.

Legal: Lucene is an open source search engine project of the Apache Software Foundation. Per ASF: “Apache Lucene is a high-performance, full-featured text search engine software library written entirely in Java.” Solr is a subproject of the Lucene project that created a fully featured search engine. Per ASF: “Apache Solr is a software product that provides an enterprise search server and services based on Apache Lucene.” Lucene and Solr are trademarks of the Apache Software Foundation.


Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr. You have tested it out by running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem: the fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you manage to make things “fit” with misnamed fields just for the purpose of experimenting, you also face the problem that the properties defined for those fields may not be what you expect.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but not want to explicitly add them to the schema.
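
For reference, an explicit definition in schema.xml looks like the following sketch (the field name is just an illustration):

<field name="article_title" type="string" indexed="true" stored="true"/>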

Fortunately, Solr gives you the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern, with the wildcard at either the beginning or the end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (e.g. article_title_s, article_content_t, posted_date_dt), and Solr will dynamically create a field of the corresponding type with the name you give it.
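
The wildcard can also go at the end of the pattern. A definition like the following sketch (the stock example schema ships a similar attr_* rule, but check your own schema.xml) would match any field name that starts with attr_:

<dynamicField name="attr_*" type="text" indexed="true" stored="true" multiValued="true"/>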

<add>
  <doc>
    <field name="article_title_s">My Article</field>
    <field name="article_content_t">Lorem Ipsum...</field>
    <field name="posted_date_dt">1995-12-31T23:59:59Z</field>
  </doc>
</add>
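
You can also add the same document programmatically. Here is a minimal sketch using the solrpy client (the same library used in the indexing script below); the instance URL is an assumption, so substitute your own:

import solr

# Connect to the Solr instance (URL is an assumption; use your own)
solrInstance = solr.SolrConnection('http://localhost:8983/solr')

# The dynamic-field suffixes mean no schema changes are needed
solrInstance.add(id='article-1',
    article_title_s='My Article',
    article_content_t='Lorem Ipsum...',
    posted_date_dt='1995-12-31T23:59:59Z')
solrInstance.commit()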

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started with Apache Solr with minimal setup.

How to Index a Site with Python Using solrpy and a Sitemap

If you are looking for a fast and easy way to populate a Solr instance using Python, read on.

The script provided here is a basic starting point for building the Solr index for any website with a sitemap, within minutes.  Simply modify the script to use your Solr instance, run it with a path to your valid XML sitemap, and it will begin populating your Solr index.

While you certainly can modify this script to fit your specific needs, you may even find that it satisfies your Solr indexing requirements as-is.

To start, you need to be running Python 2.6 and have the following modules installed:

  • solrpy (a Python client for Solr, imported as solr)
  • BeautifulSoup (used for HTML parsing)

You can install these using easy_install or manually.
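
For example, with easy_install (package names as published on PyPI):

easy_install solrpy
easy_install BeautifulSoup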

You will also require an Apache Solr instance.  (If you are looking for a fully managed solution for hosting your Solr search application with a wide range of services, feel free to contact us.)

Ideally you will use this script on your own sitemap.  For detailed information on how to construct your sitemap, see http://www.sitemaps.org/protocol.php.  You can search the web for scripts that automatically generate sitemaps for common CMSs like WordPress and Joomla, and there are also standalone sitemap generators available. You can also find a valid sitemap for testing here: http://www.google.com/sitemap.xml (~4 MB). We will assume that you have a valid sitemap.
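
A minimal valid sitemap looks like the following sketch (the URL is a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>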

We will also assume that you have the default Solr schema.xml installed.

Write the following Python script, sitemap-indexer.py, replacing the value of solrUrl with the location of your own instance:

#! /usr/bin/env python
""" Index links from a sitemap to a SOLR instance"""

import sys
from BeautifulSoup import BeautifulSoup
import solr
import hashlib
import urllib2
from xml.etree.ElementTree import parse

# How many iterations max?  Enter 0 for no limit.
limit = 0 

# The URL of the solr instance
solrUrl = 'http://localhost:8080/sitemap-indexer-test'

# The xmlns for the sitemap schema
sitemaps_ns = 'http://www.sitemaps.org/schemas/sitemap/0.9'

if len(sys.argv) != 2:
	print 'Usage: ./sitemap-indexer.py path'
	sys.exit(1)

sitemapTree = parse(sys.argv[1])

solrInstance = solr.SolrConnection(solrUrl) # Solr Connection object

counter = 0
numAdded = 0

# Find all of the URLs in the form <url>...<loc>URL</loc>...</url>
for urlElem in sitemapTree.findall('{%s}url/{%s}loc'%(sitemaps_ns,sitemaps_ns)):
	counter = counter + 1 # Increment counter

	if limit > 0 and counter > limit:
		# For testing, if the limit is reached, break
		break

	url = urlElem.text # Get the url text from the element

	try: # Try to get the page at url
		response = urllib2.urlopen(url)
	except:
		print "Error: Cannot get content from URL: "+url
		continue # Cannot get HTML.  Skip.

	try: # Try to parse the HTML of the page
		soup = BeautifulSoup(response.read())
	except:
		print "Error: Cannot parse HTML from URL: "+url
		continue # Cannot parse HTML.  Skip.

	if soup.html is None: # Check if there is an <html> tag
		print "Error: No HTML tag found at URL: "+url
		continue #No <html> tag.  Skip.

	try: # Try to set the title
		title = soup.html.head.title.string.decode("utf-8")
	except:
		print "Error: Could not parse title tag found at URL: "+url
		continue #Could not parse <title> tag.  Skip.

	try: # Try to set the body
		body = str(soup.html.body).decode("utf-8")
	except:
		print "Error: Could not parse body tag found at URL: "+url
		continue # Could not parse <body> tag.  Skip.

	# Note: decode("utf-8") is used above to avoid non-ASCII issues in solrInstance.add below

	# Get an md5 hash of the url for the unique id
	url_md5 = hashlib.md5(url).hexdigest()

	try: # Add the document to the Solr instance
		solrInstance.add(id=url_md5, url_s=url, text=body, title=title)
	except Exception as inst:
		print "Error adding URL: "+url
		print "\tWith Message: "+str(inst)
	else:
		print "Added Page \""+title+"\" with URL "+url
		numAdded = numAdded + 1

try: # Try to commit the additions
	solrInstance.commit()
except:
	print "Could not Commit Changes to Solr Instance - check logs"
else:
	print "Success. "+str(numAdded)+" documents added to index"

Make the script executable and run it:
chmod +x sitemap-indexer.py
./sitemap-indexer.py /path/to/sitemap.xml

It will start to go through the sitemap, parsing the content of each URL and, if no errors are found, adding the page to the Solr index. This process can take several minutes. There may be errors parsing some of the documents; those are simply skipped, and you may have to fine-tune the parser to fit your specific needs.

Once finished, it will output the number of documents that were committed to the Solr index.

You should be able to access your Solr Instance and run queries. There are numerous resources on the web to help you form query strings. There is also a query form in your Solr web admin interface that allows setting the various request parameters.
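
For example, a simple title query against the instance used in this script would look like the following (the select handler and the q/rows parameters are standard Solr; the path comes from solrUrl above):

http://localhost:8080/sitemap-indexer-test/select?q=title:example&rows=10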

If you experience Solr Exceptions, check your Solr logs. If you modified your schema, be sure to reload your Solr instance as this may be the cause of Unrecognized Field Exceptions. You can find the default Solr schema in the example/solr/ directory of a new install of Solr.

If you would like to parse the documents for more specific tags than simply taking the entire body element (as this script does), refer to this documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html.


Lucene Solr Resources

Solr: Solr is a high-performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

Lucene: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java and embedded in Solr. It is suitable for nearly any application that requires full-text search and is designed to be embedded into projects in order to provide full-text search capabilities. Solr has many more features and administration capabilities, including searching structured data without requiring custom code, loading data from CSV files, tolerant parsing of user input, faceted searching, highlighting matched text in results, and retrieving search results in a variety of formats (XML, JSON, …).

Solr Resources:

Lucene Resources:


Lucene Solr

Solr is an open source, high-performance, scalable, cross-platform search engine. Solr has features which were once found only in commercial offerings.


We work with Lucene Solr, a Java-based open source search platform, to power search for web sites.

  • Automatic replication for large installations with distributed search
  • Java API (SolrJ)
  • Conversion of Office documents
  • Full faceted search
  • Advanced tokenization, highlighting, and stemming


With open source solutions, there are no licensing fees, and the technology can be customized to meet your specific needs. Solr powers search for large commercial organizations including eBay, MTV Networks, Netflix, CNET, and Zappos.

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients.

Years of hands-on experience give us an advantage; we understand the subtle nuances of the data.

Lucene Solr Services


Solr is a powerful open source, scalable, cross-platform search engine with high-performance features comparable to proprietary search engines, such as faceted search, full-text search, rich document handling, and dynamic clustering. TNR Global is a regular presenter at Lucene Solr conferences worldwide and an active member of the Solr open source community. Since Solr is open source technology, the source code is free. Contact us to implement and integrate this robust search engine into your organization.

TNR Global offers Lucene Solr consulting and integration services for:

  • Software or SaaS System developers
  • IT/MIS system administrators
  • Corporate data administrators
  • Current Linux-based FAST ESP users
  • Marketing departments of content intensive web sites

Our Services with Lucene Solr:

We integrate commercial-grade Lucene Solr products through our partners at Lucid Imagination.  We also develop tailor-made solutions using Lucene Solr for the following:

  • Crawling web resources: pages and documents, forums, blogs
  • Content processing and conversion
  • Content enhancement and extraction
  • PDF search by page
  • Alternative search for SharePoint, email
  • Search and database integration
  • Audits and upgrades to your current Solr installation

We leverage the power of Lucene Solr combined with the latest content enhancement approaches to provide more diversified search service offerings for clients. Years of hands-on experience give us an advantage; we understand the issues and the subtle nuances of data. Contact us for a free consultation.

Enterprise Search White Papers and Presentations

Below are links to our mini white papers, addressing questions about enterprise search and more.

  • Enterprise Search Basics (PDF, 158 KB)
  • Enterprise Search and Government (PDF, 112 KB)
  • Enterprise Search for Law Firms (PDF, 110 KB)
  • Enterprise Search and E-Discovery (PDF, 109 KB)

Please contact us for additional information.

Migration from FAST ESP to Lucene Solr, by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global

Video from the Lucene Revolution conference in Boston, MA.


Migration from FAST ESP to Lucene Solr (PDF, 4.1 MB)
presented by Michael McIntosh, VP, Enterprise Search Technologies, TNR Global at the Lucene Revolution Conference in Boston, MA.
There are many reasons that an IT department with a large-scale search installation would want to move from a proprietary platform to Lucene Solr. In the case of FAST Search, the company’s purchase by Microsoft and the discontinuation of the Linux platform have created urgency for FAST users.
This presentation compares Lucene/Solr to FAST ESP on a feature basis, and as applied to an enterprise search installation. We explore how various advanced features of commercial enterprise search platforms can be implemented as added functions for Lucene/Solr. Actual cases are presented describing how to map the various functions between systems.

Open Source Search Solutions

TNR Global provides enterprise search implementation services throughout the entire implementation cycle.
We help evaluate different vendor options, audit existing solutions, implement new solutions, upgrade existing solutions, and provide ongoing support for implemented solutions.

We specialize in Lucene Solr development and implementations. We also have experience with other open source search systems: ElasticSearch for Big Data, SearchBlox, Sphinx, Hadoop, HBase, Lemur/Indri, Nutch, SWISH-E, and OpenFTS. Contact us for a free consultation.
