Crawling Solr

“We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing a ESP to Solr migration.”

Recently there has been a lively discussion on Linked In’s Enterprise Search Engine Professionals Group started with this question:


“Is it an handicap for Solr to depend on third party solutions for crawling the Web like Nutch?


Our own Michael McIntosh felt compelled to respond. What follows is his post to this topic in it’s entirety.


“This topic makes me think of the saying “Write programs that do one thing and do it well.” The longer version of this philosophy, as expressed by Doug McIlroy, is this: “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” Solr stands very well on its own and, based upon my impression of the Solr community so far, more people currently use Solr for structured content vs unstructured like web documents. I think that Solr should have some ‘out of the box’ web crawler implementation available, but it should not be the core focus. It can serve to allow new users of Solr to focus more on the Solr/Lucene side of things and not have to worry about rolling their own crawler or figuring out which is the best third-party crawling solution to use. I suspect that many people who need to do crawling can get by with a fairly basic crawler. My impression of Nutch so far is that is more complicated than most Solr users need out of the starting gate. That said, if you have a business that deals with large amounts of crawled unstructured content, its very likely they will need something more robust than you can reasonably ship & support as part of the Solr project. For one of our clients, the size of our dataset has grown from needed just a couple boxes, to multiple clusters with many machines each. One of the newest developments is the growth of the amount of unstructured content has grown to a size where we now need a crawler CLUSTER. When we first started on this, it never occurred to us that we might need multiple machines for the crawling side of the equation, but it has happened. But I think our case its less common. All in all, I think Solr should have a bare-bones reference implementation of a crawler that can easily be expanded upon, but it is probably not an effective use of effort to Solr developers to focus on the crawling side. Let a third party focus on the issues of crawling, it is a deceptively complicated issue.”


After his post I caught him in the office and asked where he was going with this line of thinking. “We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing a ESP to Solr migration.” He revealed. Sounds like a very promising solution to a fairly big, and common problem for companies with vast amounts of metadata. And as for unstructured content? Well, it’s the proverbial elephant in the room, don’t you think?


To see the entire conversation, with contributions from experts in the field of search architecture, click here. To get in touch with Michael directly to discuss your architecture and crawling needs, contact us.

Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of Apache Solr. You have tested it out running the examples from the Solr tutorial. And now you are ready to start indexing some of your own data. Just one problem. The fields for your data are not recognized by the default Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are defined for the purpose of being used with the sample data provided with Solr. Most of their names are likely not relevant to your dataset, and even if you can manage to make things “fit” with misnamed fields even just for the purpose of experimenting, you also face the problem that their set properties may not be what you would expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding to your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested updating an existing index with some additional fields, but do not want to explicitly add them to the schema.

Fortunately, Solr gives the option to define dynamic fields. Further, there are pre-defined dynamic fields for many of the common data-types in the default schema. Here are the some of the dynamic fields that are found in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

The field names are defined with a glob-like pattern that is either at the beginning or end of the name. With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (i.e. article_title_s, article_content_t, posted_date_dt, etc.) and Solr will dynamically create any dynamic field of the particular type with the name that you give it.

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>

After you’ve indexed some data, you can actually view the dynamic field names in the schema viewer, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started at using Apache Solr with minimal setup.