Search Fuels Business Intelligence for Decision Making

“The jungle is dark, but full of diamonds,” said Arthur Miller. The same can be said about the invaluable data inside your business. It’s there, ready to be mined. But unless you have the right tools, you’ll never get to those diamonds.


Content is expanding at an exponential rate. I don’t know anyone in any business who can keep up with the pace of content growth without the use of powerful search engines to find and extract relevant information. Business analysts expect content to grow 800% over the next 5 years. Business intelligence requires extraction of the right information, and most enterprises have both structured and unstructured data. Structured data is easy for most search engines to handle. The rub is in unstructured content, of which there is an abundance: unstructured content is said to account for 70-80% of the data in all organizations. This type of content often takes the form of documents, email messages, health records, HTML pages, books, metadata, audio, video, and various other files. All of these files have to be “cleaned up” before being fed through a search engine in order to get results of any value or relevance.
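To make “cleaned up” concrete, here is a minimal sketch, in Python using only the standard library, of conditioning one common kind of unstructured content, an HTML page, into plain text suitable for indexing. The sample markup and the filtering rules are illustrative, not tied to any particular search engine.

import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def condition(raw_html):
    """Strip markup and collapse whitespace so the text is index-ready."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    text = " ".join(extractor.parts)
    return re.sub(r"\s+", " ", text).strip()

print(condition("<html><body><h1>Q3 Report</h1><script>x()</script><p>Revenue  rose.</p></body></html>"))
# Prints: Q3 Report Revenue rose.

A production pipeline would layer on more conditioning of this sort: character-set normalization, boilerplate removal, and format-specific extractors for the other file types listed above.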


Mining this data is going to be essential for not just the success, but the survival of many businesses. James Kobielus, an analyst at Forrester Research, reports in an interview with ComputerWorld that businesses will increasingly turn to self-service BI throughout 2011 and beyond. “Increasingly, enterprises will adopt new Web-based interactive querying and reporting tools that are designed to put more data analytics capabilities into the hands of end users,” he said. A good search engine that can find data quickly and easily can “take the burden off IT and speed up the development of reports to a considerable degree,” Kobielus said. The information mined by a search engine tuned to the specific business’s needs facilitates better decision making for people at every job function within the enterprise.

“Because every business is a little different, and so many organizations house so much unstructured content, most search engines can’t cover everything that is needed without some customization,” said Michael McIntosh, our VP of Search Technologies at TNR Global. “Data conditioning is vital to unstructured content. Without someone paying attention to filtering out the garbage in unstructured content, you’re not going to get a good search result. The last thing a business needs is its search results working against them.”


“The jungle is dark, but full of diamonds,” said Arthur Miller. Can your search technology find the gems buried inside your own business?


For more information on how data mining and a customized search engine can move your business forward, contact us for a free consultation.


Migration Still Looms Large on the Horizon for FAST ESP Customers

“Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one-to-one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation.” –Michael McIntosh, VP of Search Technologies, TNR Global, LLC

Microsoft acquired FAST back in 2008 and then, in early 2010, disclosed its plans to stop updating the FAST product on the Linux operating system after 2010, making FAST ESP 5.3 the latest, greatest, and very last update Linux users will see involving any improvements to the proprietary search platform. It was clear to anyone on Linux that a migration would need to occur, and as content grows, depending upon the size of your organization, that migration should probably happen sooner rather than later.

Buzz about migration ensued; it is an inevitability for many companies, especially ones with huge amounts of data. But how many companies have jumped in with both feet? I had the opportunity to speak with an open source search engine expert who, along with the industry, believed that the move by Microsoft was a windfall for anyone in the business of enterprise search design and implementation. However, she admitted, “we haven’t seen as large a response as we expected.”

This isn’t exactly surprising to everyone. “It’s coming,” says our VP of Search Technologies, Michael McIntosh. “Corporations have an enormous investment in FAST ESP and it makes sense that they would be reluctant to move to something new until they absolutely have to.” That means when their licenses expire.

“They will likely weigh the performance and support, or lack thereof, from the FAST ESP technical team against the timing of renewing a license, and wait until they absolutely have to change to something else,” says McIntosh.

The purchase of Autonomy, and HP’s shift from hardware to software, could signal a recognition from Goliath HP of the kind of growth opportunity enterprise search software offers, and that the “great shift” from FAST ESP to another search platform is very much on the horizon.

But as the clock continues to tick, companies using FAST ESP should be strategizing for migration now. “It’s an enormous undertaking to migrate an entire search solution from FAST to another platform. Designing a non-trivial search solution to fully meet your needs from scratch is hard enough on its own. If you are migrating an existing solution, it is very unlikely that you will find a one-to-one mapping of all of the features in a new search engine that you have come to depend upon with your existing implementation. Solving challenging issues like that requires both creativity and expertise to address your needs,” says McIntosh. If the need for migration is imminent, there will be a real need for expertise in the field of enterprise search on both proprietary and open source platforms, depending upon several factors like size, in-house talent, and growth expectations.

How is your company preparing for the discontinuation of support for FAST ESP? Need guidance? Contact us for pointers, analysis, or architecture for a full migration.

Continuous Integration for Large Search Solutions

Managing large projects takes a smart approach and some intuitive thinking. One project we are currently engaged in is with a large publisher of manufacturing parts. This has been an extraordinary project due to its scale and ever-changing scope. I spoke with our VP of Enterprise Search Technologies, Michael McIntosh, about how TNR Global handles complex projects.

Karen: This project is a big one. Tell me more about the site’s function. What is the focus?

Michael: Product search is the focus. The site contains tens of millions of documents, both structured and unstructured content. They also have a huge amount of data provided by the advertisers and the companies themselves on the products that they sell. One of the advantages we have over a search engine like Google is access to a vast amount of proprietary data provided by the vendors themselves.

Karen: Tell me about how you are managing the project.  What are some of the variables you work with?

Michael: With this particular project, we are dealing with many different data feeds. There are many different intermediary metadata stages we have to generate to support the final searchable content. The client also changes their business logic frequently enough that if it takes a month or more between data builds, it’s likely something has changed. For instance, they might have changed an XML format or added an attribute to an element in the data feed that will break something else down the line. The problem is there are so many moving parts that it’s almost impossible to do it all manually and always do it correctly.
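As an illustration of the kind of early sanity check that catches such feed changes, here is a short, hypothetical Python script; the element name, required attributes, and feed format are invented for the example, not taken from the client’s actual feeds.

import sys
import xml.etree.ElementTree as ET

# The attribute contract the downstream build stages expect (assumed).
REQUIRED_ATTRS = {"id", "category", "advertiser"}

def validate_feed(path):
    """Return a list of problems found in one XML data feed."""
    problems = []
    tree = ET.parse(path)
    for i, product in enumerate(tree.getroot().iter("product")):
        missing = REQUIRED_ATTRS - set(product.attrib)
        if missing:
            problems.append("product #%d missing attributes: %s" % (i, ", ".join(sorted(missing))))
    return problems

if __name__ == "__main__":
    issues = validate_feed(sys.argv[1])
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)  # a nonzero exit fails the data build early

Run against each incoming feed before a build starts, a check like this turns a silent downstream breakage into an immediate, attributable failure.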

Karen: What other kinds of business logic changes are you dealing with on top of the massive amounts of raw data?

Michael: Most of the business logic changes are when they need to modify how something behaves based on new data that’s available, or when they need to start treating the data in a different way.  Sometimes there is a change in the way they want the overall system to behave. They sometimes have some classification rules for content they like to tweak occasionally.

Another thing we consider is the client’s relevancy scoring and query pre-processing rules. You need to consider what happens if you issue a query and it fails. What kind of fallback query do you use? All these things are part of the business logic that is independent of the raw data. In summary, we have the raw data and we can do a number of things with it. They often want us to change exactly what we’re doing with it, how we’re conditioning it, and how we’re transforming it. We either tweak what exists or take advantage of new data that they’ve started including in their data feeds. The challenge is that all these elements can change frequently.
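To illustrate the fallback idea, here is a minimal sketch assuming a Solr-style /select endpoint that returns JSON; the host, field names, and relaxation strategy are all invented for the example, and a different engine would differ in the details.

import requests

SEARCH_URL = "http://localhost:8983/solr/products/select"  # assumed endpoint

def search(q):
    """Issue one query and return the matching documents."""
    resp = requests.get(SEARCH_URL, params={"q": q, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def search_with_fallback(user_query):
    # First try the strict, pre-processed form: an exact phrase match.
    docs = search('name_t:"%s"' % user_query)
    if docs:
        return docs
    # The strict query found nothing, so relax to an OR of the terms.
    relaxed = " OR ".join("name_t:%s" % term for term in user_query.split())
    return search(relaxed)

results = search_with_fallback("stainless steel fastener")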

Karen: This site is more of a portal than strictly an enterprise search project, isn’t it?

Michael: Yes. Enterprise search usually refers to searching for documents within an organization. This client is a public facing search engine that allows the public to perform product search across a very large number of vendors and service providers.

Changes come from their advertisers and the data they provide. Advertisers come and go. People pay for placement within certain industrial categories. It’s not like we get a static list of sites to crawl and that’s that; the list of sites we crawl changes weekly, sometimes daily. Things also need to be purged from the index. Say an advertiser’s contract ends and suddenly we need to stop crawling a site with thousands of documents; that data needs to be purged from the index promptly. So not only do we have to crawl new sites, but we have to purge old ones as well. This project is so massive that it’s not cut and dried. A lot of software development projects focus on a clear-cut problem: come up with a plan, tackle it, release it, and then maintain it. We’re constantly getting new information and learning new things about the people hitting the site.
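For a concrete picture of prompt purging, here is a sketch assuming a Solr-backed index in which each document carries a site_s field identifying its source site; the field name and update endpoint are illustrative, not the client’s actual setup.

import requests

SOLR_UPDATE = "http://localhost:8983/solr/products/update"  # assumed endpoint

def purge_site(site):
    """Delete every document crawled from one site and commit immediately."""
    resp = requests.post(
        SOLR_UPDATE,
        json={"delete": {"query": 'site_s:"%s"' % site}},
        params={"commit": "true"},
    )
    resp.raise_for_status()

# When an advertiser's contract ends, one call removes their documents.
purge_site("expired-advertiser.example.com")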

Karen: So it sounds like this project is always in a state of ongoing development.

Michael: We are building something that’s never been built before. One of the goals is to make this site remarkable. And we’re very excited to be a part of that. The scale of the project is quite big though, which is why we started using Continuous Integration.

The way our cycles work is that we perform big data updates, but by using CI, we can continuously update and integrate new data. We’re moving to a place where, by using the practice of CI, we can perform daily builds, which gives us the time we need to fix problems before we absolutely need the data to be live.

Karen: How do you implement CI into your day to day management of the project?

Michael: There are some pretty great open source tools that we’re using to implement CI. We use Jenkins to help us do Continuous Integration for frequent data builds, which is an intensive process for this particular client.

We field questions from the client about the status of different data builds. We hope to use Jenkins in conjunction with other tools to automatically build data and have event-based data builds. We’re looking at a way to have builds triggered by some other event and have Jenkins automatically generate reports as the data is being built. Each time we run a build script, if the output differs from the previous build, Jenkins makes it easy for you to see that something is different. There is a way to modify your output so that Jenkins can understand it. One of the cool things about Jenkins is that it has graphs that illustrate differences, which helps us identify issues that could pose a potential problem and fix them before we need to go live with the data.
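As one example of the kind of between-build comparison described here, the following hypothetical Python script diffs a summary of the current data build against the previous one, so that a sudden swing (say, a document count collapsing) is flagged before the data goes live. The manifest format is invented; Jenkins would simply run the script and treat a nonzero exit as a failed build.

import json
import sys

def load_counts(path):
    """Load a build manifest, e.g. {"products": 41200312, "vendors": 58211}."""
    with open(path) as f:
        return json.load(f)

def diff_report(prev, curr, tolerance=0.10):
    """Flag any count that moved more than `tolerance` between builds."""
    alerts = []
    for key in sorted(set(prev) | set(curr)):
        before, after = prev.get(key, 0), curr.get(key, 0)
        if before and abs(after - before) / float(before) > tolerance:
            alerts.append("%s changed %d -> %d" % (key, before, after))
    return alerts

if __name__ == "__main__":
    alerts = diff_report(load_counts(sys.argv[1]), load_counts(sys.argv[2]))
    print("\n".join(alerts) or "build output within tolerance")
    sys.exit(1 if alerts else 0)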

Karen: Any other tools?

Michael: For multi-node search clusters, we’re using a tool called fabric3 that uses SSH to copy data and execute scripts across multiple nodes of a cluster based upon roles. We have a clever setup where we’re able to inform fabric3 which services are running on each node in our cluster and have actions or commands linked to certain tasks, like building metadata. By linking them, the tasks automatically know which nodes to deploy data to.
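For readers unfamiliar with fabric3, here is a minimal fabfile sketch of that role-based pattern, using the Fabric 1.x API that the fabric3 fork preserves; the hostnames, role names, and script paths are invented for illustration.

# fabfile.py
from fabric.api import env, put, roles, run

# Tell fabric3 which services run on which nodes of the cluster.
env.roledefs = {
    "metadata": ["meta1.example.com"],
    "index":    ["index1.example.com", "index2.example.com"],
}

@roles("metadata")
def build_metadata():
    run("/opt/build/make_metadata.sh")  # runs only on the metadata node

@roles("index")
def deploy_data():
    put("build/latest.tar.gz", "/var/search/incoming/")  # copy over SSH
    run("/opt/search/load_data.sh /var/search/incoming/latest.tar.gz")

Invoking "fab build_metadata deploy_data" then executes each task on every host in its role, so the linkage between tasks and nodes lives in one place.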

Using open source tools like Jenkins and fabric3 makes the large number of moving parts a lot more manageable. It’s allowed us to be successful in building this incredible site and keeping the search function relevant, accurate, and up to date.

Dynamic Fields in Apache Solr

So, you’ve installed a fresh copy of <a target="_blank" href="http://lucene.apache.org/solr/">Apache Solr</a>. You have tested it out by running the examples from the <a href="http://lucene.apache.org/solr/tutorial.html">Solr tutorial</a>. And now you are ready to start indexing some of your own data. Just one problem: the fields for your own data are not recognized by your Solr instance. You notice in the schema.xml file that the default fields have names like cat, weight, subject, includes, author, title, payloads, popularity, price, etc. These fields are all defined for use with the sample data provided with Solr. Most of their names are likely not relevant to your search project, and even if you are willing to put up with misnamed fields, at least for experimenting with your instance, you also face the problem that their properties may not be what you expect them to be.

Of course you can modify the schema.xml file and apply strong data-typing to each field that you plan to use to fit the exact needs of your project, reload Solr, and then start to index your data. But if you are just getting started with Solr, or starting a new project and experimenting with adding your dataset, you may not know exactly what fields you need to define or what properties to define for them. Or you might be interested in updating an existing index with some additional fields, but not want to explicitly add them to the schema.

Fortunately, Solr gives you the option to define dynamic fields: fields that are defined in the schema with a glob-like pattern, with the wildcard at either the beginning or the end of the name. Further, the default schema pre-defines dynamic fields for most of the common data types you may use. Here are some of the dynamic fields that are defined in the default schema.xml:

<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
<dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
<dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
<dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
<dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
<dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
<dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

With the above dynamic fields, you can index data with field names that begin with any valid string and end in one of the suffixes in the name attributes (e.g. article_title_s, article_content_t, posted_date_dt, etc.), and Solr will dynamically create a field of the corresponding type with the name that you give it:

<add>
<doc>
<field name="article_title_s">My Article</field>
<field name="article_content_t">Lorem Ipsum...</field>
<field name="posted_date_dt">1995-12-31T23:59:59Z</field>
</doc>
</add>
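For reference, here is one way to post a document like the one above to a Solr instance from Python; the URL and file name are illustrative, and curl works just as well.

import requests

# Send the XML update message and commit, so the new document (and its
# dynamically created fields) becomes searchable immediately.
with open("doc.xml") as f:
    requests.post(
        "http://localhost:8983/solr/update",
        data=f.read(),
        params={"commit": "true"},
        headers={"Content-Type": "text/xml"},
    ).raise_for_status()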

After you’ve indexed some data, you can view the dynamically created fields in the schema viewer for your instance, located at http://YOUR-INSTANCE/admin/schema.jsp

Using dynamic fields is a great way to get started with Apache Solr with minimal setup.

Sidekick Danger – It’s not the cloud, it’s the approach

I recently saw the headline, “T-Mobile and Microsoft/Danger data loss is bad for the cloud,” and, as an admin who works with cloud technology on a daily basis, viewed it with some concern. However, after reading the article itself, my only thought was, “What does this have to do with the cloud?” Reading through, we find that Microsoft/Danger stores your phone data (contacts, photos, etc.) on its servers, and that the phone needs to be in constant contact with those servers in order to maintain service and data. Unfortunately, the servers crashed, and all of the data was lost. Turn off your phone, lose all your data. Yet this is exactly what the Sidekick service promises to protect you from, and it failed.

The problem with blaming this on the “cloud” is that, while technically your cell phone and the Microsoft/Danger servers form a “cloud,” the failure lies with the servers and those who administer them. It doesn’t matter whether those servers are virtual or physical: if there is no disaster recovery plan in place, and if that plan has not been tested, data will be lost. Your data. This is not a shortcoming of cloud computing; it is a result of depending on others to maintain your data. It should also make us cautious about depending on external providers over the network to always be available. Services stop. Power fails. Disks die. Routing interruptions happen.

This is just network computing. But if the people (or companies) behind it all don’t do their own due diligence, disasters like this one, and worse, will continue to happen.