Search and Steel Girders

“Search by itself may look like a simple box, but behind the box is a foundry of girders, cross beams, and structural support that allows you to find what you need.”

“Search ties people together…”

This was one of the many themes at the Enterprise Search Summit in Washington, DC last week. It seems like a fairly obvious statement, but search quickly becomes part of the landscape, taken for granted even though the landscape couldn’t function without it. I have compared the search function to the steel girders of a skyscraper. When you walk into the building, you aren’t thinking about the beams holding the building up or connecting floors, but without them, you wouldn’t have a building at all (you couldn’t even find the lobby). Other metaphors overheard include oxygen (invisible yet essential), sunlight (lest we remain in the dark), and electricity (everything stops without it).

Attendees of the conference know how important search is to companies, but increasingly, companies are taking search for granted. There is a fundamental gap in communicating the importance and difficulty of implementing a good search platform.

Companies that need search to run on their website or intranet expect it to work as it does on the Internet, but this is an apples-and-oranges comparison.

Here are the main disconnects:

  1. Search is easy
  2. Search is cheap
  3. It never has to be touched again

People expect search inside the firewall to function much like Google does outside the firewall. Google exists for end users and is really, really incredible. It geolocates, it auto-completes, it uses your browsing history to provide more relevant results. And you made no financial investment in this really lovely, elegant, useful tool, which doesn’t just assist your Internet experience but facilitates it. Behind the firewall, though, things are different. Let me explain.

  • Your business content isn’t publicly available or known. I mean, that would be bad, right? It’s behind the firewall for a reason. So keeping it there yet allowing your staff to access certain levels of information takes some architecture and planning.
  • Google has thousands of developers working on this beautiful, incredible technology every day. They finance this with ad revenue. How many people do you have on your search team? And how much of their day do they really spend on search? What department is being billed for it? Business leaders need to embrace this as a necessary cost of doing business and budget accordingly, or face the crippling result of staff and customers not being able to find the information they need.

  • 80% of your content is unstructured, meaning search engines can’t really read it until some love and care is put into cleaning the data. This is a vital yet time-intensive process. Our VP of Search Technologies, Michael McIntosh, says, “We spend about 90% of our time on the document processing pipeline, conditioning data to be fed into the engine.” Moreover, unstructured data isn’t a set number. It’s being created by your entire enterprise faster than you can blink. Processing it is never a done deal.
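That conditioning work can be sketched in miniature. Below is a hypothetical, minimal cleaning stage in Python; the function name and sample document are my own illustration, not TNR’s actual pipeline. It strips markup, collapses whitespace, and hands the engine plain text:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def condition(raw_html: str) -> str:
    """One illustrative stage of a document-processing pipeline:
    strip markup and collapse whitespace before indexing."""
    parser = TextExtractor()
    parser.feed(raw_html)
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()

doc = "<html><script>var x=1;</script><body><h1>Annual  Report</h1><p>Revenue grew.</p></body></html>"
print(condition(doc))  # "Annual Report Revenue grew."
```

A real pipeline adds many more stages (encoding repair, boilerplate removal, language detection, field mapping), which is where the 90% of the time goes.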


So if search connects us, hopefully this finds you thinking about search in more realistic terms. Search by itself may look like a simple box, but behind the box is a foundry of girders, cross beams, and structural support that allows you to find what you need to “make money outside the firewall or save money inside the firewall.”

Search Fuels Business Intelligence for Decision Making

“The jungle is dark, but full of diamonds,” said Arthur Miller. The same can be said about the invaluable data inside your business. It’s there, ready to be mined. But unless you have the right tools, you’ll never get to those diamonds.


Content is expanding at an exponential rate. I don’t know anyone in any business who can keep up with the pace of content growth without the use of powerful search engines to find and extract relevant information. Business analysts expect content to grow 800% over the next five years. Business intelligence requires extraction of the right information, and most enterprises have both structured and unstructured data. Structured data is easy for most search engines to handle. The rub is in unstructured content, of which there is an abundance. Unstructured content is said to account for 70-80% of the data in all organizations. This type of content often takes the form of documents, email messages, health records, HTML pages, books, metadata, audio, video, and various other files. All these files have to be “cleaned up” before being fed through a search engine in order to get results with any real value or relevance.
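To make the structured-versus-unstructured distinction concrete, here is a small, illustrative sketch (not tied to any particular search engine) that turns one kind of unstructured file, a raw email message, into a structured record using Python’s standard library. The field names and sample message are my own invention:

```python
from email import message_from_string

# A raw email message: unstructured text with recoverable structure.
raw = """\
From: analyst@example.com
To: team@example.com
Subject: Q3 pipeline review
Date: Mon, 4 Apr 2011 09:15:00 -0400

The Q3 numbers are attached. Revenue is up 12% over Q2.
"""

msg = message_from_string(raw)

# Promote headers to structured, filterable fields;
# keep the free-text body for full-text indexing.
record = {
    "sender": msg["From"],
    "subject": msg["Subject"],
    "date": msg["Date"],
    "body": msg.get_payload().strip(),
}
print(record["subject"])  # Q3 pipeline review
```

Each file type in the list above (HTML, PDF, audio transcripts, and so on) needs its own extraction step like this before the engine can return relevant results.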


Mining this data is going to be essential for not just the success, but the survival of many businesses. James Kobielus, an analyst at Forrester Research, reports in an interview with ComputerWorld that businesses will increasingly turn to self-service BI throughout 2011 and beyond. “Increasingly, enterprises will adopt new Web-based interactive querying and reporting tools that are designed to put more data analytics capabilities into the hands of end users,” he said. A good search engine that can find data quickly and easily can “take the burden off IT and speed up the development of reports to a considerable degree,” Kobielus said. The information mined by a search engine tuned to specific business needs facilitates better decision making for people at every job function within the enterprise. “Because every business is a little different, and so many organizations house so much unstructured content, most search engines can’t cover everything that is needed without some customization,” said Michael McIntosh, our VP of Search Technologies at TNR Global. “Data conditioning is vital to unstructured content. Without someone paying attention to filtering out the garbage in unstructured content, you’re not going to get a good search result. The last thing a business needs is its search results working against it.”


“The jungle is dark, but full of diamonds,” said Arthur Miller. Can your search technology find the gems buried inside your own business?


For more information on how data mining and a customized search engine can move your business forward, contact us for a free consultation.


Crawling Solr

“We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing an ESP-to-Solr migration.”

Recently there has been a lively discussion in LinkedIn’s Enterprise Search Engine Professionals Group, started with this question:


“Is it a handicap for Solr to depend on third-party solutions, like Nutch, for crawling the Web?”


Our own Michael McIntosh felt compelled to respond. What follows is his post on this topic in its entirety.


“This topic makes me think of the saying ‘Write programs that do one thing and do it well.’ The longer version of this philosophy, as expressed by Doug McIlroy, is this: ‘Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.’ Solr stands very well on its own and, based upon my impression of the Solr community so far, more people currently use Solr for structured content than for unstructured content like web documents.

I think that Solr should have some ‘out of the box’ web crawler implementation available, but it should not be the core focus. It would allow new users of Solr to focus on the Solr/Lucene side of things and not have to worry about rolling their own crawler or figuring out which third-party crawling solution is best. I suspect that many people who need to do crawling can get by with a fairly basic crawler. My impression of Nutch so far is that it is more complicated than most Solr users need out of the starting gate.

That said, if you have a business that deals with large amounts of crawled unstructured content, it’s very likely you will need something more robust than you can reasonably ship and support as part of the Solr project. For one of our clients, our dataset has grown from needing just a couple of boxes to multiple clusters with many machines each. One of the newest developments is that the amount of unstructured content has grown to a size where we now need a crawler CLUSTER. When we first started, it never occurred to us that we might need multiple machines for the crawling side of the equation, but it has happened. I think our case is less common, though.

All in all, I think Solr should have a bare-bones reference implementation of a crawler that can easily be expanded upon, but it is probably not an effective use of effort for Solr developers to focus on the crawling side. Let a third party focus on the issues of crawling; it is a deceptively complicated problem.”
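For readers wondering what a “fairly basic crawler” looks like, here is a minimal, illustrative breadth-first sketch. The page fetcher is passed in as a function so that a real HTTP client, plus the deceptively complicated parts (robots.txt handling, rate limiting, retries, deduplication at scale), could be swapped in. All names and the toy site are my own illustration:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: fetch(url) returns HTML or None.
    A production crawler adds politeness, retries, and persistence."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# A tiny in-memory "site" standing in for real HTTP fetches.
site = {
    "http://ex.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://ex.com/a": '<a href="/">home</a>',
    "http://ex.com/b": "no links here",
}
pages = crawl("http://ex.com/", site.get)
print(sorted(pages))  # all three pages discovered
```

The sketch fits in a page; the jump from this to a crawler cluster handling duplicate detection, re-crawl scheduling, and access controls is exactly the gap Michael describes.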


After his post I caught him in the office and asked where he was going with this line of thinking. “We are looking at creating a suitable enterprise crawler to replace the one provided by ESP to support customers doing an ESP-to-Solr migration,” he revealed. It sounds like a very promising solution to a fairly big and common problem for companies with vast amounts of metadata. And as for unstructured content? Well, it’s the proverbial elephant in the room, don’t you think?


To see the entire conversation, with contributions from experts in the field of search architecture, click here. To get in touch with Michael directly to discuss your architecture and crawling needs, contact us.