Continuous Integration for Large Search Solutions

Managing large projects takes a smart approach and some intuitive thinking. One project we are currently engaged in is with large publisher of manufacturing parts. This has been an extraordinary project due to its scale and ever changing scope. I spoke with our VP of Enterprise Search Technologies, Michael McIntosh about how TNR Global handles complex projects.

Karen: This project is a big one. Tell me more about the site’s function. What is the focus?

Michael: Product search is the focus. The site contains tens of millions of documents, both structured and unstructured content. They also have a huge amount of data provided by the advertisers and the companies themselves on products that they sell. One of the advantages we have over a search engine like Google is access to a vast amount of propriety data provided by the vendors themselves.

Karen: Tell me about how you are managing the project.  What are some of the variables you work with?

Michael: With this particular project, we are dealing with many different data feeds. There are many different intermediary metadata stages we have to generate to support the final searchable content.  The client also changes their business logic frequently enough that if it takes a month or more between data builds its likely something has changed. For instance, they might have changed an XML format or added an attribute to an element in the data feed that will break something else down the line. The problem is there are so many moving parts, it’s almost impossible to do it manually and always do it correctly.

Karen: What other kinds of business logic changes are you dealing with in top of the massive amounts of raw data?

Michael: Most of the business logic changes are when they need to modify how something behaves based on new data that’s available, or when they need to start treating the data in a different way.  Sometimes there is a change in the way they want the overall system to behave. They sometimes have some classification rules for content they like to tweak occasionally.

Another thing we consider is the client’s relevancy scoring and query pre-processing rules. So you need to consider if you issue a query and it fails, what happens then? What kind of fallback query do you use?  All these things are part of the business logic that is independent of the raw data. In summary, we have the raw data and we can do a number of things with it. They often want us to change exactly what we’re doing with it, how we’re conditioning it, and how we’re transforming it. We either tweak what exists or take advantage of new data that they’ve started including in their data feeds. The challenge is all these elements can change frequently.

Karen: This site is more of a portal than strictly an enterprise search project, isn’t it?

Michael: Yes. Enterprise search usually refers to searching for documents within an organization. This client is a public facing search engine that allows the public to perform product search across a very large number of vendors and service providers.

Changes come from their advertisers and data they provide. Advertisers come and go. People pay for placement within certain industrial categories. It’s not like we get a static list of sites to crawl and that’s that. It changes weekly, sometimes daily. This list of sites we crawl is on a weekly or daily basis. Also things need to be purged from the index. Say an advertiser’s contract ends and suddenly we need to stop crawling a site with thousands documents; that data needs to be purged from the index promptly. Not only do we have to crawl new sites but purge old ones as well. This is a project that is so massive that it’s not cut and dried. A lot of software development projects focus on a clear cut problem, come up with plan, tackle it, release it, and then maintain it. We’re constantly getting new information and learn new things about people hitting the site.

Karen: So this sounds like this project is always in a state of ongoing development.

Michael: We are building something that’s never been built before. One of the goals is to make this site remarkable. And we’re very excited to be a part of that. The scale of the project is quite big though, which is why we started using Continuous Integration.

The way our cycles work is we perform big data updates, but by using CI, we can continuously update and integrate new data. We’re moving to a place, by using the practice of CI, we can perform a daily builds which gives us the time we need to fix problems before we absolutely need it to be live.

Karen: How do you implement CI into your day to day management of the project?

Michael: There are some pretty great open source tools that we’re using to implement CI. We use Jenkins to help us do Continuous Integration for frequent data builds, which is an intensive process for this particular client.

We field questions from the client about the status of different data builds. We hope to use Jenkins in conjunction with other tools to automatically build data and have event-based data builds. We’re looking at a way to have it triggered by some other event and have Jenkins automatically generate reports as the data is being built. Each time we run a build script, if the output differs from the previous build, Jenkins makes it easy for you to see that something is different. There is a way to modify your output that Jenkins can understand.  One of the cool things about Jenkins is they have graphs that illustrate differences to help us identify issues that could pose a potential problem and let us fix it before we need to go live with the data.

Karen: Any other tools?

Michael: For multi-node search clusters, we’re using a tool called fabric3 that uses SSH to copy data and execute scripts across multiple nodes of a cluster based upon roles. We have a clever set up where we’re able to inform fabric3 what services are running on each node in our cluster and have actions or commands linked to certain tasks, like building metadata.  By linking them, they automatically know which nodes to deploy data to.

Using open source tools like Jenkins and fabric3 make it a lot more manageable considering the large number of moving parts. It’s allowed us to be successful in building this incredible site and making the search function relevant, accurate and up to date.

How to get the MongoDB server version using PyMongo

If you’re using server-side features of MongoDB that have a minimum version requirement (like pushing a unique value to a list), it is a good idea to make sure you have the required version running on the server. To check the version of the MongoDB server using PyMongo, you can use something like this:

import pymongo

connection = pymongo.Connection()
serverVersion = tuple(connection.server_info()['version'].split('.'))
requiredVersion = tuple("1.3.3".split("."))
if serverVersion < requiredVersion:
    # handle the error
    return 1
...

It’s important to note that you must connect to the admin database to determine the version number. Otherwise, you will probably run into something like this:

pymongo.errors.OperationFailure: command SON([('buildinfo', 1)]) failed: access denied

If you need to check the version of the server from the interactive prompt, run the following from the mongo prompt:

> db.version()
1.4.2

A better way to add or update MySQL rows

Recently, we needed to iterate over a fairly large data set (on the order of millions) and do the ever-common If it’s not in the database, put it in.  If it’s already there, just update some fields. It’s a pattern that is very common for things like log files (where, for example, only a timestamp needs to be updated in some cases).

The obvious way of doing a SELECT, followed by either an UPDATE or an INSERT is too slow for even moderately-large datasets.  The better way to accomplish this is to use MySQL’s ON DUPLICATE KEY UPDATE directive.  By simply creating a unique key on the fields that should be different per-row, this syntax provides two specific benefits:

  • Allows batch (read: transaction) queries for large data
  • Increases performance overall versus making two separate queries

These benefits are especially helpful when your dataset is too large to fit into memory.  The obvious drawback to this method, however, is that it may put additional load on your database server.  Like anything else, it’s worth testing out your individual situation but, for us, ON DUPLICATE KEY UPDATE was the way to go.

Transparent MySQL migration using MySQL proxy

How can we transparently migrate MySQL from one server to another when we don’t want to disrupt end users? That was the question posed as we come to the final phase of decommissioning a server. We have transitioned almost all services away from the older server (CHIMAY)- but there is one external cron that is not under our control that we can see in the logs which generates several MySQL queries. Therefore- we need to transparently move MySQL through another server (TECATE). Here’s the scenario: Continue reading “Transparent MySQL migration using MySQL proxy”

Mysql table opens, caching and iowait times

One issue that seems to be very common when running a MySQL server is high iowait. As we have delved into our servers, one of the main causes we have found is that our table_cache was set too low, causing an ever increasing number of table opens

One issue that seems to be very common when running a MySQL server is high iowait. As we have delved into our servers, one of the main causes we have found is that our table_cache was set too low, causing an ever increasing number of table opens. As we increased the table_cache, and watched for higher hit percentages, we were stumped regarding how many table opens in a given time period were acceptable. A number which I use as a guideline was given to me during a MySQL training course this year. We use a guideline of no more than 10 table opens per hour. More than that- and we need to increase table_cache.