Bishop: Makes Your Web Service Shiny


Christopher Miles, one of our Senior Software Developers here at TNR, wrote this post on Bishop.  It all started when he was asked the question:

“What happens if I send it something that’s not JSON?” … “I don’t know, but I bet it logs a really big stack trace!”

The question begged an answer, and Chris gave an extremely thorough one on his own blog. Here’s a small excerpt that gives a taste of his analysis:

After taking a closer look at HTTP and its specification, it was clear that it could do a lot more than I had thought. Looking back on past projects, it’s painfully obvious that I’ve been taking what is really an application protocol and ignoring all of the interesting bits, instead using it as little more than a pipe to push documents through. I’ve been using either the requested URL or parameters or maybe even neither, simply examining the body content, thus eliminating any of the real advantages of using HTTP in the first place.

And there are advantages. The protocol is already thinking about caching your data where it makes the most sense. There’s already an algorithm for taking the list of content types that the client wants and the content types the server provides and picking the best match. It can manage safe updating of resources as well as notifying the client of conflicts. And so on. By ignoring what the HTTP protocol has to offer, I was making more work for myself.
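To make that last point concrete, here is a minimal Python sketch of the Accept-header negotiation the protocol already defines, the kind of decision a library like Bishop is meant to take off your hands. This is not Bishop’s API, and it simplifies wildcard subtypes and media-type parameters other than q:

# Rough sketch of HTTP content negotiation: pick the best match between
# the types the client accepts and the types the server can produce.

def parse_accept(header):
    """Parse an Accept header into (media_type, quality) pairs."""
    choices = []
    for item in header.split(","):
        parts = [piece.strip() for piece in item.split(";")]
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                quality = float(param[2:])
        choices.append((parts[0], quality))
    return choices

def best_match(accept_header, available):
    """Return the server-provided type the client prefers most, or None."""
    best, best_quality = None, 0.0
    for media_type, quality in parse_accept(accept_header):
        for candidate in available:
            if media_type in ("*/*", candidate) and quality > best_quality:
                best, best_quality = candidate, quality
    return best

print(best_match("application/xml;q=0.8, application/json",
                 ["application/json", "text/html"]))  # -> application/json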

So he decided to take matters into his own hands by creating a library. More from Chris’ blog:

The idea is to provide a relatively small library that will make your life easier and hopefully more pleasant by making it straightforward to provide a consistent web service API that obeys HTTP semantics. It will make the lives of those around you easier as well: clients can expect your service to respond to the common HTTP request methods with reasonable responses. Placing caches around your service will also be much simpler, and you’ll have some level of control over how your service’s data is cached.

Since Chris created this library, other developers have responded positively and are watching the project. If you would like to see our approach to solving this problem, take a look for yourself here.

If you’d like to talk to us about how we can solve some of your enterprise search, cloud, or scalability issues, contact us.

New Systems and DevOps Blog


Our VP of Systems Administration, Michael Klatsky, has started a blog specifically discussing systems. Fresh from the AWS Summit 2012 in NYC, Michael has lots of new approaches to discuss in terms of systems, cloud computing, DevOps, system architecture, and how developers and systems staff need to communicate well and work together for the best results in web development. The blog is his own, but we feel it’s a great technical resource for our colleagues in systems and web development. You can take a look at his blog here. Michael welcomes commentary and discussion, and hopes to provide some shortcuts for fellow system administrators.

Continuous Integration for Large Search Solutions

Managing large projects takes a smart approach and some intuitive thinking. One project we are currently engaged in is with a large publisher of manufacturing parts. This has been an extraordinary project due to its scale and ever-changing scope. I spoke with our VP of Enterprise Search Technologies, Michael McIntosh, about how TNR Global handles complex projects.

Karen: This project is a big one. Tell me more about the site’s function. What is the focus?

Michael: Product search is the focus. The site contains tens of millions of documents, both structured and unstructured content. They also have a huge amount of data provided by the advertisers and the companies themselves on the products that they sell. One of the advantages we have over a search engine like Google is access to a vast amount of proprietary data provided by the vendors themselves.

Karen: Tell me about how you are managing the project.  What are some of the variables you work with?

Michael: With this particular project, we are dealing with many different data feeds. There are many different intermediary metadata stages we have to generate to support the final searchable content. The client also changes their business logic frequently enough that if it takes a month or more between data builds, it’s likely something has changed. For instance, they might have changed an XML format or added an attribute to an element in the data feed that will break something else down the line. The problem is that there are so many moving parts it’s almost impossible to do everything manually and always do it correctly.

Karen: What other kinds of business logic changes are you dealing with on top of the massive amounts of raw data?

Michael: Most of the business logic changes come when they need to modify how something behaves based on new data that’s available, or when they need to start treating the data in a different way. Sometimes there is a change in the way they want the overall system to behave. They also have some content classification rules that they like to tweak occasionally.

Another thing we consider is the client’s relevancy scoring and query pre-processing rules. You need to consider: if you issue a query and it fails, what happens then? What kind of fallback query do you use? All these things are part of the business logic that is independent of the raw data. In summary, we have the raw data and we can do a number of things with it. They often want us to change exactly what we’re doing with it, how we’re conditioning it, and how we’re transforming it. We either tweak what exists or take advantage of new data that they’ve started including in their data feeds. The challenge is that all these elements can change frequently.
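As a purely hypothetical illustration of the fallback idea, a query pipeline might try the strictest form first and relax it step by step. The client object, field names, and query shapes below are made up for the example, not the client’s actual system:

# Hypothetical fallback-query sketch: attempt the strictest query first,
# then progressively relax it until something returns results.

def search_with_fallback(search_client, terms):
    attempts = [
        {"query": " AND ".join(terms), "fields": ["title", "part_number"]},  # strict
        {"query": " AND ".join(terms), "fields": ["title", "body"]},         # broader fields
        {"query": " OR ".join(terms),  "fields": ["title", "body"]},         # match any term
    ]
    for attempt in attempts:
        try:
            results = search_client.search(**attempt)
        except Exception:
            continue  # treat an engine error like an empty result and fall back
        if results:
            return results
    return []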

Karen: This site is more of a portal than strictly an enterprise search project, isn’t it?

Michael: Yes. Enterprise search usually refers to searching for documents within an organization. This client runs a public-facing search engine that allows the public to perform product searches across a very large number of vendors and service providers.

Changes come from their advertisers and the data they provide. Advertisers come and go. People pay for placement within certain industrial categories. It’s not like we get a static list of sites to crawl and that’s that; the list of sites we crawl changes weekly, sometimes daily. Things also need to be purged from the index. Say an advertiser’s contract ends and suddenly we need to stop crawling a site with thousands of documents; that data needs to be purged from the index promptly. Not only do we have to crawl new sites but purge old ones as well. This project is so massive that it’s not cut and dried. A lot of software development projects focus on a clear-cut problem: come up with a plan, tackle it, release it, and then maintain it. We’re constantly getting new information and learning new things about the people hitting the site.

Karen: So it sounds like this project is always in a state of ongoing development.

Michael: We are building something that’s never been built before. One of the goals is to make this site remarkable. And we’re very excited to be a part of that. The scale of the project is quite big though, which is why we started using Continuous Integration.

The way our cycles work is that we perform big data updates, but by using CI we can continuously update and integrate new data. We’re moving to a place where, through the practice of CI, we can perform daily builds, which gives us the time we need to fix problems before the data absolutely needs to be live.

Karen: How do you implement CI in your day-to-day management of the project?

Michael: There are some pretty great open source tools that we’re using to implement CI. We use Jenkins to help us do Continuous Integration for frequent data builds, which is an intensive process for this particular client.

We field questions from the client about the status of different data builds. We hope to use Jenkins in conjunction with other tools to build data automatically and to have event-based data builds, where a build is triggered by some other event and Jenkins automatically generates reports as the data is being built. Each time we run a build script, if the output differs from the previous build, Jenkins makes it easy to see that something is different; there is a way to format your output so that Jenkins can understand it. One of the cool things about Jenkins is its graphs, which illustrate differences and help us identify issues that could pose a potential problem and fix them before we need to go live with the data.
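As an illustration only (the file names and the 10% threshold are assumptions, not our actual pipeline), a data-build check running under Jenkins might compare the current build against the previous one and exit non-zero when something looks off, so Jenkins flags the build instead of letting questionable data reach the live index:

# Hypothetical data-build sanity check for a CI job: compare per-feed
# record counts between the previous and current builds and fail the
# build if any feed changed by more than 10%.

import json
import sys

def load_counts(path):
    with open(path) as handle:
        return json.load(handle)  # e.g. {"products": 1200000, "vendors": 35000}

def main():
    previous = load_counts("previous_build_counts.json")
    current = load_counts("current_build_counts.json")
    failed = False
    for feed, old_count in previous.items():
        new_count = current.get(feed, 0)
        if old_count and abs(new_count - old_count) > 0.10 * old_count:
            print("Feed %s changed from %d to %d records" % (feed, old_count, new_count))
            failed = True
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()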

Karen: Any other tools?

Michael: For multi-node search clusters, we’re using a tool called fabric3 that uses SSH to copy data and execute scripts across multiple nodes of a cluster based upon roles. We have a clever setup where we’re able to tell fabric3 which services are running on each node in our cluster and link actions or commands to certain tasks, like building metadata. By linking them, fabric3 automatically knows which nodes to deploy data to.
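To give a flavor of that role-based setup, here is a rough sketch using a Fabric-style API (fabric3 keeps Fabric’s interface); the host names, roles, and script paths are made up, not our actual fabfile:

# Sketch of role-based deployment with Fabric/fabric3: roledefs map hosts
# to roles, and each task runs only on the hosts holding its role.

from fabric.api import env, roles, run, put

env.roledefs = {
    "metadata": ["build01.example.com"],
    "search":   ["search01.example.com", "search02.example.com"],
}

@roles("metadata")
def build_metadata():
    run("/opt/build/generate_metadata.sh")

@roles("search")
def deploy_index():
    put("build/index.tar.gz", "/var/search/incoming/")
    run("/opt/search/load_index.sh /var/search/incoming/index.tar.gz")

Running “fab build_metadata deploy_index” then executes each task only on the nodes that hold the matching role.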

Using open source tools like Jenkins and fabric3 makes things a lot more manageable considering the large number of moving parts. It has allowed us to be successful in building this incredible site and keeping the search function relevant, accurate, and up to date.

How to get the MongoDB server version using PyMongo

If you’re using server-side features of MongoDB that have a minimum version requirement (like pushing a unique value to a list with $addToSet), it is a good idea to make sure you have the required version running on the server. To check the version of the MongoDB server using PyMongo, you can use something like this:

import pymongo

# Connect using the (then-current) PyMongo Connection API
connection = pymongo.Connection()

# Compare numeric components so that, e.g., version 1.10.0 sorts after 1.9.0
serverVersion = tuple(int(part) for part in connection.server_info()['version'].split('.'))
requiredVersion = tuple(int(part) for part in "1.3.3".split("."))
if serverVersion < requiredVersion:
    # handle the error (this snippet assumes it runs inside a function)
    return 1
...

It’s important to note that you must connect to the admin database to determine the version number. Otherwise, you will probably run into something like this:

pymongo.errors.OperationFailure: command SON([('buildinfo', 1)]) failed: access denied

If you need to check the version of the server from the interactive prompt, run the following from the mongo prompt:

> db.version()
1.4.2

Transparent MySQL migration using MySQL proxy

How can we transparently migrate MySQL from one server to another when we don’t want to disrupt end users? That was the question posed as we came to the final phase of decommissioning a server. We have transitioned almost all services away from the older server (CHIMAY), but there is one external cron job, not under our control, that we can see in the logs generating several MySQL queries. Therefore, we need to transparently route MySQL traffic through another server (TECATE).

Busy holidays, and back to blogging

It’s been a long, busy holiday, and now it’s time to resume blogging. My latest interest is testing our MySQL servers using mysqlslap. While a great tool, it is unfortunately only distributed with MySQL 5.1.4 and above. However, many of our servers are on the 5.0 release, with some on the 4.x release. What I did was grab the latest 5.1.x version of MySQL, compile it statically, and test it against a MySQL 5.0.45 server, and it worked just fine. I have yet to test it on another machine, or against a version other than 5.0.45. I’ll update when I do.

Here are the steps I took:

1) wget "http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.1.30.tar.gz/from/http://mysql.llarian.net/"

2) tar -xf mysql-5.1.30.tar.gz && cd mysql-5.1.30

3) ./configure --without-server --disable-shared

4) make && cp -i ./client/mysqlslap /usr/local/bin/

Securely specify mysql credentials in automated scripts

Often, you may want to run a script that uses a username and password to access data in a MySQL database. Securely running a script like this manually is easy – simply use the ‘-p’ option for the MySQL client, and it will prompt you for the password. However, this is not an option if you want to automate the script.

There are several ways to provide the password so that it can be used by automated scripts, but only one that is both flexible and secure. You can specify the password on the command line itself (with ‘mysql -p’ followed immediately by the password); however, this allows the password to be seen by other users who run commands like ‘ps’. Another option is setting the environment variable MYSQL_PWD to the password, but this can also be seen by other users.
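One widely used pattern, sketched below with assumed paths and database names (and not necessarily the exact approach the full post settles on), is to keep the credentials in a MySQL option file that only the script’s user can read and let the client library pick them up from there, so the password never appears on the command line or in the environment:

# Sketch of the option-file approach using MySQLdb. The option file
# (e.g. /home/deploy/.my.cnf, chmod 600) would contain something like:
#   [client]
#   user = reporting
#   password = s3cret
# Paths, database, and query below are illustrative assumptions.

import MySQLdb

connection = MySQLdb.connect(
    read_default_file="/home/deploy/.my.cnf",  # credentials come from the [client] section
    db="reports",
)

cursor = connection.cursor()
cursor.execute("SELECT COUNT(*) FROM orders")  # example query
print(cursor.fetchone()[0])
connection.close()

The same option file also works for the command-line client via ‘mysql --defaults-extra-file=/home/deploy/.my.cnf’, so cron jobs and interactive use can share one set of credentials.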