Introducing yakbak: Record and playback HTTP interactions in NodeJS

Did you know that the new Front End of www.flickr.com is one big Flickr API client? Writing a client for an existing API or service can be a lot of fun, but decoupling and testing that client can be quite tricky. There are many different approaches to taking the backing service out of the equation when it comes to writing tests for client code. Today we’ll discuss the pros and cons of some of these approaches, describe how the Flickr Front End team tests service-dependent libraries, and introduce you to our new NodeJS HTTP playback module: yakbak!

Scenario: Testing a Flickr API Client

Let’s jump into some code, shall we? Suppose we’re testing a (very, very simple) photo search API client:

https://gist.github.com/jeremyruppel/fd25c723a5962a49936f174d765aa11a

Currently, this code will make an HTTP request to the Flickr API on every test run. This is less than desirable for several reasons:

  • UGC is unpredictable. In this test, we’re asserting that the response code is an HTTP 200, but obviously our client code needs to provide the response data to be useful. It’s impossible to write a meaningful and predictable test against live content.
  • Traffic is unpredictable. This photos search API call usually takes ~150ms for simple queries, but a more complex query or a call during peak traffic may take longer.
  • Downtime is unpredictable. Every service has downtime (the term is “four nines,” not “one hundred percent” for a reason), and if your service is down, your client tests will fail.
  • Networks are unpredictable. Have you ever tried coding on a plane? Enough said.

We want our test suite to be consistent, predictable, and fast. We’re also only trying to test our client code, not the API. Let’s take a look at some ways to replace the API with a control, allowing us to predictably test the client code.

Approach 1: Stub the HTTP client methods

We’re using superagent as our HTTP client, so we could use a mocking library like sinon to stub out superagent’s Request methods:

https://gist.github.com/jeremyruppel/8b837f439663db325aaa2437a2259934

With these changes, we never actually make an HTTP request to the API during a test run. Now our test is predictable, controlled, and it runs crazy fast. However, this approach has some major drawbacks:

  • Tightly coupled with superagent. We’re all up in the client’s implementation details here, so if superagent ever changes their API, we’ll need to correct our tests to match. Likewise, if we ever want to use a different HTTP client, we’ll need to correct our tests as well.
  • Difficult to specify the full HTTP response. Here we’re only specifying the statusCode; what about when we need to specify the body or the headers? Talk about verbose.
  • Not necessarily accurate. We’re trusting the test author to provide a fake response that matches what the actual server would send back. What happens if the API changes the response schema? Some unhappy developer will have to manually update the tests to match reality (probably an intern, let’s be honest).

We’ve at least managed to replace the service with a control in our tests, but we can do (slightly) better.

Approach 2: Mock the NodeJS HTTP module

Every NodeJS HTTP client will eventually delegate to the standard NodeJS http module to perform the network request. This means we can intercept the request at a low level by using a tool like nock:

https://gist.github.com/jeremyruppel/d92a62400f635b42249adc041cdecc96

Great! We’re no longer stubbing out superagent and we can still control the HTTP response. This avoids the HTTP client coupling from the previous step, but still has many similar drawbacks:

  • We’re still completely implementation-dependent. If we want to pass a new query string parameter to our service, for example, we’ll also need to add it to the test so that nock will match the request.
  • It’s still laborious to specify the response headers, body, etc.
  • It’s still difficult to make sure the response body always matches reality.

At this point, it’s worth noting that none of these bullet points were an issue back when we were actually making the HTTP request. So, let’s do exactly that (once!).

Approach 3: Record and playback the HTTP interaction

The Ruby community created the excellent VCR gem for recording and replaying HTTP interactions during tests. Recorded HTTP requests exist as “tapes”, which are just files with some sort of format describing the interaction. The basic workflow goes like this:

  1. The client makes an actual HTTP request.
  2. VCR sits in front of the system’s HTTP library and intercepts the request.
  3. If VCR has a tape matching the request, it simply replays the response to the client.
  4. Otherwise, VCR lets the HTTP request through to the service, records the interaction to a new tape on disk and plays it back to the client.

Introducing yakbak

Today we’re open-sourcing yakbak, our take on recording and playing back HTTP interactions in NodeJS. Here’s what our tests look like with a yakbak proxy:

https://gist.github.com/jeremyruppel/7050b34342a10d8e3dd8bc2dba0d50c0

Here we’ve created a standard NodeJS http.Server with our proxy middleware. We’ve also configured our client to point to the proxy server instead of the origin service. Look, no implementation details!

yakbak tries to do things The Node Way™ wherever possible. For example, each yakbak “tape” is actually its own module that simply exports an http.Server handler, which allows us to do some really cool things. For example, it’s trivial to create a server that always responds a certain way. Since the tape’s hash is based solely on the incoming request, we can easily edit the response however we like. We’re also kicking around a handful of enhancements that should make yakbak an even more powerful development tool.

Thanks to yakbak, we’ve been writing fast, consistent, and reliable tests for our HTTP clients and applications. Want to give it a spin? Check it out today: https://github.com/flickr/yakbak

P.S. We’re hiring!

Do you love development tooling and helping keep teams on the latest and greatest technology? Or maybe you just want to help build the best home for your photos on the entire internet? We’re hiring Front End Ops and tons of other great positions. We’d love to hear from you!

Redis Global Locks Redux

In my last post I described how we use Redis to manage a global lock that allows us to automatically failover to a backup process if there was a problem in the primary process. The method described allegedly allowed for any number of backup processes to work in conjunction to pick up on primary failures and take over processing.

Locks #1
Locks #1 by Christoph Kummer

Thanks to an astute reader, it was pointed out that the code in the blog wouldn’t actually work as advertised:

 

The Problem

Nolan correctly noticed that when the backup processes attempts to acquire the lock via SETNX, that lock key will already exist from when it was acquired by the primary, and thus all subsequent attempts to acquire locks will simply end up constantly trying to acquire a lock that can never be acquired. As a reminder, here’s what we do when we check back on the status of a lock:

function checkLock(payload, lockIdentifier) {
    client.get(lockIdentifier, function(error, data) {
        // Error handling elided for brevity
        if (data !== DONE_VALUE) {
            acquireLock(payload, data + 1, lockCallback);
        } else {
            client.del(lockIdentifier);
        }
    });
}

And here’s the relevant bit from acquireLock that calls SETNX:

    client.setnx(lockIdentifier, attempt, function(error, data) {
        if (error) {
            logger.error("Error trying to acquire redis lock for: %s", lockIdentifier);
            return callback(error, dataForCallback(false));
        }

        return callback(null, dataForCallback(data === 1));
    });

So, you’re thinking, how could this vaunted failover process ever actually work? The answer is simple: the code from that post isn’t what we actually run. The actual production code has a single backup process, so it doesn’t try to re-acquire the lock in the event of failure, it just skips right to trying to send the message itself. In the previous post, I described a more general solution that would work for any number of backup processes, but I missed this one important detail.

That being said, with some relatively minor changes, it’s absolutely possible to support an arbitrary number of backup processes and still maintain the use of the global lock. The trivial solution is to simply have the backup process delete the key before trying to re-acquire the lock (or, technically acquire it anew). However, the problem with that becomes apparent pretty quickly. If there are multiple backup processes all deleting the lock and trying to SETNX a new lock again, there’s a good chance that a race condition could arise wherein one of backups deletes a lock that was acquired by another backup process, rather than the failed lock from the primary.

The Solution

Thankfully, Redis has a solution to help us out here: transactions. By using a combination of WATCH, MULTI, and EXEC, we can perform actions on the lock key and be confident that no one has modified it before our actions can complete. The process to acquire a lock remains the same: many processes will issue a SETNX and only one will win. The changes come into play when the processes that didn’t acquire the lock check back on its status. Whereas before, we simply checked the current value of the lock key, now we must go through the above described Redis transaction process. First we watch the key, then we do what amounts to a check and set (albeit with a few different actions to perform based on the outcome of the check):

function checkLock(payload, lockIdentifier, lastCount) {
    client.watch(lockIdentifier);
    client.multi()
        .get(lockIdentifier)
        .exec(function(error, replies) {
            if (!replies) {
                // Lock value changed while we were checking it, someone else got the lock
                client.get(lockIdentifier, function(error, newCount) {
                    setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, newCount);
                });

                return;
            }

            var currentCount = replies[0];
            if (currentCount === null) {
                // No lock means someone else completed the work while we were checking on its status and the key has already been deleted
                return;
            } else if (currentCount === DONE_VALUE) {
                // Another process completed the work, let’s delete the lock key
                client.del(lockIdentifier);
            } else if (currentCount == lastCount) {
                // Key still exists, and no one has incremented the lock count, let’s try to reacquire the lock
                reacquireLock(payload, lockIdentifier, currentCount, doWork);
            } else {
                // Key still exists, but the value does not match what we expected, someone else has reacquired the lock, check back later to see how they fared
                setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, currentCount);
            }
        });
}

As you can see, there are five basic cases we need to deal with after we get the value of the lock key:

  1. If we got a null reply back from Redis, that means that something else changed the value of our key, and our exec was aborted; i.e. someone else got the lock and changed its value before we could do anything. We just treat it as a failure to acquire the lock and check back again later.
  2. If we get back a reply from Redis, but the value for the key is null, that means that the work was actually completed and the key was deleted before we could do anything. In this case there’s nothing for us to do at all, so we can stop right away.
  3. If we get back a value for the lock key that is equal to our sentinel value, then someone else completed the work, but it’s up to us to clean up the lock key, so we issue a Redis DEL and call our job done.
  4. Here’s where things get interesting: if the key still exists, and its value (the number of attempts that have been made) is equal to our last attempt count, then we should try and reacquire the lock.
  5. The last scenario is where the key exists but its value (again, the number of attempts that have been made) does not equal our last attempt count. In this case, someone else has already tried to reacquire the lock and failed. We treat this as a failure to acquire the lock and schedule a timeout to check back later to see how whoever did acquire the lock got on. The appropriate action here is debatable. Depending on how long your underlying work takes, it may be better to actually try and reacquire the lock here as well, since whoever acquired the lock may have already failed. This can, however, lead to premature exhaustion of your attempt allotment, so to be safe, we just wait.

So, we’ve checked on our lock, and, since the previous process with the lock failed to complete its work, it’s time to actually try and reacquire the lock. The process in this case is similar to the above inasmuch as we must use Redis transactions to manage the reacquisition process, thankfully however, the steps are (somewhat) simpler:

function reacquireLock(payload, lockIdentifier, attemptCount, callback) {
    client.watch(lockIdentifier);
    client.get(lockIdentifier, function(error, data) {
        if (!data) {
            // Lock is gone, someone else completed the work and deleted the lock, nothing to do here, stop watching and carry on
            client.unwatch();
            return;
        }

        var attempts = parseInt(data, 10) + 1;

        if (attempts > MAX_ATTEMPTS) {
            // Our allotment has been exceeded by another process, unwatch and expire the key
            client.unwatch();
            client.expire(lockIdentifier, ((LOCK_EXPIRY / 1000) * 2));
            return;
        }

        client.multi()
            .set(lockIdentifier, attempts)
            .exec(function(error, replies) {
                if (!replies) {
                    // The value changed out from under us, we didn't get the lock!
                    client.get(lockIdentifier, function(error, currentAttemptCount) {
                        setTimeout(checkLock, LOCK_TIMEOUT, payload, lockIdentifier, currentAttemptCount);
                    });
                } else {
                    // Hooray, we acquired the lock!
                    callback(null, {
                        "acquired" : true,
                        "lockIdentifier" : lockIdentifier,
                        "payload" : payload
                    });
                }
            });
    });
}

As with checkLock we start out by watching the lock key, and proceed do a (comparitively) simplified check and set. In this case, we’ve “only” got three scenarios to deal with:

  1. If we’ve already exceeded our allotment of attempts, it’s time to give up. In this case, the allotment was actually exceeded in another worker, so we can just stop right away. We make sure to unwatch the key, and set it expire at some point far enough in the future that any remaining processes attempting to acquire locks will also see that it’s time to give up.

Assuming we’re still good to keep working, we try and update the lock key within a MULTI/EXEC block, where we have our remaining two scenarios:

  1. If we get no replies back, that again means that something changed the value of the lock key during our transaction and the EXEC was aborted. Since we failed to acquire the lock we just check back later to see what happened to whoever did acquire the lock.
  2. The last scenario is the one in which we managed to acquire the lock. In this case we just go ahead and do our work and hopefully complete it!

Bonus!

To make managing global locks even easier, I’ve gone ahead and generalized all the code mentioned in both this and the previous post on the subject into a tidy little event based npm package: https://github.com/yahoo/redis-locking-worker. Here’s a quick snippet of how to implement global locks using this new package:

var RedisLockingWorker = require("redis-locking-worker”);

var SUCCESS_CHANCE = 0.15;

var lock = new RedisLockingWorker({
    "lockKey" : "mylock",
    "statusLevel" : RedisLockingWorker.StatusLevels.Verbose,
    "lockTimeout" : 5000,
    "maxAttempts" : 5
});

lock.on("acquired", function(lastAttempt) {
    if (Math.random() <= SUCCESS_CHANCE) {
        console.log("Completed work successfully!", lastAttempt);
        lock.done(lastAttempt);
    } else {
        // oh no, we failed to do work!
        console.log("Failed to do work");
    }
});
lock.acquire();

There’s also a few other events you can use to track the lock status:

lock.on("locked", function() {
    console.log("Did not acquire lock, someone beat us to it");
});

lock.on("error", function(error) {
    console.error("Error from lock: %j", error);
});

lock.on("status", function(message) {
    console.log("Status message from lock: %s", message);
});

More Bonus!

If you don’t need the added complexity if multiple backup processes, I also want to give credit to npm user pokehanai who took the methodology described in the original post and created a generalized version of the two-worker solution: https://npmjs.org/package/redis-paired-worker.

Wrapping Up

So there you have it! Coordinating work on any number of processes across any number of hosts couldn’t be easier! If you have any questions or comments on this, please feel free to follow up on Twitter.

Flickr flamily floto

Like this post? Have a love of online photography? Want to work with us? Flickr is hiring engineers, designers and product managers in our San Francisco office. Find out more at flickr.com/jobs.