Optimizing Caching: Twemproxy and Memcached at Flickr

Introduction

Flickr places a high priority on our users’ experience, and a critical part of that experience is the speed of the interface.  Regardless of the client you use to access Flickr, caching the right data, and accessing that cached data quickly, is critical to delivering a quality user experience.  The more effective our caching strategy is, the better the Flickr experience will be for our users.  This is true for all the layers of caching we deploy at Flickr, from the photo caches to the process data caches buried deep in the system.  In a previous post, we looked at how regional photo caching improved photo serving time in other countries.  In this post, we’re going to dive down into the innards of Flickr’s software stack and take a look at how we improved Memcached performance for our backend systems.

Back in the olden days (pre-2014), we accessed our Memcached systems through a mix of direct reads from our web servers and writes through a Flickr-developed proxy.  Our proprietary proxy system, Cerberus, handled a whole host of responsibilities.  In addition to Memcached set operations, Cerberus managed database updates, the bulk of our Redis accesses, cache consistency (which is why we directed our writes through Cerberus), and a few other miscellaneous transactions.  As Flickr’s traffic and functionality grew, Memcached set operation performance wasn’t keeping up, so we needed to consider how to address the gap.

Cerberus Based Memcached Architecture

Since the development of Cerberus, the software landscape had drastically changed.  When we developed Cerberus, there was no comparable software available, but now several open source projects exist that provide similar proxy services.  On top of the availability of open source tools, Flickr’s traffic and usage patterns had changed over the course of a decade, changing our requirements for a proxy system.  Needless to say, we had a lot of questions to ask ourselves before we dove into revising the caching architecture.

After years of operation, we had a good picture of the strengths and weaknesses of our current architecture, so when we started thinking about revising it, a few lessons from the past years really stood out.  One thing we learned over the years was that Cerberus’ lack of a single purpose made it difficult to troubleshoot operational issues with Memcached.  Due to the lack of isolation, a downstream issue with a user database could impact Memcached access time.  Whatever came next had to isolate cache requests from other data accesses.

Experience had shown us that Memcached get operation latency was a key performance metric, and we learned through trial and error that placing our Cerberus proxy between the web servers and the Memcached hosts added more network latency than we were willing to tolerate.  Unfortunately, our options for connection pooling and more efficient use of connections were limited in PHP, so we had little recourse but to suffer a high connection load and fluctuating connections against our Memcached servers.  The next-generation system would have to carefully monitor get operation timings and ensure we didn’t introduce more latency into the process.

So as 2014 rolled around, we started to look into an alternative to Cerberus for accessing the Memcached systems.  Should we build a Cerberus 2.0?  Should we look at an open source alternative?  As we weighed our options, one alternative stood out: Twitter’s Twemproxy, which had been used quite successfully in other parts of Yahoo and throughout the industry.

Introducing Twemproxy

Twemproxy is a lightweight daemon that proxies requests to a pool of Memcached instances.  It provides the following features, which we believed would improve our caching infrastructure (a configuration sketch follows the list):

  • Consistent hashing:  Determining which hosts in the ring to store and retrieve cache data on.  While we had implemented consistent hashing before Twemproxy, it was previously left to the client libraries to sort out.
  • Connection pooling:  Reuse of network connections to the Memcached instances, cutting down significantly on the connection load to the Memcached daemons.
  • Command Pipelining:  Accumulating requests destined for the same host and sending as a combined payload.  This feature further reduces connection load and network overhead to the cache processes.
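To make this concrete, here is a minimal sketch of a Twemproxy (nutcracker) pool definition.  The pool name, hosts, and tuning values below are illustrative, not our production settings:

# Hypothetical Memcached pool; hosts, ports, and timeouts are examples only.
memcached_pool:
  listen: 127.0.0.1:11211        # local port the web server's clients talk to
  hash: md5                      # key hash function
  distribution: ketama           # consistent hashing across the ring
  auto_eject_hosts: true         # temporarily drop hosts that keep failing
  server_retry_timeout: 30000    # ms to wait before retrying an ejected host
  server_failure_limit: 3        # consecutive failures before ejection
  servers:                       # the Memcached ring, with per-host weights
    - 10.0.0.1:11211:1
    - 10.0.0.2:11211:1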

Resulting Architecture

We implemented a solution where each web server host ran two local Twemproxy instances, forwarding requests to the Memcached rings.

Twemproxy Based Memcached Architecture

In the resulting architecture, all Memcached operations go through Twemproxy.  The change accomplished many goals, including:

  • Providing a dedicated system for Memcached requests that was isolated from other systems
  • Reducing the connection load on our Memcached servers through Twemproxy’s connection pooling.  We experienced a 75% overall reduction in TCP connections to Memcached nodes
  • Improving overall caching latency.  This was a benefit that we didn’t necessarily expect.  With Twemproxy, we found that get operations had a 5% reduction in mean processing time and set operations a 40% reduction

The Road to Twemproxy

As nice as it would be to say we dropped in Twemproxy, declared victory and went for ice cream, we still had to solve a few interesting challenges along the way: maintaining availability, dealing with disparate consistent hashing schemes, and re-implementing cache coherency.

If One is Good, Two is Even Better

From the start, we recognized that Twemproxy’s simple daemon model would need to be managed carefully.  Each deploy through our continuous deployment system could result in changes to the Memcached host configuration, and unfortunately, configuration changes for Twemproxy require a process restart to take effect.  We measured the time to restart Twemproxy in the 1-2 second range, but even for these necessary restarts, that was too much of an interruption for the clients.

Our solution was to run two instances of the daemon on every host that needed the service and carefully synchronize their restarts.  This restart “dance” was wired into the process that deploys configuration changes to all the Memcached clients.  A couple of patches have been proposed to allow reconfiguring Twemproxy without a restart, but none of them has yet made it into the master branch.
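As a rough illustration, the dance for a single host looks something like the sketch below.  The pid files, config paths, and stats ports are hypothetical; in practice the sequence is driven by our deployment tooling:

# Restart instance A while its sibling B keeps serving, then repeat for B.
kill "$(cat /var/run/nutcracker-a.pid)"
nutcracker --daemonize --conf-file /etc/nutcracker/a.yml \
           --pid-file /var/run/nutcracker-a.pid --stats-port 22222
sleep 2   # give instance A time to bind its ports
# ...verify instance A is healthy (e.g. poll its stats port) before
# restarting instance B the same way.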

When is Ketama not Ketama?

Ketama is a popular consistent hashing algorithm used by many systems to determine where to place a particular key in a multi-node caching system.  Out of the box, Twemproxy uses an implementation of Ketama that is compatible with libketama, the C library that is the algorithm’s most commonly used implementation.

Our initial implementation of consistent hashing also used Ketama, but through the Spymemcached Java library.  It turns out that Spymemcached’s implementation has a slight variation that makes it incompatible with Twemproxy’s.

Our transition from the existing system to Twemproxy had to happen live, and a sudden change in the hashing algorithm would have had a painful (and unacceptable) impact on our database systems.  How could we get across this bridge?  Ultimately, we had to patch Twemproxy’s implementation of Ketama to match Spymemcached’s, maintaining a consistent hash ring throughout the migration.
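To illustrate just how easy it is for two “Ketama” implementations to diverge, here is a simplified Java sketch of libketama-style ring construction; it is not our patch.  The string hashed for each virtual node (here “host:port-i”) is exactly the kind of detail on which libraries disagree and silently remap every key:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified Ketama ring: 40 MD5 hashes per server, 4 points per hash.
public class KetamaSketch {
    static SortedMap<Long, String> buildRing(String[] servers) throws Exception {
        SortedMap<Long, String> ring = new TreeMap<>();
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String server : servers) {
            for (int i = 0; i < 40; i++) {
                // The format of this string is where implementations
                // (e.g. Spymemcached vs. libketama) tend to differ.
                byte[] d = md5.digest((server + "-" + i).getBytes(StandardCharsets.UTF_8));
                for (int h = 0; h < 4; h++) {
                    long point = ((long) (d[3 + h * 4] & 0xFF) << 24)
                               | ((long) (d[2 + h * 4] & 0xFF) << 16)
                               | ((long) (d[1 + h * 4] & 0xFF) << 8)
                               |  (long) (d[h * 4] & 0xFF);
                    ring.put(point, server);
                }
            }
        }
        return ring;
    }

    // A key lands on the first point at or after its hash, wrapping around.
    static String locate(SortedMap<Long, String> ring, long keyHash) {
        SortedMap<Long, String> tail = ring.tailMap(keyHash);
        return ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
    }
}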

Redis latency in propagating cache clears

Until we figure out how to change the speed of light, the only way we are going to make Flickr fast across the world is through multiple data centers conveniently located near our users.  While this is way easier than changing the speed of light, it’s not without its complications.

Caches between the data centers have to be kept consistent.  Some caches, like photo caches, deal with immutable data and are easy to keep in sync; others, like our Memcached systems, hold read-write data, which is harder.  Our approach to cache consistency in Memcached was to invalidate stale keys in the other colo facilities whenever a process updated a value.  As we mentioned previously, Memcached write operations were directed through our Cerberus proxy specifically so that Cerberus could dispatch a cache invalidation event to the other colo facilities.  The migration to Twemproxy would not be complete until we implemented a new solution for cache invalidation.

In our Twemproxy-based architecture, we decided to take the responsibility for cache invalidation out of the hands of the data proxy and push it into the client libraries we used to access Memcached.  Whenever a client updates a Memcached key, it enqueues a corresponding cache invalidation event into a distributed Redis queue.  We then deployed simple, single-purpose Java daemons to process the cache invalidation events from the Redis queue and delete the corresponding keys from their local Memcached systems.  A diagram of the system appears below:

Cache Clear Architecture
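A minimal sketch of such a daemon, assuming the Jedis and Spymemcached client libraries; the queue name and addresses are illustrative:

import java.net.InetSocketAddress;
import java.util.List;
import net.spy.memcached.MemcachedClient;
import redis.clients.jedis.Jedis;

// Single-purpose daemon: pop invalidation events, delete local cache keys.
public class CacheClearDaemon {
    public static void main(String[] args) throws Exception {
        Jedis redis = new Jedis("redis.example.com", 6379);
        MemcachedClient memcached = new MemcachedClient(
                new InetSocketAddress("127.0.0.1", 11211)); // local Twemproxy
        while (true) {
            // BLPOP blocks until an event arrives; returns [queue, value].
            List<String> event = redis.blpop(0, "cache-clear-queue");
            memcached.delete(event.get(1)); // drop the stale entry locally
        }
    }
}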

The wrinkle with this approach was that enqueuing clear keys would occasionally take 20 times longer than the mean, pushing cache sets up to 40ms.  After much digging, we found that the spikes happened when the clearing daemons dequeued a batch of keys.  The dequeuing daemons were operating across a WAN and, due to the single-threaded nature of Redis, their reads would periodically block the queue against writes for tens of milliseconds.  Once we figured that out, the fix was a matter of keeping separate in- and out-queues and moving keys from one to the other with a local daemon, which significantly reduced the time spent blocked on writing keys.
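The local mover can be as simple as an atomic blocking pop-and-push between the two queues.  A sketch with Jedis, with illustrative queue names:

import redis.clients.jedis.Jedis;

// Drains the local in-queue into the out-queue that remote daemons read,
// so slow WAN readers never hold up local writers.
public class QueueMover {
    public static void main(String[] args) {
        Jedis redis = new Jedis("127.0.0.1", 6379);
        while (true) {
            // Blocks until a key is available, then moves it atomically.
            redis.brpoplpush("cache-clear-in", "cache-clear-out", 0);
        }
    }
}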

Conclusion

Caching is crucial to a high-traffic site like Flickr, and we have taken a big stride in making our Memcached utilization more effective.  Using Twemproxy, we were able to clean up an internal system, reduce the connection load on our caching daemons, and even make modest improvements to caching latency for all clients.  Although we faced some technical challenges in implementing Twemproxy for Memcached, particularly in propagating cache clear events, it was ultimately well worth the engineering investment.  After several months, our implementation of Twemproxy has proven to make a positive contribution to caching speed and, ultimately, to the experience of a responsive site for our users.

If you dream in low latency and love to rip that extra 10 microseconds of overhead out of an operation, we’d love to have you! Stop by our Jobs page and tell us how awesome you are.

Exploring Life Without Compass

Compass is a great thing. At Flickr, we’re actually quite smitten with it. But being conscious of your friends’ friends is important (you never know who they’ll invite to your barbecue), and we’re not so sure about this “Ruby” that Compass is always hanging out with. Then there’s Ruby’s friend Bundler who, every year at the Christmas Party, tells the same stupid story about the time the police confused him with a jewelry thief. Enough is enough! We’ve got history, Compass, but we just feel it might be time to try seeing other people.

Solving for Sprites

In order to find a suitable replacement (and for a bit of closure), we had to find out what kept us relying on Compass for so long. We knew the big one coming into this great experiment: sprites. Flickr is a huge site with many different pages, all of which have their own image folders that need to be sprited together. There are a few different options for compiling sprites on your own, but we liked spritesmith for its multiple image rendering engines. This gives us some flexibility in dependencies.

A grunt task is available for spritesmith, but it assumes you are generating only one sprite. Our setup is a bit more complex, and we’d like to keep our own sprite mixin intact so we don’t actually have to change a line of code. With spritesmith and our own runner to iterate over our sprite directories, we can easily create the sprites and output the dimensions and URLs via a simple Handlebars template to a Sass file.

{{#each sprites}}
    {{#each images}}
        %{{../dir}}-{{name}}-dimensions {
            width: {{coords.width}}px;
            height: {{coords.height}}px;
        }
        %{{../dir}}-{{name}}-background {
            background: image-url('{{../url}}') -{{coords.x}}px -{{coords.y}}px no-repeat;
        }
    {{/each}}
{{/each}}

You could easily combine these rules into a single declaration, but we have some added flexibility in mind for our mixin.

It’s important to note that, because we’re using placeholders (the % syntax in Sass), nothing is actually written out unless we use it. This keeps our compiled CSS nice and clean (just like Compass)!
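Putting it together, our runner amounts to iterating the sprite directories, spriting each one, and rendering the template. Here’s a simplified sketch; the directory names, paths, and template location are illustrative, and error handling is elided:

var fs = require('fs');
var path = require('path');
var spritesmith = require('spritesmith');
var Handlebars = require('handlebars');

var template = Handlebars.compile(fs.readFileSync('sprites.hbs', 'utf8'));
var spriteDirs = ['activity', 'search', 'upload']; // one folder per page
var sprites = [];
var pending = spriteDirs.length;

spriteDirs.forEach(function (dir) {
    var src = fs.readdirSync(path.join('img', dir)).map(function (f) {
        return path.join('img', dir, f);
    });
    spritesmith({ src: src }, function (err, result) {
        if (err) throw err;
        // Write the combined image, then record coordinates for the template.
        fs.writeFileSync(path.join('build', dir + '.png'), result.image, 'binary');
        sprites.push({
            dir: dir,
            url: 'build/' + dir + '.png',
            images: Object.keys(result.coordinates).map(function (img) {
                return { name: path.basename(img, '.png'),
                         coords: result.coordinates[img] };
            })
        });
        if (--pending === 0) {
            fs.writeFileSync('sass/_sprites.scss', template({ sprites: sprites }));
        }
    });
});

With the generated Sass file in place, our mixin can consume it: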

@import 'path/to/generated/sprite/file';

@mixin background-sprite($icon, $set-dimensions: false) {
    @extend %#{$spritePath}-#{$icon}-background;

    @if $set-dimensions == true {
        @extend %#{$spritePath}-#{$icon}-dimensions;
    }
}

Here, our mixin uses the Sass file we generated to provide powerful and flexible sprites. Note: Although retina isn’t shown here, adding support is as simple as extending the Sass mixin with appropriate media queries. We wanted to keep the example simple for this post, but it gives you an idea of just how extensible this setup is!
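For example, a hypothetical icon class might opt in to both placeholders like so:

// Hypothetical usage: the second argument also sets the icon's dimensions.
.upload-icon {
    @include background-sprite(upload, true);
}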

Now that the big problem is solved, what about the rest of Compass’s functionality?

Completing the Package

How do we account for the remaining items in the Compass toolbox? First, it’s important to find out just how many mixins, functions, and variables are used. An easy way to do that is to compile with Sass alone and see how much it complains!


sass --update assets/sass:some-temp-dir

Depending on the complexity of your app, you may see quite a lot of these errors.


error assets/css/base.scss (Line 3: Undefined mixin 'font-face'.)

In total, we were missing 16 mixins provided by Compass (and a host of variables). How do we replace all that great mixin functionality? node-bourbon, which provides mixins of the same name, is a nice drop-in replacement.

What is the point of all this work again?

The Big Reveal

Now that we’re comfortably off Compass, how exactly are we going to compile our Sass? Well, try not to blink, because this is the part that makes it all worthwhile.

Libsass is a blazing-fast C/C++ port of the Sass compiler, and modules like node-sass provide bindings to it.
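A sketch of compilation through node-sass, with node-bourbon’s stylesheets on the include path; the paths are illustrative, and the exact render API varies between node-sass versions:

var fs = require('fs');
var sass = require('node-sass');       // libsass bindings
var bourbon = require('node-bourbon'); // Compass-style mixins, same names

sass.render({
    file: 'assets/sass/base.scss',
    includePaths: bourbon.includePaths, // resolves the Compass-style imports
    outputStyle: 'compressed'
}, function (err, result) {
    if (err) throw err;
    fs.writeFileSync('build/base.css', result.css);
});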

Just how fast? With Compass, our compile times were consistently around a minute and a half to two minutes. Taking care of spriting ourselves and using libsass for Sass compilation, we’re down to 5 seconds. When you deploy as often as we do at Flickr (in excess of 10 times a day), that adds up and turns into some huge savings!

What’s the Catch?

There isn’t one! Oh, okay. Maybe there are a few little ones. We’re pretty willing to swallow them though. Did you see that compile time?!

There are some differences between Ruby Sass and libsass, particularly with the @extend directive. We’re anticipating that these small kinks will continue to be ironed out as the port matures. Additionally, custom functions aren’t supported yet, so some extensibility is lost in moving from Ruby (although node-sass does support the image-url built-in, which is the only one we use anyway).

With everything taken into account, we’re counting down the days until we make this dream a reality and turn it on for our production builds.

Flickr flamily floto

Like what you’ve read and want to make the jump with us? We’re hiring engineers, designers and product managers in our San Francisco office. Find out more at flickr.com/jobs.