Flickr’s experience with iOS 9

In the last couple of months, Apple has released new features as part of iOS 9 that allow a deeper integration between apps and the operating system. Among those features are Spotlight Search integration, Universal Links, and 3D Touch for iPhone 6S and iPhone 6S Plus.

Here at Flickr, we have added support for these new features and we have learned a few lessons that we would love to share.

Spotlight Search

There are two different kinds of content that can be searched through Spotlight: the kind that you explicitly index, and the kind that gets indexed based on the state your app is in. To explicitly index content, you use Core Spotlight, which lets you index multiple items at once. To index content related to your app’s current state, you use NSUserActivity: when a piece of content becomes visible, you start an activity to make iOS aware of this fact. iOS can then determine which pieces of content are more frequently visited, and thus more relevant to the user. NSUserActivity also allows us to mark certain items as public, which means that they might be shown to other iOS users as well.

For a better user experience, we index as much useful information as we can right off the bat. We prefetch all the user’s albums, groups, and people they follow, and add them to the search index using Core Spotlight. Indexing an item looks like this:

// Create the attribute set, which encapsulates the metadata of the item we're indexing
CSSearchableItemAttributeSet *attributeSet = [[CSSearchableItemAttributeSet alloc] initWithItemContentType:(NSString *)kUTTypeImage];
attributeSet.title = photo.title;
attributeSet.contentDescription = photo.searchableDescription;
attributeSet.keywords = photo.keywords;
attributeSet.thumbnailData = UIImageJPEGRepresentation(photo.thumbnail, 0.98);

// Create the searchable item and index it.
CSSearchableItem *searchableItem = [[CSSearchableItem alloc] initWithUniqueIdentifier:[NSString stringWithFormat:@"%@/%@", photo.identifier, photo.searchContentType] domainIdentifier:@"FLKCurrentUserSearchDomain" attributeSet:attributeSet];
[[CSSearchableIndex defaultSearchableIndex] indexSearchableItems:@[ searchableItem ] completionHandler:^(NSError * _Nullable error) {
    if (error) {
        // Handle failures (e.g. log and retry later).
    }
}];

Since we have multiple kinds of data – photos, albums, and groups – we had to create an identifier that is a combination of its type and its actual model ID.

Many users will have a large amount of data to be fetched, so it’s important that we take measures to make sure that the app still performs well. Since searching is unlikely to happen right after the user opens the app (that’s when we start prefetching this data, if needed), all this work is performed by a low-priority NSOperationQueue. If we ever need to fetch images to be used as thumbnails, we request it with low-priority NSURLSessionDownloadTask. These kinds of measures ensure that we don’t affect the performance of any operation or network request triggered by user actions, such as fetching new images and pages when scrolling through content.
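
As a rough sketch of that prioritization (the queue, session, and URL variables here are illustrative, not Flickr’s actual code):

// A low-priority queue for prefetching and indexing the user's albums, groups, and contacts.
NSOperationQueue *prefetchQueue = [[NSOperationQueue alloc] init];
prefetchQueue.qualityOfService = NSQualityOfServiceUtility;

[prefetchQueue addOperationWithBlock:^{
    // Fetch the data and index it with Core Spotlight (see the indexing snippet above).
}];

// Thumbnails are fetched with low-priority download tasks so that
// user-triggered requests always win.
NSURLSessionDownloadTask *thumbnailTask =
    [[NSURLSession sharedSession] downloadTaskWithURL:thumbnailURL];
thumbnailTask.priority = NSURLSessionTaskPriorityLow;
[thumbnailTask resume];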

Flickr provides a huge amount of public content, including many amazing photos. If anybody searches for “Northern Lights” in Spotlight, shouldn’t we show them our best Aurora Borealis photos? For this public content – photos, public groups, tags and so on – we leverage NSUserActivity, with its new search APIs, to make it all searchable when viewed. Here’s an example:

CSSearchableItemAttributeSet *attributeSet = [[CSSearchableItemAttributeSet alloc] initWithItemContentType:(NSString *) kUTTypeImage];
// Setup attributeSet the same way we did before...
// Set the related unique identifier, so it matches to any existing item indexed with Core Spotlight.     
attributeSet.relatedUniqueIdentifier = [NSString stringWithFormat:@"%@/%@", photo.identifier, photo.searchContentType];
self.userActivity = [[NSUserActivity alloc] initWithActivityType:@"FLKSearchableUserActivityType"];
self.userActivity.title = photo.title;
self.userActivity.keywords = [NSSet setWithArray:photo.keywords];
self.userActivity.webpageURL = photo.photoPageURL;
self.userActivity.contentAttributeSet = attributeSet;
self.userActivity.eligibleForSearch = YES;
self.userActivity.eligibleForPublicIndexing = photo.isPublic;
self.userActivity.requiredUserInfoKeys = [NSSet setWithArray:self.userActivity.userInfo.allKeys];
[self.userActivity becomeCurrent];

Every time a user opens a photo, public group, location page, etc., we create a new NSUserActivity and make it current. The more often a specific activity is made current, the more relevant iOS considers it. In fact, the more often an activity is made current by any number of different users, the more relevant Apple considers it globally, and the more likely it will show up for other iOS users as well (provided it’s public).

Until now we’ve only seen half the picture. We’ve seen how to index things for Spotlight search; but when a user finally searches and taps on a result, how do we take them to the right place in our app? We’ll get to this a bit later; for now, suffice it to say that iOS will call application:continueUserActivity:restorationHandler: on our application delegate.

It’s important to note that if we want to make use of the userInfo in the NSUserActivity, iOS won’t give it back to us for free in this method. To get it, we have to make sure that we assigned an NSSet to the requiredUserInfoKeys property of our NSUserActivity when we created it. Apple’s documentation also tells us that if you set the webpageURL property when eligibleForSearch is YES, you need to make sure it points to the web URL that actually corresponds to your content; otherwise you might end up with duplicate results in Spotlight (Apple crawls your site for content to surface in Spotlight, and if it finds the same content at a different URL it will treat it as a different piece of content).

Universal Links

In order to support Universal Links, Apple requires that every domain supported by the app host an “apple-app-site-association” file at its root. This is a JSON file that describes which relative paths in your domains can be handled by the app. When a user taps a link from another app in iOS, and your app is able to handle that domain for that specific path, iOS will open your app and call application:continueUserActivity:restorationHandler:. Otherwise your application won’t be opened – Safari will handle the URL instead. The basic structure of the file looks like this:

    "applinks": {
        "apps": [],
        "details": {
            "": {
                "paths": [

This file has to be hosted on HTTPS with a valid certificate. Its MIME type needs to be “application/pkcs7-mime.” No redirects are allowed when requesting the file. If the only intent is to support Universal Links, no further steps are required. But if you’re also using this file to support Handoffs (introduced in iOS 8), then your file has to be CMS signed by a valid TLS certificate.

At Flickr, we have a few different domains. Each of them must provide its own JSON association file, whether or not their contents differ. In our case, one of our domains is only used for short URLs and supports a different set of paths; hence, its “apple-app-site-association” file is different from the others.

On the client side, only a few steps are required to support Universal Links. First, “Associated Domains” must be enabled under the Capabilities tab of the app’s target settings. Then, for each supported domain, an entry of the form “applinks:<domain>” must be added. Here is how it looks for Flickr:

Associated Domains entries for the Flickr app in Xcode

That is it. Now if someone receives a text message with a Flickr link, she will jump right to the Flickr app when she taps on it.

Deep linking into the app

Great! We have Flickr photos showing up as search results and Flickr URLs opening directly in our app. Now we just have to get the user to the proper place within the app. There are different entry points into our app, and we need to make the implementation consistent and avoid code duplication.

iOS has been supporting deep linking for a while already and so has Flickr. To support deep linking, apps could register to handle custom URLs (meaning a custom scheme, such as myscheme://mydata/123). The website corresponding to the app could then publish links directly to the app. For every custom URL published on the Flickr website, our app translates it into a representation of the data to be shown. This representation looks like this:

@interface FLKRoute : NSObject

@property (nonatomic) FLKRouteType type;
@property (nonatomic, copy) NSString *identifier;


It describes the type of data to present, and a unique identifier for that type of data.

- (void)navigateToRoute:(FLKRoute *)route
{
    switch (route.type) {
        case FLKRouteTypePhoto:
            // Navigate to the photo screen.
            break;
        case FLKRouteTypeAlbum:
            // Navigate to the album screen.
            break;
        case FLKRouteTypeGroup:
            // Navigate to the group screen.
            break;
        // ...
    }
}

Now, all we have to do is to make sure we are able to translate both NSURLs and NSUserActivity objects into FLKRoute instances. For NSURLs, this translation is straightforward. Our custom URLs follow the same pattern as the corresponding website URLs; their paths correspond exactly. So translating both website URLs and custom URLs is a matter of using NSURLComponents to extract the necessary information to create the FLKRoute object.
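
As a rough sketch of that translation (simplified; FLKRouteTypeFromPathComponent is a hypothetical helper, and real Flickr paths have more variants than the two-component form shown here):

FLKRoute * FLKRouteFromURL(NSURL *url)
{
    // Works for both website URLs and custom-scheme URLs, since their paths match.
    NSURLComponents *components = [NSURLComponents componentsWithURL:url resolvingAgainstBaseURL:NO];
    NSMutableArray<NSString *> *pathComponents = [[components.path pathComponents] mutableCopy];
    [pathComponents removeObject:@"/"]; // drop the leading slash component

    if ([pathComponents count] < 2) { // we need at least a type and an identifier
        return nil;
    }

    FLKRoute *route = [FLKRoute new];
    route.type = FLKRouteTypeFromPathComponent([pathComponents firstObject]); // hypothetical mapping helper
    route.identifier = pathComponents[1];
    return route;
}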

As for NSUserActivity objects passed into application:continueUserActivity:restorationHandler:, there are two cases. One arises when the NSUserActivity instance was used to index a public item in the app. Remember that when we created the NSUserActivity object we also assigned its webpageURL? This is really handy because it not only uniquely identifies the data we want to present, but also gives us an NSURL object, which we can handle the same way we handle deep links or Universal Links.

The other case is when the NSUserActivity originated from a CSSearchableItem; we have some more work to do in this case. We need to parse the identifier we created for the item and translate it into a FLKRoute. Remember that our item’s identifier is a combination of its type and the model ID. We can decompose it and then create our route object. Its simplified implementation looks like this:

FLKRoute * FLKRouteFromSearchableItemIdentifier(NSString *searchableItemIdentifier)
{
    NSArray *routeComponents = [searchableItemIdentifier componentsSeparatedByString:@"/"];
    if ([routeComponents count] != 2) { // type + id
        return nil;
    }

    // Handle the route type.
    NSString *searchableItemContentType = [routeComponents firstObject];
    FLKRouteType type = FLKRouteTypeFromSearchableItemContentType(searchableItemContentType);

    // Get the item identifier.
    NSString *itemIdentifier = [routeComponents lastObject];

    // Build the route object.
    FLKRoute *route = [FLKRoute new];
    route.type = type;
    route.identifier = itemIdentifier;
    return route;
}

Now we have all our bases covered and we’re sure that we’ll drop the user in the right place when she lands in our app. The final application delegate method looks like this:

- (BOOL)application:(nonnull UIApplication *)application continueUserActivity:(nonnull NSUserActivity *)userActivity restorationHandler:(nonnull void (^)(NSArray * __nullable))restorationHandler
{
    FLKRoute *route = nil;
    NSURL *url = nil;
    NSString *activityType = [userActivity activityType];

    if ([activityType isEqualToString:CSSearchableItemActionType]) {
        // Searchable item indexed with Core Spotlight.
        NSString *itemIdentifier = [userActivity.userInfo objectForKey:CSSearchableItemActivityIdentifier];
        route = FLKRouteFromSearchableItemIdentifier(itemIdentifier);
    } else if ([activityType isEqualToString:@"FLKSearchableUserActivityType"] ||
               [activityType isEqualToString:NSUserActivityTypeBrowsingWeb]) {
        // Searchable item from NSUserActivity or a Universal Link.
        url = userActivity.webpageURL;
        route = [url flk_route];
    }

    if (route) {
        [self.router navigateToRoute:route];
        return YES;
    } else if (url) {
        [[UIApplication sharedApplication] openURL:url]; // Fail gracefully.
        return YES;
    } else {
        return NO;
    }
}

3D Touch

With the release of iPhone 6S and iPhone 6S Plus, Apple introduced a new gesture that can be used with your iOS app: 3D Touch. One of the coolest features it has brought is the ability to preview content before pushing it onto the navigation stack. This is also known as “peek and pop.”

You can easily see how this feature is implemented in the native Mail app. But you won’t always have a simple UIView hierarchy like Mail’s UITableView, where a tap anywhere on a cell opens a UIViewController. Take Flickr’s notifications screen, for example:


If the user taps on a photo in one of these cells, it will open the photo view. But if the user taps on another user’s name, it will open that user’s profile view. Previews of these UIViewControllers should be shown accordingly. But the “peek and pop” mechanism requires you to register a delegate on your UIViewController with registerForPreviewingWithDelegate:sourceView:, which means that you’re working in a much higher layer. Your UIViewController’s view might not even know about its subviews’ structures.
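
Registering for previews is itself a single call; a minimal sketch, using the view controller’s own view as the source view and checking that 3D Touch is available, looks like this:

- (void)viewDidLoad
{
    [super viewDidLoad];

    // Only register on devices that support 3D Touch.
    if (self.traitCollection.forceTouchCapability == UIForceTouchCapabilityAvailable) {
        [self registerForPreviewingWithDelegate:self sourceView:self.view];
    }
}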

To solve this problem, we used UIView’s hitTest:withEvent: method. As the documentation describes, it gives us the “farthest descendant of the receiver in the view hierarchy.” But the view returned by hitTest: won’t necessarily be the UIView we want. So we defined a protocol, FLKPeekAndPopTargetView, that must be implemented by any UIView subclass that wants to support peeking and popping from it. That view is then responsible for returning the model used to populate the UIViewController that the user is trying to preview. If the view doesn’t implement this protocol, we query its superview, and we keep walking up the hierarchy until a conforming view is found or there are no more superviews left. This is how the logic looks:

+ (id)modelAtLocation:(CGPoint)location inSourceView:(UIView *)sourceView
{
    // Walk up the hit-test tree until we find a peek-and-pop target.
    UIView *testView = [sourceView hitTest:location withEvent:nil];
    id model = nil;

    while (testView && !model) {
        // Check if the current testView conforms to the protocol.
        if ([testView conformsToProtocol:@protocol(FLKPeekAndPopTargetView)]) {
            // Translate the location to the view's coordinates.
            CGPoint locationInView = [testView convertPoint:location fromView:sourceView];
            // Get the model from the peek-and-pop target.
            model = [((id<FLKPeekAndPopTargetView>)testView) flk_peekAndPopModelAtLocation:locationInView];
        } else {
            // Move up the view tree to the next view.
            testView = testView.superview;
        }
    }

    return model;
}

With this code in place, all we have to do is implement the UIViewControllerPreviewingDelegate methods in our delegate, perform the hit test, and get the model from the FLKPeekAndPopTargetView implementor. Here’s the final implementation:

- (UIViewController *)previewingContext:(id<UIViewControllerPreviewing>)previewingContext
              viewControllerForLocation:(CGPoint)location
{
    id model = [[self class] modelAtLocation:location inSourceView:previewingContext.sourceView];
    UIViewController *viewController = nil;

    if ([model isKindOfClass:[FLKPhoto class]]) {
        viewController = // ... UIViewController that displays a photo.
    } else if ([model isKindOfClass:[FLKAlbum class]]) {
        viewController = // ... UIViewController that displays an album.
    } else if ([model isKindOfClass:[FLKGroup class]]) {
        viewController = // ... UIViewController that displays a group.
    } // ...

    return viewController;
}

- (void)previewingContext:(id<UIViewControllerPreviewing>)previewingContext
     commitViewController:(UIViewController *)viewControllerToCommit
{
    [self.navigationController pushViewController:viewControllerToCommit animated:YES];
}

Last but not least, we added support for Quick Actions. Now the user has the ability to quickly jump into a specific section of the app just by pressing down on the app icon. Defining these Quick Actions statically in the Info.plist file is an easy way to implement this feature, but we decided to go one step further and define these options dynamically. One of the options we provide is “Upload Photo,” which takes the user to the asset picker screen. But if the user has Auto Uploadr turned on, this option isn’t that relevant, so instead we provide a different app icon menu option in its place.

Here’s how you can create Quick Actions:

NSMutableArray<UIApplicationShortcutItem *> *items = [NSMutableArray array];
[items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemFeed"
                                                  localizedTitle:NSLocalizedString(@"Feed", nil)]];
[items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemTakePhoto"
                                                  localizedTitle:NSLocalizedString(@"Upload Photo", nil)] ];

[items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemNotifications"
                                                  localizedTitle:NSLocalizedString(@"Notifications", nil)]];
[items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemSearch"
                                                  localizedTitle:NSLocalizedString(@"Search", nil)]];
[[UIApplication sharedApplication] setShortcutItems:items];
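
To make the menu dynamic, the “Upload Photo” item above can be swapped out based on the user’s settings. A simplified sketch (the autoUploadEnabled flag and the FLKShortcutItemAutoUploadStatus type are hypothetical names used for illustration):

if (autoUploadEnabled) {
    // Auto Uploadr already handles uploads, so offer a different option in this slot.
    [items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemAutoUploadStatus"
                                                      localizedTitle:NSLocalizedString(@"Auto Uploadr", nil)]];
} else {
    [items addObject:[[UIApplicationShortcutItem alloc] initWithType:@"FLKShortcutItemTakePhoto"
                                                      localizedTitle:NSLocalizedString(@"Upload Photo", nil)]];
}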

And this is what it looks like when the user presses down on the app icon:


Finally, we have to handle where to take the user after she selects one of these options. This is yet another place where we can make use of our FLKRoute object. To handle the app opening from a Quick Action, we need to implement application:performActionForShortcutItem:completionHandler: in the app delegate.

- (void)application:(UIApplication *)application performActionForShortcutItem:(UIApplicationShortcutItem *)shortcutItem completionHandler:(void (^)(BOOL))completionHandler {
    FLKRoute *route = [shortcutItem flk_route];
    [self.router navigateToRoute:route];

    // Tell the system whether we were able to handle the action.
    completionHandler(route != nil);
}

There is a lot more to consider when shipping these features with an app. For example, with Flickr, there are various platforms the user could be using. It is important to make sure that the Spotlight index is up to date to reflect changes made anywhere. If the user has created a new album and/or left a group from his desktop browser, we need to make sure that those changes are reflected in the app, so the newly-created album can be found through Spotlight, but the newly-departed group cannot.
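
Keeping the index fresh boils down to indexing the new items (as shown earlier) and deleting the identifiers that no longer apply. A minimal sketch of the deletion side, assuming removedItemIdentifiers has already been computed from the changes we detected:

[[CSSearchableIndex defaultSearchableIndex] deleteSearchableItemsWithIdentifiers:removedItemIdentifiers
                                                                completionHandler:^(NSError * _Nullable error) {
    if (error) {
        // Handle failures, e.g. retry on the next sync.
    }
}];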

All of this work should be invisible to the user; it shouldn’t hog the device’s resources or degrade the overall user experience. That requires some care around threading and network priorities: network requests for UI-relevant data should not be blocked just because we have other network requests in flight. With some careful prioritizing, using NSOperationQueue and NSURLSession, we managed to accomplish this with no major problems.

Finally, we had to consider privacy, one of the pillars of Flickr. We had to be extremely careful not to violate any of the user’s settings. We’re careful to never publicly index private content, such as photos and albums. Also, photos marked “restricted” are not publicly indexed since they might expose content that some users might consider offensive.

In this blog post we went into the basics of integrating iOS 9 Search, Universal Links, and 3D Touch in Flickr for iOS. In order to focus on those features, we simplified some of our examples to demonstrate how you could get started with them in your own app, and to show what challenges we faced.


Like this post? Have a love of online photography? Want to work with us? Flickr is hiring mobile, back-end and front-end engineers, in our San Francisco office. Find out more at

Perceptual Image Compression at Flickr

Archie Russell, Peter Norby, Saeideh Bakhshi

At Flickr our users really care about image quality.  They also care a lot about how responsive our apps are.  Addressing both of these concerns simultaneously is challenging;  higher quality images have larger file sizes and are slower to transfer.   Slow transfers are especially noticeable on mobile devices.   Flickr had historically aimed for high quality at the expense of larger files, but in late 2014 we implemented a method to both maintain image quality and decrease the byte-size of the images we serve to users.   As image appearance is very important to our users,  we performed an extensive user test before rolling this change out.   Here’s how we did it.

Background:  JPEG Quality Settings

Fig 1.    JPEG settings vs file size for a test image.

JPEG compression has several tuneable knobs. The q-value is the best known of these; it adjusts how much fine spatial detail is stored, and a higher q-value typically keeps more detail. However, as the q-value gets very close to 100, file size increases dramatically, usually without improving image appearance.

If file size and app performance aren’t an issue, dialing up the q-value is an easy way to get really nice-looking images; this is what Flickr has done in the past. And if appearance isn’t very important, dialing down the q-value is a viable option. But if you want both, you’re kind of stuck. Additionally, q-value isn’t one-size-fits-all: some images look great at q-value 80 while others don’t.

Another commonly adjusted setting is chroma subsampling, which alters the amount of color information stored in a JPEG file. With a setting of 4:4:4, the two chroma (color) channels in a JPEG have as much information as the luminance channel. In an image with a setting of 4:2:0, each chroma channel has only a quarter as much information as in a 4:4:4 image.

 q=96,  chroma=4:4:4 (125KB) q=70, chroma=4:4:4 (67KB)
q=96, chroma=4:2:0 (62KB)  q=70, chroma=4:2:0 (62KB)

Table 1: The same JPEG stored at different quality and chroma levels. The upper-left image is saved at a high quality and chroma level; notice the color and detail in the folds of the red flag. The lower-right image has the lowest quality; notice the artifacts along the right edges of the red flag.

Perceptual JPEG Compression

Ideally we’d have an algorithm which automatically tuned all JPEG parameters to make a file smaller, but which would limit perceptible changes to the image.  Technology exists that attempts to do this and can decrease image file size by 30-50%. This compression ratio is highly dependent on image content and dimensions.

compressed: 112KB non-compressed: 224KB

Fig 2. Compressed cropped JPEG is 50% smaller than not-compressed cropped JPEG, above, with no obvious defects.  Compression ratio is similar for a compressed 2048-pixel wide JPEG (475KB) of the entire scene and its corresponding not-compressed JPEG (897KB). 

We were pleased with perceptually compressed images in non-structured examinations. The compressed images were smaller and nearly indistinguishable from their sources. But we wanted to really quantify how well the technology worked before considering incorporating it into Flickr. The standard computational tools for evaluating compression, such as SSIM, are fairly simplistic and don’t do a great job of modeling how a user sees things. To really evaluate this technology, we had to use a better measure of perceptibility: human minds.

The Gamified Taste Test

To test whether our image compression would impact user perception of image quality, we put together a “taste test.” The taste test is constructed as a game with multiple rounds in which users look at both compressed and uncompressed images. Users accumulate points the longer they play, and get more points for doing well at the game. We maintained a leaderboard to encourage participation and used only internal testers. The game’s test images came from a diverse collection of 250 images contributed by Flickr staff. The images came from a variety of cameras and included a number of subjects from photographers with varying skill levels.

sampling of images used in taste test
Fig 3. A sampling of images used in our taste test.

In each round, our test code randomly selects a test image and presents two variants of this image side by side. 50% of the time we present the user with two identical images; the rest of the time we present one compressed image and one uncompressed image. We ask the tester if the two images look the same or different; we’d expect a user choosing randomly, or a user unable to distinguish the two cases, to answer correctly about half the time. We randomly swap the location of the compressed image to compensate for any user bias toward the left or the right. If testers choose correctly, they are presented with a second question: “Which image did you prefer, and why?”

two kittens in a video game
Fig 4. Screenshot of taste test.

Our test displays images simultaneously to prevent testers from noticing a longer load time for the larger, non-compressed image. The images are presented with either 320, 640, or 1600 pixels on their longest side. The 320px and 640px images are shown for 12 seconds before being dimmed out; the intent behind this detail is to represent how real users interact with our images. The 1600px images stay on screen for 20 seconds, as we expect larger images to be viewed for longer periods of time by real users. We award 100 points per round, regardless of whether a tester chose correctly, and award a bonus of 400 points when a tester correctly identifies whether the images were identical or different. We update the tester’s score every five tests so that the user perceives an increasing score without being rewarded immediately for any particular behavior.

Taste Test Outcome and Deployment

We ran our taste test for two weeks and analyzed our results. Although we let users play as long as they liked, we skipped the first result per user as a “warm-up” and considered only the subsequent ten results; this limited the potential for users training themselves to spot compression artifacts. We disregarded users who had fewer than eleven results.

images                               total results   # labeled “identical” by tester   % labeled “identical” by tester
two identical images                 368             253                               68.8%
one compressed, one non-compressed   352             238                               67.6%

Table 2.   Taste test results.   Testers select “identical” at nearly the same rate, whether the input is identical or not.

When our testers were presented with two identical images, they thought the images were identical only 68.8% of the time(!), and when presented with a compressed image next to a non-compressed image,  our testers thought the images were identical slightly less often:  67.6% of the time.  This difference was small enough for us,  and our statisticians told us it was statistically insignificant.  Our image pairs were so similar that multiple testers thought all images were identical and reported that the test system was buggy. We inspected the images most often labeled different, and found no significant artifacts in the compressed versions.

So even in this side-by-side test, perceptual image compression is only barely noticeable. As the Flickr website never shows compressed and uncompressed versions of an image at the same time, and the use of compression had large benefits in storage footprint and site performance, we elected to go forward.

At the beginning of 2014 we silently rolled out perceptual-based compression on our image thumbnails (we don’t alter the “original” images uploaded by our users).  The slight changes to image appearance went unnoticed by users, but user interactions with Flickr became much faster,  especially for users with slow connections, while our storage footprint became much smaller.  This was a best-case scenario for us.

Evaluating perceptual compression was a considerable task, but it gave us the confidence we needed to apply this compression in production for our users. This marked the first time Flickr had adjusted image settings in years, and it was fun.
High Score List
Fig 5.  Taste test high score list


After eighteen months of perceptual compression at Flickr, we adjusted our settings slightly to shrink images by an additional 15%. For our users on mobile devices, 15% fewer bytes per image makes for a much more responsive experience. We ran a taste test on this newer setting, and users were able to spot our compression slightly more often than with our original settings. When presented with a pair of identical images, our testers declared the images identical 65.2% of the time; when presented with different images, our testers declared the images identical 62% of the time. It wasn’t as imperceptible as our original approach, but we decided it was close enough to roll out.

Boy were we wrong! A few very vocal users spotted the compression and didn’t like it at all. The Flickr Help Forum had a very lively thread, which Petapixel picked up. We beat our heads against the wall, considered our options, and came up with a middle path between our initial and follow-on approaches, giving us smaller, faster-to-load files while still maintaining the appearance our users expect.

Through our use of perceptual compression,  combined with our use of on-the-fly resize and COS,  we’ve been able to decrease our storage footprint dramatically, while simultaneously improving user experience. It’s a win all around but we’re not done yet — we still have a few tricks up our sleeves.

The Data Freshener

Hello Kitty car air freshener
So fresh



You may have noticed some changes in Flickr a couple months back. Like, half the site changed. 95% even, by some metrics. Some say CHANGE IT BACK! while others welcome change. Whatever your thoughts, the changes are here, and they mean things. For example, they mean new visual design and better usability. They mean a faster site. Unfortunately, up until recently, they also meant more stale data. Yuck.


Why? What? Well…here’s the deal. We have a new-ish frontend stack we’ve been using for the past couple years now. It’s an isomorphic single-page application, runs on node.js, and is generally awesome. We call it Reboot.

hi there / i am the computer

In the World of Reboot, we treat data with kid gloves. We <3 data. We never want to give it up, never want to let it down. Once we pull data from our APIs, we store the fetched data in your browser so that we don’t have to fetch it again the next time it’s needed. This means faster page loads and faster navigation, and less API traffic (and thus a more stable and scalable API). The data cached in your browser exists as long as the current Reboot session — until you refresh or leave Reboot for a non-Rebooted page.

However, this also meant that data could become stale. You change the date taken of your photo, someone else adds a comment, you navigate to a page with cached data…and you don’t see the changes. Wat? Yeah. So, this was not a huge problem until we moved lots of pages onto Reboot in the beginning of May. From that point forward, most Flickr user sessions have spent their entirety on Reboot, feeding off the same stale loaves of cached data.


The thinking (design / prototypes)

We considered a number of possibilities for freshening up data during a user session. A brief history of the strategies we sampled, and their results:

1. Refresh on update

Ice Tea

The first stab focused on updating data locally after it was changed by the user. Most of our simpler use cases already updated as expected, but some trickier cases with indirect relationships did not. For example, changing the date taken of a photo updated the data model for the photo, but deleting a photo did not necessarily ensure the photo was removed from all the cached albums, groups, and galleries to which it belonged. (Note that the photo was removed correctly from the backend, just not from the cached representation of those entities on the client.)

Cleaning up these relationships using change events between models helped, but didn’t solve all our problems. When someone outside of the local session (read: another user) changed data, it would not reflect in the current session. The only way to catch changes from outside the current session was to be more aggressive about evicting models.

2. Nuclear option

Atomic Bomb Test

The pendulum swung all the way in the other direction — instead of surgical removal of data models we knew to be out-of-date, what would happen if we removed all cached data on every navigation? This prototype was quick to build, and incredibly destructive. By doing this, all our cached data always remained as fresh as could be, but we essentially reverted to Web 1.0 — with the exception of the Reboot framework, everything was reloaded on every page.

Not surprisingly, this blew up API traffic (locally only! did not unleash that disaster at scale), and inflated page load times like a Jeff Koons sculpture. It did give us some baseline timing metrics we could point to as worst-case scenarios, however. The next step was to swing the pendulum back toward the middle — to a carefully-knitted solution that would preserve fast page loads and navigation, while ensuring the freshest data we could serve up.

3. Refetch on navigate


At this point, our challenge was to find a solution that would keep navigation fast, API traffic slim, and pick up all changes to session data, whether local or remote. We ended up with a solution we call “refetching”: evicting and requesting new data models as the model is needed by the application. But when?
We could refetch periodically or on a user action; we determined that the best time to trigger a refetch was on navigation — when the user navigates, cached models become eligible for refetching. Specifically, when the user navigates between sections of the site, refetching is triggered. This proved to be the happiest medium between speed and freshness.

A high-level outline of how the refetching strategy works:

  • The user loads a page; data are requested from the API, and models are cached. As new models are created, they’re marked as being fresh.
  • The user navigates to another site section (e.g. Photostream → Search); all freshness marks are removed from all models. They’re now all eligible for refetching.
  • As Reboot builds the new page, it requests data models from the cache. Since they no longer have their seal of freshness, they are refetched, and marked as fresh once retrieved and cached.

One important note — refetching is not triggered on browser back/forward navigation. Users expect near-immediate navigation, thanks to browser caching, when navigating to already-viewed content. Therefore, we refetch only when the user clicks a link to navigate to a new site section.

4. Miscellany

There were a couple other options we considered and rejected from the start, but they’re worth mentioning here.

One was a TTL (time-to-live) algorithm, commonly used in caching applications. TTL algorithms expire data and evict it from the cache a certain amount of time after it was written or last updated. The arbitrary nature of TTL would mean that users would sometimes have fresh data and sometimes stale; it would be fresh more often than with no solution at all, but freshness would vary arbitrarily and would not result in much of an improvement to the user experience.

The other was to write an algorithm that tracks the amount of time since a data model was last accessed, and refetch when it grows too old. While this sounded interesting at first, it has the same flaw as a standard TTL algorithm — freshness becomes arbitrary. It’s also more complex to implement, and might end up not being worth the complexity.

The doing (implementation)

So that was it! Refetch on navigate, all done. Right?….of course not. With the general strategy in place, the devil started sneaking around in all the details. Some of the highlights:


Exemptions

It proved not to be the best idea to evict on every navigation. For example, in Reboot we often preload photo metadata models on pages with lists of photos, in order to make navigation into the photo page snappy. The refetch setup therefore has an exemption config that allows us to easily retain models when navigating into, away from, or between specific site sections.

Child models

We often have parent-child associations between data models. For example, the data model for a photo has a reference to a data model for the author of the photo. When the photo model is refetched, the person model must be refetched as well. This means the function doing the eviction and refetching has to recurse through all child models.


Collections of models

An issue similar to child models above, but more complex, is the case of a model containing a list of other models. For example, the data model for a person’s photostream contains a list of photo models.

What made this particularly tricky is pagination and filtering — say you load the first 2 pages of your photostream, set your view filter to private, jump to page 5, switch the view to “Date taken”, and navigate away and back to your photostream…imagine the mess of different models with partially-loaded collections. Evicting one parent model, and its children, might evict photo models from the collection within another, without properly refetching. The solution here actually lay in the controller responsible for fetching pages: if a requested page of models is not already completely in-cache, a refetch will always happen to ensure we have all the data, in its freshest state.

Refetch only once per page view

Critical to the refetch-on-navigation strategy is to refetch only once per navigation. This was not too difficult, but essential to get right. We accomplish this by adding a flag when a model is initially fetched and upserted into the cache. When navigating to a new, non-exempt site section, all those flags are cleared, and any model requested by the new page will be refetched. When refetched, the model is again upserted into the cache and marked as fresh, until the next navigation.

But did it fresh?

Go on without me

With the thinking and the doing out of the way, it was time to push all this to production. Because these changes are essentially pulling the rug out from underneath the data layer on every navigation, we had to tread very carefully in order to prevent any negative impact to the end user experience.

We did very thorough manual and automated testing across all of Reboot. We left the feature turned on for staff users for a while, to be able to respond to any bug reports. Finally, the time came to test on Real People. There were three things we needed to keep an eye on: errors (of course), impact on page navigation timing, and API traffic. Since refetching implies more requests for data, we needed to be sure that we were keeping the user experience smooth and fast, and also that we weren’t blowing up our data centers.

All In
All in

In order to get a good read on these things, though, we had to go all in. Letting in just a small percentage of users would not give reliable numbers for timing or traffic impacts, due to the noise inherent in relatively small sample sizes. So, we did something unusual: we turned on refetching for all users for a short period of time. We flipped on refetching and kept an eagle eye on our stats for 2 hours, then reverted; then, we took a careful look at the aggregated data to see how the experiment went.

Surprisingly, the impact on both timing and traffic was relatively low. After some thought, we decided this is most likely because the changes disproportionately impact people on long sessions, say a Flickr tab open for hours or days. Most people don’t hang around that long; they come, they go. Also, the photo page represents north of 90% of our page views, and is exempt from refetching (see Exemptions above).

So where did we end up? A negligible bump in navigation timing and API traffic, and fresher data for all. Perhaps an anticlimactic resolution, but the story we’ve heard today outlines a serious consideration for anyone building an application with a data caching layer: keep in mind from the beginning how you plan to deal with stale data, but in a way that keeps all the other benefits of a single-page application.

#CCC is a breadcat
Busting through staleness. Yep.

Real-time Resizing of Flickr Images Using GPUs

At Flickr we work with a huge number of photos. Our users upload over 27 million photos a day, and our total collection has over 12 billion photos. This is fantastic! As usage grows, we are always looking for ways to use our storage more efficiently. Recently our storage team wrote about some new commodity storage technology now in use at Flickr which increases efficiency. But we also looked into how much data we store for each photo. In the past we stored many sizes of every photo to make serving fast. We wanted to challenge that model and find the minimal set of data to store.

Thumbnail Footprint Reduction

One of our biggest opportunities for byte per photo improvement is through reduction in the footprint of Flickr’s “thumbnails”. Thumbnail is a bit of a misnomer at Flickr; our thumbnails are as large as 2048 pixels on their longest side, so at Flickr we usually refer to these as resizes.  We create these resizes in order to provide a consistent,  fast experience for our users over a variety of use cases.

Different sizes are used in different contexts. From left to right: the Camera Roll uses small thumbnails to enable fast navigation through many photos; our photo page uses our largest, most detailed sizes; search uses sizes in between these two extremes. Red panda photos by Mathias Appel.

The selection of sizes has grown semi-organically over the years, and all told, we serve eleven different resizes per photo which, in sum, use nearly as much storage as the original photo. Almost 90% of this storage is held in the handful of resizes 640px and larger, so we targeted our efforts at eliminating some of these sizes.

Left: Distribution of byte size by resize dimension. Storage is concentrated in images with largest dimensions. Right: size distribution after largest sizes eliminated.

A Few Approaches

A simple approach to this problem would be to just stop offering some of the larger sizes. For instance, we could drop the 1600px image from our API and require the design to adjust. However, this requires compromises that we didn’t want to take on. Instead we took on a pretty ambitious goal: maintain our largest resize, usually 2048px wide, as a source image and create any other moderate or large-sized resizes on the fly from this source, without sacrificing image quality or significantly affecting performance. Using the original uploaded photo as a resize source image was impractical, as originals can be very large and exist in a variety of formats.

Sounds easy, right? We already resize images when users upload, so why not just use that same technology when serving? Well, almost. The problem with the naive approach is that high-quality resizing of JPEGs is a lot slower than is widely known. A tool we use frequently, GraphicsMagick, produces beautiful images but takes over 225ms to resize a 2048px JPEG down to 1600px, depending on quality settings. This is slow enough that this method would impact user experience, and would require many CPUs to handle our load. Ymagine, a high-performance CPU-based tool we’ve open sourced, is twice as fast as GraphicsMagick(!). We use Ymagine extensively on smaller images, but for the large sizes we’re targeting we needed even more performance. A GPU-based solution ultimately filled our needs.

Our GPU-based Solution

We created a tier of dedicated resize servers, each with a GPU co-processor. Each of these boards has two GPUs, each with 1500+ “cores” running at just under 1GHz. These cores aren’t anywhere near as performant as a CPU core, but there are many of them. We tested a range of server-grade boards to find the best-performing type for our workload. Many manufacturers offer consumer-grade boards with incredible specifications and lower price points, but these lack server-grade cooling and other features such as ECC RAM. One member of our team had experience using these lower-grade boards in a previous application and recommended against it.

Resize system architecture

On these resize servers we run a fairly vanilla Apache with a plugin written in C++. This server responds to resize requests, reads our source image from disk into shared memory, and hands requests off to persistent resize daemons that do all communication with our GPUs. A daemon-type approach is necessary due to a somewhat lengthy initialization process with our GPUs.

Our resize daemons transfer JPEGs from shared memory to GPU device memory. Once there, the real image processing takes place: the JPEGs are decoded, cropped, sharpened, resized, re-sharpened as needed, re-encoded as JPEGs, and finally transferred back to shared memory. From shared memory, our Apache module returns the resized JPEG to the caller.

A simple resize pipeline. Post-sharpening overcomes fuzziness introduced when downscaling.

There are several accepted resize algorithms, but to retain the Flickr “look,” we implemented in CUDA the same Lanczos resize and kernel sharpening algorithms that we’ve used for years. This had the added benefit of letting us directly compare images generated through GraphicsMagick and our GPU-based code.


With significant optimization, this code is able to resize our 2048px JPEGs to 1600px in under 16ms. This is more than 15x faster than GraphicsMagick and nearly 10x faster than Ymagine.  Resizes from 2048px to 640px take under 10ms. Equally noteworthy,  at peak load,  each resize server can perform over 300 resizes per second.

Performance of different resize approaches.

Although these timings are quite fast, the source image for our resizes is larger, byte-wise, than the images we are resizing to, which requires additional I/O. For example, a typical 2048px source JPEG is roughly 600kB, while our typical 1024px JPEGs are just under 200kB. This difference in size leads to roughly 35ms of additional I/O time per resize.

Taking it slow

As our GPU code is new and images are our most important product, this change carries some risk. We’ve addressed this with extensive testing, progressive rollout and provisions for rollback. We also used some insights into our user behavior to roll this solution out in a very controlled manner.


This system is currently in production and as we roll it out more fully, has the potential to cut the resize footprint of the majority of our photos by 50%, with negligible impact on performance and image appearance. We also have the ability to apply this same footprint reduction technique to images uploaded in the past, which has the potential to reduce our storage growth to zero for a significant period of time.


This project would not have been possible without the hard work of Peter Norby, Tague Griffith, John Ko and many others.

Introducing: Flickr PARK or BIRD

park OR bird
Zion National Park Utah by Les Haines Creative Commons License Secretary Bird by Bill Gracey Creative Commons License

tl;dr: Check it out at!

We at Flickr are not ones to back down from a challenge. Especially when that challenge comes in webcomic form. And especially when that webcomic is xkcd. So, when we saw this xkcd comic we thought, “we’ve got to do that”:

Creative Commons License

In fact, we already had the technology in place to do these things.  Like the woman in the comic says, determining whether a photo with GPS info embedded into it was taken in a national park is pretty straightforward. And, the Flickr Vision team has been working for the last year or so to be able to recognize more than 1000 things in images using deep convolutional neural nets. Incidentally, one of the things we’re pretty good at recognizing is birds!

We put those things together, and thus PARK or BIRD was born!

Recognizing Stuff in Images with Deep Networks

The thing we’re really excited to show off with PARK or BIRD is our image recognition technology. To recognize 1000+ things, we employ a deep convolutional neural network similar to the one depicted below.


This model transforms an input image into a representation in which different objects and scenes are easily distinguishable by a simple binary classification algorithm, like an SVM. It does this by passing the image through a series of layers, where each layer computes a function of the output of the layer below it.

Each successive one of these layers, after training on millions of images, has learned to recognize higher- and higher-level features of images and the ways these features go together to form different objects and scenes. For example, the first layer might recognize the most basic image features, such as short straight lines, corners, and small circular arcs. The next layer might recognize higher level combinations of those features, such as circles or other basic shapes. Further layers might recognize higher-level concepts, like eyes and beaks, and even further ones might recognize heads and wings. For an example of what this looks like, check out Figure 2 in this paper by Matt Zeiler and Rob Fergus.

As the image passes through these layers, they are “activated” in different ways depending on the features they’ve seen in the input image, and at the top of this network—after the image is transformed by the bottom layer, and that transformation of the image is transformed by the next layer, and that transformation of the transformation of the image is transformed by the next layer, and so on— a short floating-point vector summarizing all of the various activations at each layer is output. We pass this floating-point vector into more than 1000 binary classifiers, each of which is trained to give us a yes/no answer to identify a specific object/scene class. And, of course, one of those classes is birds!

The Flickr Vision team is already applying this deep network to Flickr photos to help people more easily find what they’re looking for via Flickr search, and we plan to integrate it into Flickr in other cool ways in the future. We’re also working on other innovative computer vision and image recognition technologies that will make it easier for Flickr members to find and organize their photos.


The Flickr Vision and Search team is awesome and PARK or BIRD is built upon technologies that we all pitched in on. Here we all are (at least most of us), in all our beautiful glory. Thanks Vision/Search! Thanks also to Stephen Woods, Bart Thomee, John Ko, Mike Shema, and Sean Perkins, all of whom provided a lot of help getting PARK or BIRD off the ground.

Flickr flamily floto

If this all sounds like a challenge you’re interested in helping out with, you should join us! Flickr is hiring engineers, designers and product managers in our San Francisco office. Find out more at

Performance improvements for photo serving

We’ve been working to make Flickr faster for our users around the world. Since the primary photo storage locations are in the US, and information on the internet travels at a finite speed, the farther away a Flickr user is located from the US, the slower Flickr’s response time will be. Recently, we looked at opportunities to improve this situation. One of the improvements involves keeping temporary copies of recently viewed photos in locations nearer to users.  The other improvement aims to get a benefit from these caches even when a user views a photo that is not already in the cache.

Regional Photo Caches

For a few years, we’ve deployed regional photo caches located in Switzerland and Singapore. Here’s how this works. When one of our users in Vietnam requests a photo, we copy it temporarily to Singapore. When a second user requests the same photo, from, say, Kuala Lumpur, the photo is already present in Singapore. Flickr can respond much faster using this copy (only a few hundred kilometers away) instead of using the original file back in the US (over 8,000 km away).

The first piece of our solution has been to create additional caches closer to our users. We expanded our regional cache footprint around two months ago. Our Australian users, among others, should now see dramatically faster load times. Australian users will now see the average image load about twice as fast as it did in March.

We’re happy with this improvement and we’re planning to add more regional caches over the next several months to help users in other regions.

Cache Prefetch

When users in locations far from the US view photos that are already in the cache, the speedup can be up to 10x, but only for the second and subsequent viewers. The first viewer still has to wait for the file to travel all the way from the US. This is important because there are so many photos on Flickr that are viewed infrequently. It’s likely that a given photo will not be present in the cache. One example is a user looking at their Auto Upload album. Auto uploaded photos are all private initially. Scrolling through this album, it’s likely that very few of the photos will be in their regional cache, since no other users would have been able to see them yet.

It turns out that we can even help the first viewer of a photo using a trick called cache warming.

To understand how cache warming works, you need to understand a bit about how we serve images. For example, say that I’m a user in Spain trying to access the photostream of a user, Martin Brock, in the US. When my request for Martin Brock’s photostream hits our backend servers, our code quickly determines the most recent photos Martin has uploaded that are visible to me, which sizes will fit best in my browser, and the URLs of those images. It then sends me the list of those URLs in an HTML response. The user’s web browser reads the HTML, finds the image URLs and starts loading them from the closest regional cache.

Standard image fetch

So you’re probably already guessing how to speed things up.  The trick is to take advantage of the time in between when the server knows which images will be needed and the time when the browser starts loading them from the closest cache. This period of time can be in the range of hundreds of milliseconds. We saw an opportunity during this time to send the needed images over to the viewer’s regional cache in advance of their browser requesting the images. If we can “win the race” to do this, the viewer’s experience will be much faster, since images will load from the local cache instead of loading from the US.

To take advantage of this opportunity, we created a new “cache warming” process called The Warmer. Once we’ve determined which images will be requested (the first few photos in Martin’s photostream) we send  a message from the API servers to The Warmer.

The Warmer listens for messages and, based on the user’s location, it determines from which of the Flickr regional caches the user will likely request the image. It then pushes the image out to this cache.

Optimized image fetch, with cache warming path indicated in red

Getting this to work well required a few optimizations.

Persistent connections

Yahoo encrypts all traffic between our data centers. This is great for security, but the time to set up a secure connection can be considerable. In our first iteration of The Warmer, this set up time was so long that we rarely got the photo to the cache in time to benefit a user. To eliminate this cost, we used an Nginx proxy which maintains persistent connections to our remote data centers. When we need to push an image out – a secure connection is already set up and waiting to be used.

Transport layer

The next optimization we made helped us reduce the cost of sending messages to The Warmer.  Since the data we’re sending always fits in one datagram, and we also don’t care too much if a small percentage of these messages are never received, we don’t need any of the socket and connection features of TCP. So instead of using HTTP, we created a simple JSON format for sending messages using UDP datagrams. Another reason we chose to use UDP is that if The Warmer is not available or is reacting slowly, we don’t want that to cause slowdowns in the API.

Queue management

Naturally, some images are quite popular and it would waste resources to push them to the same cache repeatedly. So, the third optimization we applied was to maintain a list of recently pushed images in The Warmer. This simple "de-duplication" cut the number of requests made by The Warmer by 60%. Similarly, The Warmer drops any incoming requests that are more than fifty milliseconds old. This "time-to-live" provides a safety valve in case The Warmer has fallen behind and can't catch up.

import json
import os
import socket
from multiprocessing.pool import ThreadPool

# keepalive_proxy, recently_warmed() and request_too_old() are assumed to be
# defined elsewhere (see "Persistent connections" and "Queue management" above).

def warm_up_url(params):
  # Fetch the image through the keepalive proxy; the Host header selects which
  # regional cache gets warmed.
  requested_jpg = params['jpg']
  colo_to_warm = params['colo_to_warm']
  curl = "curl -H 'Host: " + colo_to_warm + "' '" + keepalive_proxy + "/" + requested_jpg + "'"
  os.system(curl)

if __name__ == '__main__':

  # create the worker pool
  worker_pool = ThreadPool(processes=100)

  # listen for UDP messages from the API servers (port number is illustrative)
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(('', 9999))

  while True:

    # receive requests
    json_data, addr = sock.recvfrom(2048)
    params = json.loads(json_data)

    requested_jpg = params['jpg']
    # placeholder for the (elided) logic that maps the user's location to the
    # closest regional cache
    colo_to_warm = pick_cache_for_user(params)

    # skip images we pushed recently ("de-duplication")
    if recently_warmed(colo_to_warm, requested_jpg):
      continue

    # drop requests that are too old to be useful ("time-to-live")
    if request_too_old(params):
      continue

    # warm up urls
    params['colo_to_warm'] = colo_to_warm
    warm_result = worker_pool.apply_async(warm_up_url, (params,))

Cache Warmer pseudocode
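The pseudocode calls recently_warmed() and request_too_old() without defining them. Here is one way those checks could look, assuming the API servers timestamp each message; the field names and the de-duplication window are our own assumptions, not Flickr's implementation:

import time

WARM_TTL_SECONDS = 0.050        # drop requests more than fifty milliseconds old
DEDUP_WINDOW_SECONDS = 60       # assumed window; the post doesn't specify one
_recently_pushed = {}           # (colo, jpg) -> time of last push; a real version would evict old entries

def recently_warmed(colo_to_warm, requested_jpg):
    """True if we already pushed this image to this cache within the window."""
    now = time.time()
    key = (colo_to_warm, requested_jpg)
    last_pushed = _recently_pushed.get(key)
    if last_pushed is not None and now - last_pushed < DEDUP_WINDOW_SECONDS:
        return True
    _recently_pushed[key] = now
    return False

def request_too_old(params):
    """True if the message sat in the queue longer than the time-to-live."""
    return time.time() - params['sent_at'] > WARM_TTL_SECONDS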


Our initial implementation of the Warmer was in Python, using a ThreadPool. This allowed very rapid prototyping and worked great — up to a point. Profiling the Python code, we found a large portion of time spent in socket calls. Since there is so little code in The Warmer, we tried porting to Java. A nearly line-for-line translation resulted in a greater than 10x increase in capacity.


When we began this process, we weren't sure whether The Warmer would be able to populate caches before the user requests came in, so we were pleasantly surprised when we first enabled it at scale. In the first region where we deployed The Warmer (Western Europe), we observed a median latency reduction of more than 200 ms, 95% of photo requests sped up by at least 100 ms, and a small percentage of photos saw over 400 ms reduction in latency. As we continue to deploy The Warmer in additional regions, we expect to see similar improvements.

Next Steps

In addition to deploying more regional photo caches and continuing to improve prefetching performance, we’re looking at a few more techniques to make photos load faster.


Compression

Overall, Flickr uses a light touch on compression. This results in excellent image quality at the cost of relatively large file sizes, which translates directly into longer load times for users. With a growing number of our users connecting to Flickr from wireless devices, we want to give everyone a good experience whether they're on a high-speed LTE connection or two bars of 3G in the countryside, so we're looking at compressing images more aggressively. An important goal will be to make these changes with little or no loss in image quality.

We are also testing alternative image encoding formats, such as WebP, which under certain conditions can deliver better image quality than JPEG at the same file size.

Geolocation and routing

It turns out it’s not straightforward to know which photo cache is going to give the best performance for a user. It depends on a lot of factors, many of which change over time — sometimes suddenly. We think the best way to do this is with a system that adapts dynamically to “Internet weather.”

Cache intelligence

Today, if a user needs to see a medium sized version of an image, and that version is not already present in the cache, the user will need to wait to retrieve the image from the US, even if a larger version of the image is already in the cache. In this case, there is an opportunity to create the smaller version at the cache layer and avoid the round-trip to the US.
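As a rough sketch of that idea (Flickr's actual resizing pipeline isn't described in this post), deriving a medium rendition at the cache layer from an already-cached larger version might look like this with Pillow:

from io import BytesIO
from PIL import Image

def derive_medium_from_cached_large(large_jpeg_bytes, target_width=800):
    """Build a medium rendition from a larger version already in the regional cache,
    avoiding a round-trip to the US origin."""
    img = Image.open(BytesIO(large_jpeg_bytes))
    ratio = target_width / float(img.width)
    resized = img.resize((target_width, int(img.height * ratio)), Image.LANCZOS)
    out = BytesIO()
    resized.save(out, format="JPEG", quality=90)   # quality setting is illustrative
    return out.getvalue()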

Overall we’re happy with these improvements and we’re excited about the additional opportunities we have to continue to make the Flickr experience super fast for our users. Thanks for following along.


Exploring Life Without Compass

Compass is a great thing. At Flickr, we’re actually quite smitten with it. But being conscious of your friends’ friends is important (you never know who they’ll invite to your barbecue), and we’re not so sure about this “Ruby” that Compass is always hanging out with. Then there’s Ruby’s friend Bundler who, every year at the Christmas Party, tells the same stupid story about the time the police confused him with a jewelry thief. Enough is enough! We’ve got history, Compass, but we just feel it might be time to try seeing other people.

Solving for Sprites

In order to find a suitable replacement (and for a bit of closure), we had to find out what kept us relying on Compass for so long. We knew the big one coming into this great experiment: sprites. Flickr is a huge site with many different pages, all of which have their own image folders that need to be sprited together. There are a few different options for compiling sprites on your own, but we liked spritesmith for its multiple image rendering engines, which gives us some flexibility in dependencies.

A grunt task is available for spritesmith, but it assumes you are generating only one sprite. Our setup is a bit more complex and we’d like to keep our own sprite mixin intact so we don’t actually have to change a line of code. With spritesmith and our own runner to iterate over our sprite directories, we can easily create the sprites and output the dimensions and urls via a simple Handlebars template to a Sass file.

{{#each sprites}}
    {{#each images}}
        %{{../dir}}-{{name}}-dimensions {
            width: {{coords.width}}px;
            height: {{coords.height}}px;
        }
        %{{../dir}}-{{name}}-background {
            background: image-url('{{../url}}') -{{coords.x}}px -{{coords.y}}px no-repeat;
        }
    {{/each}}
{{/each}}
You could easily put all three of these rules in the same declaration, but we have some added flexibility in mind for our mixin.

It’s important to note that, because we’re using placeholders (the % syntax in Sass), nothing is actually written out unless we use it. This keeps our compiled CSS nice and clean (just like Compass)!

@import 'path/to/generated/sprite/file';

@mixin background-sprite($icon, $set-dimensions: false) {
    @extend %#{$spritePath}-#{$icon}-background;

    @if $set-dimensions == true {
        @extend %#{$spritePath}-#{$icon}-dimensions;
    }
}
Here, our mixin uses the Sass file we generated to provide powerful and flexible sprites. Note: Although retina isn’t shown here, adding support is as simple as extending the Sass mixin with appropriate media queries. We wanted to keep the example simple for this post, but it gives you an idea of just how extensible this setup is!

Now that the big problem is solved, what about the rest of Compass’s functionality?

Completing the Package

How do we account for the remaining items in the Compass toolbox? First, it’s important to find out just how many mixins, functions, and variables are used. An easy way to find out is to compile with Sass and see how much it complains!

sass --update assets/sass:some-temp-dir

Depending on the complexity of your app, you may see quite a lot of these errors.

error assets/css/base.scss (Line 3: Undefined mixin 'font-face'.)

In total, we’re missing 16 mixins provided by Compass (and a host of variables). How do we replace all the great mixin functionality of Compass? With mixins of the same name, node-bourbon is a nice drop-in replacement.

What is the point of all this work again?

The Big Reveal

Now that we’re comfortably off Compass, how exactly are we going to compile our Sass? Well try not to blink, because this is the part that makes it all worthwhile.

Libsass is a blazing-fast C/C++ port of the Sass compiler, with bindings exposed through modules like node-sass.

Just how fast? With Compass, our compile times were consistently around a minute and a half to two minutes. Taking care of spriting ourselves and using libsass for Sass compilation, we’re down to 5 seconds. When you deploy as often as we do at Flickr (in excess of 10 times a day), that adds up and turns into some huge savings!

What’s the Catch?

There isn’t one! Oh, okay. Maybe there are a few little ones. We’re pretty willing to swallow them though. Did you see that compile time?!

There are some differences, particularly with the @extend directive, between Ruby Sass and libsass. We’re anticipating that these small kinks will continue to be ironed out as the port matures. Additionally, custom functions aren’t supported yet, so some extensibility is lost in coming from Ruby (although node-sass does have support for the image-url built-in which is the only one we use, anyway).

With everything taken into account, we’re counting down the days until we make this dream a reality and turn it on for our production builds.


Redis Sentinel at Flickr


We recently implemented Redis Sentinel at Flickr to provide automated Redis master failover for an important subsystem and we wanted to share our experience with it. Hopefully, we can provide insight into our experience adopting this relatively new technology and some of the nuances we encountered getting it up and running. Although we try to provide a basic explanation of what Sentinel is and how it works, anyone who is new to Redis or Sentinel should start with the excellent Redis and Sentinel documentation.

At Flickr we use an offline task processing system that allows us to execute heavyweight operations asynchronously from our API and web pages. This prevents these operations from making users wait needlessly for pages to render or API methods to return. Our task system handles millions of tasks per day, including operations like photo uploads, user notifications and metadata edits. In this system, code can push a task onto one of several Redis-backed queues based on priority and operation, then forget about it. Many of these operations are critical and we need to make sure we process at least 99.9999% of them (fewer than 1 in 1 million dropped). Additionally, we need to make sure this system is available to insert and process tasks at least 99.995% of the time – no more than about 2 minutes a month of downtime.
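As a rough illustration of that push-and-forget pattern (not Flickr's actual task code; the queue names and payload fields here are invented), enqueuing a task onto a Redis-backed queue can be as simple as:

import json
import redis

client = redis.Redis(host="tasks-redis.example.com")   # hypothetical host

def push_task(operation, payload, priority="medium"):
    """Serialize a task, push it onto the Redis list for its priority, then forget it."""
    task = json.dumps({"op": operation, "payload": payload})
    # One list per priority level; workers pop from the other end with BRPOP.
    client.lpush("tasks:%s" % priority, task)

push_task("photo_upload_postprocess", {"photo_id": 12345}, priority="high")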

Until a few months ago, our Redis BCP (business continuity plan) consisted of little more than master-slave replication, AOF persistence, and a manual failover runbook.

Upon master failure, the recovery plan included several manual steps: reconfiguring code to take the Redis master(s) offline and manually promoting a Redis slave (a mildly time consuming activity). Then we would rebuild and backfill unprocessed data from AOF files and error logs — a very time consuming activity. We knew if we lost a master we would have hours and hours of less-than-satisfying work to run the recovery plan, and there was potential for user impact and even a small amount of data loss. We had never experienced a Redis master failure, but we all know that such events are simply a matter of time. Overall, this fell far short of our durability and availability goals.

Configuring Sentinel

We started by installing and testing Sentinel in a development environment and the first thing we noticed was how simple Sentinel is to use and how similar the syntax is to Redis. We read Aphyr’s article and his back-and-forth blog duel with Salvatore and verified Aphyr’s warning about the “split brain” scenario. Eventually we decided the benefits outweighed the risks in our specific use case. During testing we learned about some Sentinel nuances and got a better feel for appropriate configuration values, many of which have little or no community guidance yet.

One such example was choosing a good value for the level-of-agreement setting, which is the number of Sentinels that must simultaneously report a host outage before automatic failover starts. If this value is too high you'll miss real failures, and if it's too low you are more susceptible to false alarms. (Thanks to Aleksey Asiutin (@aasiutin) for the edit!) In the end, we looked at the physical topology of our hosts across racks and switches and chose to run a relatively large number of Sentinel instances to ensure good coverage. Based on tuning in production, we chose a level-of-agreement equal to about 80% of the Sentinel instances (the quorum value of 35 in the example configuration below reflects this).

The down-after-milliseconds configuration setting is the time the Sentinels will wait with no response to their ping requests before declaring a host outage. Sentinels ping the hosts they monitor approximately every second, so by choosing a value of 3,100 we expect Sentinels to miss 3 pings before declaring host outage. Interestingly, because of Sentinel’s ping frequency we found that setting this value to less than 1,000 results in an endless stream of host outage notifications from the Sentinels, so don’t do that.  We also added an extra 100 milliseconds (3,100ms rather than 3,000ms) to allow for some variation in Redis response time.

We chose a parallel-syncs value of 1.  This item dictates the number of slaves that are reconfigured simultaneously after a failover event.  If you serve queries from the read-only slaves you’ll want to keep this value low.

For an explanation of the other values we refer you to the self-documented default sentinel.conf file.

An example of the Sentinel configuration we use:

sentinel monitor cluster_name_1 redis_host_1 6390 35
sentinel down-after-milliseconds cluster_name_1 3100
sentinel parallel-syncs cluster_name_1 1

sentinel monitor cluster_name_2 redis_host_2 6391 35
sentinel down-after-milliseconds cluster_name_2 3100
sentinel parallel-syncs cluster_name_2 1

port 26379

pidfile [path]/
logfile [path]logs/redis/redis-sentinel.log
daemonize yes

An interesting nuance of Sentinels is that they write state to their configuration file. This presented a challenge for us because it conflicted with our change management procedures. How do we maintain a dependably consistent startup configuration if the Sentinels are modifying the config files at runtime? Our solution was to create two Sentinel config files. One is strictly maintained in Git and not modified by Sentinel. This “permanent” config file is part of our deployment process and is installed whenever we update our Sentinel system configuration (i.e.: “rarely”). We then wrote a startup script that first duplicates the “permanent” config file to a writable “temporary” config file, then starts Sentinel and passes it the “temporary” file via command-line params. Sentinels are allowed to modify the “temporary” files as they please.
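A minimal sketch of that startup wrapper (the paths are illustrative, not our actual deployment layout):

import shutil
import subprocess

PERMANENT_CONF = "/etc/redis/sentinel.conf.permanent"   # tracked in Git, never touched by Sentinel
RUNTIME_CONF = "/var/run/redis/sentinel.conf"           # disposable copy Sentinel may rewrite

def start_sentinel():
    # Start from a known-good configuration on every launch...
    shutil.copyfile(PERMANENT_CONF, RUNTIME_CONF)
    # ...then hand Sentinel the writable copy; it records runtime state there as it sees fit.
    subprocess.check_call(["redis-sentinel", RUNTIME_CONF])

if __name__ == "__main__":
    start_sentinel()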


Interfacing with Sentinel

A common misconception about Sentinel is that it resides in-band between Redis and Redis clients. In fact, Sentinel is out-of-band and is only contacted by your services on startup. Sentinel then publishes notifications when it detects a Redis outage. Your services subscribe to Sentinel, receive the initial Redis host list, and then carry on normal communication directly with the Redis host.
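Our clients are Jedis and Predis (more on that below), but the same startup handshake is easy to see in Python with redis-py's Sentinel helper; the host names here are placeholders:

from redis.sentinel import Sentinel

# Ask any reachable Sentinel which Redis host is currently the master for this cluster.
sentinel = Sentinel([("sentinel-host-1", 26379), ("sentinel-host-2", 26379)],
                    socket_timeout=0.5)
master_host, master_port = sentinel.discover_master("cluster_name_1")

# From this point on, traffic flows directly to Redis, not through Sentinel.
master = sentinel.master_for("cluster_name_1", socket_timeout=0.5)
master.ping()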

The Sentinel command syntax is very similar to Redis command syntax. Since Flickr has been using Redis for a long time the adaptation of pre-existing code was pretty straightforward for us. Code modifications consisted of adding a few Java classes and modifying our configuration syntax. For Java-to-Redis interaction we use Jedis, and for PHP we use Predis and libredis.

Using Sentinel from Jedis is not documented as well as it could be. Here’s some code that we hope will save you some time:

// Verify that at least one Sentinel instance in the Set is available and responding.
// sentinelHostPorts: String format: [hostname]:[port]
private boolean jedisSentinelPoolAvailable(Set<String> sentinelHostPorts, String clusterName) {
   log.info("Trying to find master from available Sentinels...");
   for ( String sentinelHostPort : sentinelHostPorts ) {
      List<String> hostPort = Arrays.asList( sentinelHostPort.split(":") );
      String hostname = hostPort.get(0);
      int port = Integer.parseInt( hostPort.get(1) );
      try {
         Jedis jedis = new Jedis( hostname, port );
         jedis.sentinelGetMasterAddrByName( clusterName );
         jedis.disconnect();
         log.info("Connected to Sentinel host:%s port:%d", hostname, port);
         return true;
      } catch (JedisConnectionException e) {
         log.warn("Cannot connect to Sentinel host:%s port:%d", hostname, port);
      }
   }
   return false;
}

private Pool<Jedis> getDefaultJedisPool() {
   // Create and return a default Jedis Pool object…
   // ...
}

// ConfigurationMgr configMgr ⇐ your favorite way of managing system configuration (up to you)
public Pool<Jedis> getPool(ConfigurationMgr configMgr) {
   String clusterName = configMgr.getRedisClusterName();
   Set<String> sentinelHostPorts = configMgr.getSentinelHostPorts();

   if (sentinelHostPorts.size() > 0) {
      if (jedisSentinelPoolAvailable( sentinelHostPorts, clusterName )) {
         return new JedisSentinelPool(clusterName, sentinelHostPorts);
      } else {
         log.warn("All Sentinels unreachable.  Using default Redis hosts.");
         return getDefaultJedisPool();
      }
   } else {
      log.warn("Sentinel config empty.  Using default Redis hosts.");
      return getDefaultJedisPool();
   }
}


Testing Sentinel at Flickr

Before deploying Sentinel to our production system we had several questions and concerns:

  • How will the system react to host outages?
  • How long does a failover event last?
  • How much of a threat is the split-brain scenario?
  • How much data loss can we expect from a failover?

We commandeered several development machines and installed a few Redis and Sentinel instances. Then we wrote some scripts that insert or remove data from Redis to simulate production usage.


We ran a series of tests on this setup, simulating a variety of Redis host failures with some combination of the commands: kill -9, the Sentinel failover command, and Linux iptables. This resulted in “breaking” the system in various ways.

Figure: Redis master failure

Figure: Network partition producing a ‘split-brain’ scenario

How will the system react to host outages?

For the most part we found Sentinel to behave exactly as expected and described in the Sentinel docs. The Sentinels detect host outages within the configured down-after-milliseconds duration, then send “subjective down” notifications, then send “objective down” notifications if the level-of-agreement threshold is reached. In this environment we were able to quickly and accurately test our response to failover events. We began with small test scripts, but eventually were able to run repeatable integration tests on our production software. Adding Redis to a Maven test phase for automated integration testing is a backlog item that we haven’t implemented yet.
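Sentinel publishes these state changes on its pub/sub interface, which makes them easy to watch during this kind of testing. A small redis-py sketch, with a placeholder host name:

import redis

# Connect to a Sentinel instance on its own port (26379), not to Redis itself.
sentinel_conn = redis.Redis(host="sentinel-host-1", port=26379)

pubsub = sentinel_conn.pubsub()
# Sentinel publishes events on channels named after the event type.
pubsub.psubscribe("+sdown", "+odown", "+switch-master")

for event in pubsub.listen():
    # e.g. channel '+odown', data 'master cluster_name_1 ...'
    print(event)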

How long does a failover event last?

The Sentinel test environment was configured with a down-after-milliseconds value of 3,100ms (just like production, see above). With this value Sentinels would produce a host outage notification after approximately 3 unsuccessful pings (one ping per second). In addition to the 3,100ms delay, we found there were 1-3 seconds in overhead for processing the failover event and electing a new Redis master, resulting in 4-6 seconds of total downtime. We are pretty confident we’ll see the same behavior in production (verified — see below).

How much of a threat is the “split-brain” scenario?

We carefully read Aphyr and Salvatore’s blog articles debating the threat of a “split brain scenario.” To summarize: this is a situation in which network connectivity is split, with some nodes still functioning on one side and other nodes continuing to function independently on the other side. The concern is the potential for the data set to diverge with different data being written to masters on both sides of the partition. This could easily create data that is either impossible or very difficult to reconcile.

We recreated this situation and verified that a network partition could create disjoint concurrent data sets. Removing the partition resulted in Sentinel arbitrarily (from our perspective) choosing a new master and losing all data written (post-partitioning) to the other master. So the question is: given our production architecture, what is the probability of this happening and is it acceptable given the significant benefit of automatic failover?

We looked at this scenario in detail considering all the potential failure modes in our deployment. Although we believe our production environment is not immune from split-brain, we are pretty sure that the benefits outweigh the risks.

How much data loss can we expect from a failover event?

After testing we were confident that Redis host outages would cost only 4-6 seconds of downtime in this system. Rapid automated Sentinel failover, combined with reasonable backoff-and-retry logic in our code, was expected to further reduce data loss during a failover event. With Sentinel deployed, and given a long history of highly stable Redis operation, we believed we could achieve 99.995% or better production availability – a few minutes of downtime per year.

Sentinel in Production

So how has Sentinel performed in production? Mostly it has been silent, which is a good thing. A month after finishing our deployment we had a hardware failure in a network switch that had some of our Redis masters behind it. Instead of having a potential scenario involving  tens of minutes of user impact with human-in-the-loop actions to restore service, automatic failover allowed us to limit impact to just seconds with no human intervention. Due to the quick master failover and other reliability features in the code, only 270 tasks failed to insert due to the outage — all of which were captured by logging. Based on the volume of tasks in the system, this met our 99.9999% task durability goal. We did however decide to re-run a couple tasks manually and for certain critical and low-volume tasks we’re looking at providing even more reliability.

One more note from production experience. We occasionally see Sentinels reporting false "subjective down" events. Our Sentinel instances share host machines with other services, and when those hosts get busy we suspect the load spikes affect the Sentinels' ability to send and receive ping requests. Because our level-of-agreement is high, these false alarms do not trigger "objective down" events and are relatively harmless. If you're deploying Sentinel on hosts that share other workloads, consider the potential impact of load patterns on those hosts and take some time to tune your level-of-agreement.


We have been very happy with Sentinel’s ease of use, relatively simple learning curve and brilliant production execution. So far the Redis/Sentinel combination is working great for us.



References

  1. Redis Sentinel Documentation
  2. Redis Command Reference
  3. Aphyr: Call me maybe: Redis
  4. Antirez: Reply to Aphyr Attack on Sentinel
  5. Jedis Project Page

A Summer at Flickr

This summer I had the unforgettable opportunity to work side-by-side with some of the smartest, photography-loving software engineers I've ever met. Looking back on my first day at Flickr HQ – beginning with a harmonious welcome during Flickr Engineering's weekly meeting – I can confidently say that over the past ten weeks I have become a much better software engineer.

One of my projects for the summer was to build a new and improved locking manager that controls the distribution of locks for offline tasks (or OLTs for short). Flickr uses OLTs all the time for data migration, uploading photos, updates to accounts, and more. An OLT needs to acquire a lock on a shared resource, such as a group or an account, to prevent other OLTs from accessing the same resource at the same time. The OLT will then release the lock when it’s done modifying the shared data. Myles wrote an excellent blog post on how Flickr uses offline tasks here.

When building a distributed lock system, we need to take a couple of important details into account. First, we need to make sure that all the lock servers are consistent. One way to maintain consistency is to elect one server to act as a master and the remaining servers as slaves, with the master responsible for replicating data to the slaves. Second, we need to account for network and hardware failures – for instance, if the master server goes down for some reason, we need to quickly elect a new master from one of the slaves. The good news is that Apache ZooKeeper provides exactly this out of the box: master-slave data replication, automatic leader election, and atomic distributed data reads and writes.

Offline tasks send lock acquire and release requests through ZooLocker. ZooLocker in turn interfaces with the ZooKeeper cluster to create and delete znodes that correspond to the individual locks.

In the new locking system (dubbed “ZooLocker”), each lock is stored as a unique data node (or znode) on the ZooKeeper servers. When a client acquires a lock, ZooLocker creates a znode that corresponds to the lock. If the znode already exists, ZooLocker will tell the client that the lock is currently in use. When a client releases the lock, ZooLocker deletes the corresponding znode from memory. ZooLocker stores helpful debugging information, such as the owner of the lock, the host it was created on, and the maximum amount of time to hold on to the lock, in a JSON-serialized format in the znode. ZooLocker also periodically scans through each znode in the ZooKeeper ensemble to release locks that are past their expiration time.
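ZooLocker itself is internal code, but the acquire/release flow described above is easy to sketch with the kazoo Python client; the znode path layout and payload fields below are assumptions, not ZooLocker's actual format:

import json
import socket
import time

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError, NoNodeError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # placeholder ensemble
zk.start()

def acquire_lock(resource_id, owner, max_hold_seconds=300):
    """Take the lock by creating its znode; fail if the znode (lock) already exists."""
    payload = json.dumps({
        "owner": owner,
        "host": socket.gethostname(),
        "expires_at": time.time() + max_hold_seconds,
    }).encode("utf-8")
    try:
        zk.create("/locks/%s" % resource_id, payload, makepath=True)
        return True
    except NodeExistsError:
        return False    # lock is currently in use

def release_lock(resource_id):
    try:
        zk.delete("/locks/%s" % resource_id)
    except NoNodeError:
        pass            # already released, or reaped by the expiration scanner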

My locking manager is already serving locks in production. In spite of sudden spikes in lock acquire and release requests by clients, the system holds up pretty well.

A graph of the number of lock acquire requests in ZooLocker per second

My summer internship at Flickr has been an incredibly valuable experience for me. I have demystified the process of writing, testing, and integrating code into a running system that millions of people around the world use each and every day. I have also learned about the amazing work going on in the engineering team, the ups and downs of the code deploy process, and how to dodge the incoming flying finger rockets that the Flickr team members fling at each other. My internship at Flickr is an experience I will never forget, and I am very grateful to the entire Flickr team for giving me the opportunity to work with and learn from them this summer.

Redis Global Locks Redux

In my last post I described how we use Redis to manage a global lock that allows us to automatically failover to a backup process if there was a problem in the primary process. The method described allegedly allowed for any number of backup processes to work in conjunction to pick up on primary failures and take over processing.

Locks #1 by Christoph Kummer

Thanks to an astute reader, it was pointed out that the code in the blog wouldn't actually work as advertised.


The Problem

Nolan correctly noticed that when a backup process attempts to acquire the lock via SETNX, the lock key will already exist from when it was acquired by the primary, and thus all subsequent attempts will simply keep trying to acquire a lock that can never be acquired. As a reminder, here's what we do when we check back on the status of a lock:

function checkLock(payload, lockIdentifier) {
    client.get(lockIdentifier, function(error, data) {
        // Error handling elided for brevity
        if (data !== DONE_VALUE) {
            acquireLock(payload, data + 1, lockCallback);
        } else {
            // Work is already done, nothing left for us to do
        }
    });
}

And here’s the relevant bit from acquireLock that calls SETNX:

    client.setnx(lockIdentifier, attempt, function(error, data) {
        if (error) {
            logger.error("Error trying to acquire redis lock for: %s", lockIdentifier);
            return callback(error, dataForCallback(false));
        }

        return callback(null, dataForCallback(data === 1));
    });

So, you’re thinking, how could this vaunted failover process ever actually work? The answer is simple: the code from that post isn’t what we actually run. The actual production code has a single backup process, so it doesn’t try to re-acquire the lock in the event of failure, it just skips right to trying to send the message itself. In the previous post, I described a more general solution that would work for any number of backup processes, but I missed this one important detail.

That being said, with some relatively minor changes, it's absolutely possible to support an arbitrary number of backup processes and still maintain the use of the global lock. The trivial solution is to simply have the backup process delete the key before trying to re-acquire the lock (or, technically, acquire it anew). The problem with that becomes apparent pretty quickly, however: if there are multiple backup processes all deleting the lock and trying to SETNX a new lock again, there's a good chance of a race condition wherein one of the backups deletes a lock that was acquired by another backup process, rather than the failed lock from the primary.

The Solution

Thankfully, Redis has a solution to help us out here: transactions. By using a combination of WATCH, MULTI, and EXEC, we can perform actions on the lock key and be confident that no one has modified it before our actions can complete. The process to acquire a lock remains the same: many processes will issue a SETNX and only one will win. The changes come into play when the processes that didn’t acquire the lock check back on its status. Whereas before, we simply checked the current value of the lock key, now we must go through the above described Redis transaction process. First we watch the key, then we do what amounts to a check and set (albeit with a few different actions to perform based on the outcome of the check):

function checkLock(payload, lockIdentifier, lastCount) {
    client.watch(lockIdentifier);
    client.multi()
        .get(lockIdentifier)
        .exec(function(error, replies) {
            if (!replies) {
                // Lock value changed while we were checking it, someone else got the lock
                client.get(lockIdentifier, function(error, newCount) {
                    setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, newCount);
                });
                return;
            }

            var currentCount = replies[0];
            if (currentCount === null) {
                // No lock means someone else completed the work while we were checking on its status and the key has already been deleted
            } else if (currentCount === DONE_VALUE) {
                // Another process completed the work, let's delete the lock key
                client.del(lockIdentifier);
            } else if (currentCount == lastCount) {
                // Key still exists, and no one has incremented the lock count, let's try to reacquire the lock
                reacquireLock(payload, lockIdentifier, currentCount, doWork);
            } else {
                // Key still exists, but the value does not match what we expected, someone else has reacquired the lock, check back later to see how they fared
                setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, currentCount);
            }
        });
}

As you can see, there are five basic cases we need to deal with after we get the value of the lock key:

  1. If we got a null reply back from Redis, that means that something else changed the value of our key, and our exec was aborted; i.e. someone else got the lock and changed its value before we could do anything. We just treat it as a failure to acquire the lock and check back again later.
  2. If we get back a reply from Redis, but the value for the key is null, that means that the work was actually completed and the key was deleted before we could do anything. In this case there’s nothing for us to do at all, so we can stop right away.
  3. If we get back a value for the lock key that is equal to our sentinel value, then someone else completed the work, but it’s up to us to clean up the lock key, so we issue a Redis DEL and call our job done.
  4. Here’s where things get interesting: if the key still exists, and its value (the number of attempts that have been made) is equal to our last attempt count, then we should try and reacquire the lock.
  5. The last scenario is where the key exists but its value (again, the number of attempts that have been made) does not equal our last attempt count. In this case, someone else has already tried to reacquire the lock and failed. We treat this as a failure to acquire the lock and schedule a timeout to check back later to see how whoever did acquire the lock got on. The appropriate action here is debatable. Depending on how long your underlying work takes, it may be better to actually try and reacquire the lock here as well, since whoever acquired the lock may have already failed. This can, however, lead to premature exhaustion of your attempt allotment, so to be safe, we just wait.

So, we’ve checked on our lock, and, since the previous process with the lock failed to complete its work, it’s time to actually try and reacquire the lock. The process in this case is similar to the above inasmuch as we must use Redis transactions to manage the reacquisition process, thankfully however, the steps are (somewhat) simpler:

function reacquireLock(payload, lockIdentifier, attemptCount, callback) {
    client.watch(lockIdentifier);
    client.get(lockIdentifier, function(error, data) {
        if (!data) {
            // Lock is gone, someone else completed the work and deleted the lock, nothing to do here, stop watching and carry on
            client.unwatch();
            return;
        }

        var attempts = parseInt(data, 10) + 1;

        if (attempts > MAX_ATTEMPTS) {
            // Our allotment has been exceeded by another process, unwatch and expire the key
            client.unwatch();
            client.expire(lockIdentifier, ((LOCK_EXPIRY / 1000) * 2));
            return;
        }

        client.multi()
            .set(lockIdentifier, attempts)
            .exec(function(error, replies) {
                if (!replies) {
                    // The value changed out from under us, we didn't get the lock!
                    client.get(lockIdentifier, function(error, currentAttemptCount) {
                        setTimeout(checkLock, LOCK_TIMEOUT, payload, lockIdentifier, currentAttemptCount);
                    });
                } else {
                    // Hooray, we acquired the lock!
                    callback(null, {
                        "acquired" : true,
                        "lockIdentifier" : lockIdentifier,
                        "payload" : payload
                    });
                }
            });
    });
}

As with checkLock, we start out by watching the lock key, and proceed to do a (comparatively) simplified check and set. In this case, we've “only” got three scenarios to deal with:

  1. If we’ve already exceeded our allotment of attempts, it’s time to give up. In this case, the allotment was actually exceeded in another worker, so we can just stop right away. We make sure to unwatch the key, and set it expire at some point far enough in the future that any remaining processes attempting to acquire locks will also see that it’s time to give up.

Assuming we’re still good to keep working, we try and update the lock key within a MULTI/EXEC block, where we have our remaining two scenarios:

  1. If we get no replies back, that again means that something changed the value of the lock key during our transaction and the EXEC was aborted. Since we failed to acquire the lock we just check back later to see what happened to whoever did acquire the lock.
  2. The last scenario is the one in which we managed to acquire the lock. In this case we just go ahead and do our work and hopefully complete it!


To make managing global locks even easier, I've gone ahead and generalized all the code mentioned in both this and the previous post on the subject into a tidy little event-based npm package, redis-locking-worker. Here's a quick snippet of how to implement global locks using this new package:

var RedisLockingWorker = require("redis-locking-worker");

var SUCCESS_CHANCE = 0.15;

var lock = new RedisLockingWorker({
    "lockKey" : "mylock",
    "statusLevel" : RedisLockingWorker.StatusLevels.Verbose,
    "lockTimeout" : 5000,
    "maxAttempts" : 5
});

lock.on("acquired", function(lastAttempt) {
    if (Math.random() <= SUCCESS_CHANCE) {
        console.log("Completed work successfully!", lastAttempt);
    } else {
        // oh no, we failed to do work!
        console.log("Failed to do work");
    }
});

There’s also a few other events you can use to track the lock status:

lock.on("locked", function() {
    console.log("Did not acquire lock, someone beat us to it");
});

lock.on("error", function(error) {
    console.error("Error from lock: %j", error);
});

lock.on("status", function(message) {
    console.log("Status message from lock: %s", message);
});

More Bonus!

If you don’t need the added complexity if multiple backup processes, I also want to give credit to npm user pokehanai who took the methodology described in the original post and created a generalized version of the two-worker solution:

Wrapping Up

So there you have it! Coordinating work on any number of processes across any number of hosts couldn’t be easier! If you have any questions or comments on this, please feel free to follow up on Twitter.
