code.flickr.com

The Data Freshener


So fresh

 

Change

You may have noticed some changes in Flickr a couple months back. Like, half the site changed. 95% even, by some metrics. Some say CHANGE IT BACK! while others welcome change. Whatever your thoughts, the changes are here, and they mean things. For example, they mean new visual design and better usability. They mean a faster site. Unfortunately, up until recently, they also meant more stale data. Yuck.


Change
 

Why? What? Well…here’s the deal. We have a new-ish frontend stack we’ve been using for the past couple years now. It’s an isomorphic single-page application, runs on node.js, and is generally awesome. We call it Reboot.


Reboot
 

In the World of Reboot, we treat data with kid gloves. We <3 data. We never want to give it up, never want to let it down. Once we pull data from our APIs, we store the fetched data in your browser so that we don’t have to fetch it again the next time it’s needed. This means faster page loads and faster navigation, and less API traffic (and thus a more stable and scalable API). The data cached in your browser exists as long as the current Reboot session — until you refresh or leave Reboot for a non-Rebooted page.

However, this also meant that data could become stale. You change the date taken of your photo, someone else adds a comment, you navigate to a page with cached data…and you don’t see the changes. Wat? Yeah. So, this was not a huge problem until we moved lots of pages onto Reboot in the beginning of May. From that point forward, most Flickr user sessions have spent their entirety on Reboot, feeding off the same stale loaves of cached data.


Staleness
 

The thinking (design / prototypes)

We considered a number of possibilities for freshening up data during a user session. A brief history of the strategies we sampled, and their results:

1. Refresh on update


 

The first stab focused on updating data locally after it was changed by the user. Most of our simpler use cases already updated as expected, but some trickier cases with indirect relationships did not. For example, changing the date taken of a photo updated the data model for the photo, but deleting a photo did not necessarily ensure the photo was removed from all the cached albums, groups, and galleries to which it belonged. (Note that the photo was removed correctly from the backend, just not from the cached representation of those entities on the client.)

Cleaning up these relationships using change events between models helped, but didn’t solve all our problems. When someone outside of the local session (read: another user) changed data, it would not reflect in the current session. The only way to catch changes from outside the current session was to be more aggressive about evicting models.

2. Nuclear option


 

The pendulum swung all the way in the other direction — instead of surgical removal of data models we knew to be out-of-date, what would happen if we removed all cached data on every navigation? This prototype was quick to build, and incredibly destructive. By doing this, all our cached data always remained as fresh as could be, but we essentially reverted to Web 1.0 — with the exception of the Reboot framework, everything was reloaded on every page.

Not surprisingly, this blew up API traffic (locally only! did not unleash that disaster at scale), and inflated page load times like a Jeff Koons sculpture. It did give us some baseline timing metrics we could point to as worst-case scenarios, however. The next step was to swing the pendulum back toward the middle — to a carefully-knitted solution that would preserve fast page loads and navigation, while ensuring the freshest data we could serve up.

3. Refetch on navigate


 

At this point, our challenge was to find a solution that would keep navigation fast, API traffic slim, and pick up all changes to session data, whether local or remote. We ended up with a solution we call “refetching”: evicting and requesting new data models as the model is needed by the application. But when?
We could refetch periodically or on a user action; we determined that the best time to trigger a refetch was on navigation — when the user navigates, cached models become eligible for refetching. Specifically, when the user navigates between sections of the site, refetching is triggered. This proved to be the happiest medium between speed and freshness.

A high-level outline of how the refetching strategy works:

One important note — refetching is not triggered on browser back/forward navigation. Users expect near-immediate navigation, thanks to browser caching, when navigating to already-viewed content. Therefore, we refetch only when the user clicks a link to navigate to a new site section.

4. Miscellany

There were a couple other options we considered and rejected from the start, but they’re worth mentioning here.

One was a TTL (time-to-live) algorithm, commonly used in caching applications. TTL algorithms expire data and evict from the cache a certain amount of time after they’re written or last updated. The arbitrary nature of TTL would mean that users would sometimes have fresh data and sometimes stale; it would be fresh more often than without any solution, but freshness would vary arbitrarily and would not result in much of an improvement on user experience.

The other was to write an algorithm that tracks the amount of time since a data model was last accessed, and refetch when it grows too old. While this sounded interesting at first, it has the same flaw as a standard TTL algorithm — freshness becomes arbitrary. It’s also more complex to implement, and might end up not being worth the complexity.

The doing (implementation)

So that was it! Refetch on navigate, all done. Right?….of course not. With the general strategy in place, the devil started sneaking around in all the details. Some of the highlights:

Exemptions

It proved to be not the best idea to evict on all navigation. For example, in Reboot we often preload photo metadata models on pages with lists of photos, in order to make navigation into the photo page snappy. The refetch setup therefore has an exemption config that allows us to easily retain models when navigating into, away from, or between specific site sections.

Child models

We often have parent-child associations between data models. For example, the data model for a photo has a reference to a data model for the author of the photo. When the photo model is refetched, the person model must be refetched as well. This means the function doing the eviction and refetching has to recurse through all child models.

Collections

An issue similar to child models above, but more complex, is the case of a model containing a list of other models. For example, the data model for a person’s photostream contains a list of photo models.

What made this particularly tricky is pagination and filtering — say you load the first 2 pages of your photostream, set your view filter to private, jump to page 5, switch the view to “Date taken”, and navigate away and back to your photostream…imagine the mess of different models with partially-loaded collections. Evicting one parent model, and its children, might evict photo models from the collection within another, without properly refetching. The solution here actually lay in the controller responsible for fetching pages: if a requested page of models is not already completely in-cache, a refetch will always happen to ensure we have all the data, in its freshest state.

Refetch only once per page view

Critical to the refetch-on-navigation strategy is to refetch only once per navigation. This was not too difficult, but essential to get right. We accomplish this by adding a flag when a model is initially fetched and upserted into the cache. When navigating to a new, non-exempt site section, all those flags are cleared, and any model requested by the new page will be refetched. When refetched, the model is again upserted into the cache and marked as fresh, until the next navigation.

But did it fresh?


 

With the thinking and the doing out of the way, it was time to push all this to production. Because these changes are essentially pulling the rug out from underneath the data layer on every navigation, we had to tread very carefully in order to prevent any negative impact to the end user experience.

We did very thorough manual and automated testing across all of Reboot. We left the feature turned on for staff users for a while, to be able to respond to any bug reports. Finally, the time came to test on Real People. There were three things we needed to keep an eye on: errors (of course), impact on page navigation timing, and API traffic. Since refetching implies more requests for data, we needed to be sure that we were keeping the user experience smooth and fast, and also that we weren’t blowing up our data centers.

 

All in
 

In order to get a good read on these things, though, we had to go all in. Letting in just a small percentage of users would not give reliable numbers for timing or traffic impacts, due to the noise inherent in relatively small sample sizes. So, we did something unusual: we turned on refetching for all users for a short period of time. We flipped on refetching and kept an eagle eye on our stats for 2 hours, then reverted; then, we took a careful look at the aggregated data to see how the experiment went.

Surprisingly, the impact on both timing and traffic was relatively low. After some thought, we decided this is most likely because the changes disproportionately impact people on long sessions, say a Flickr tab open for hours or days. Most people don’t hang around that long; they come, they go. Also, the photo page represents north of 90% of our page views, and is exempt from refetching (see Exemptions above).

So where did we end up? A negligible bump in navigation timing and API traffic, and fresher data for all. Perhaps an anticlimactic resolution, but the story we’ve heard today outlines a serious consideration for anyone building an application with a data caching layer: keep in mind from the beginning how you plan to deal with stale data, but in a way that keeps all the other benefits of a single-page application.


Busting through staleness. Yep.