The Data Freshener

Hello Kitty car air freshener
So fresh

 

Change

You may have noticed some changes in Flickr a couple months back. Like, half the site changed. 95% even, by some metrics. Some say CHANGE IT BACK! while others welcome change. Whatever your thoughts, the changes are here, and they mean things. For example, they mean new visual design and better usability. They mean a faster site. Unfortunately, up until recently, they also meant more stale data. Yuck.

Change
Change
 

Why? What? Well…here’s the deal. We have a new-ish frontend stack we’ve been using for the past couple years now. It’s an isomorphic single-page application, runs on node.js, and is generally awesome. We call it Reboot.

hi there / i am the computer
Reboot
 

In the World of Reboot, we treat data with kid gloves. We <3 data. We never want to give it up, never want to let it down. Once we pull data from our APIs, we store the fetched data in your browser so that we don’t have to fetch it again the next time it’s needed. This means faster page loads and faster navigation, and less API traffic (and thus a more stable and scalable API). The data cached in your browser exists as long as the current Reboot session — until you refresh or leave Reboot for a non-Rebooted page.

However, this also meant that data could become stale. You change the date taken of your photo, someone else adds a comment, you navigate to a page with cached data…and you don’t see the changes. Wat? Yeah. So, this was not a huge problem until we moved lots of pages onto Reboot in the beginning of May. From that point forward, most Flickr user sessions have spent their entirety on Reboot, feeding off the same stale loaves of cached data.

bread
Staleness
 

The thinking (design / prototypes)

We considered a number of possibilities for freshening up data during a user session. A brief history of the strategies we sampled, and their results:

1. Refresh on update

Ice Tea
 

The first stab focused on updating data locally after it was changed by the user. Most of our simpler use cases already updated as expected, but some trickier cases with indirect relationships did not. For example, changing the date taken of a photo updated the data model for the photo, but deleting a photo did not necessarily ensure the photo was removed from all the cached albums, groups, and galleries to which it belonged. (Note that the photo was removed correctly from the backend, just not from the cached representation of those entities on the client.)

Cleaning up these relationships using change events between models helped, but didn’t solve all our problems. When someone outside of the local session (read: another user) changed data, it would not reflect in the current session. The only way to catch changes from outside the current session was to be more aggressive about evicting models.

2. Nuclear option

Atomic Bomb Test
 

The pendulum swung all the way in the other direction — instead of surgical removal of data models we knew to be out-of-date, what would happen if we removed all cached data on every navigation? This prototype was quick to build, and incredibly destructive. By doing this, all our cached data always remained as fresh as could be, but we essentially reverted to Web 1.0 — with the exception of the Reboot framework, everything was reloaded on every page.

Not surprisingly, this blew up API traffic (locally only! did not unleash that disaster at scale), and inflated page load times like a Jeff Koons sculpture. It did give us some baseline timing metrics we could point to as worst-case scenarios, however. The next step was to swing the pendulum back toward the middle — to a carefully-knitted solution that would preserve fast page loads and navigation, while ensuring the freshest data we could serve up.

3. Refetch on navigate

fetched
 

At this point, our challenge was to find a solution that would keep navigation fast, API traffic slim, and pick up all changes to session data, whether local or remote. We ended up with a solution we call “refetching”: evicting and requesting new data models as the model is needed by the application. But when?
We could refetch periodically or on a user action; we determined that the best time to trigger a refetch was on navigation — when the user navigates, cached models become eligible for refetching. Specifically, when the user navigates between sections of the site, refetching is triggered. This proved to be the happiest medium between speed and freshness.

A high-level outline of how the refetching strategy works:

  • The user loads a page; data are requested from the API, and models are cached. As new models are created, they’re marked as being fresh.
  • The user navigates to another site section (e.g. Photostream → Search); all freshness marks are removed from all models. They’re now all eligible for refetching.
  • As Reboot builds the new page, it requests data models from the cache. Since they no longer have their seal of freshness, they are refetched, and marked as fresh once retrieved and cached.

One important note — refetching is not triggered on browser back/forward navigation. Users expect near-immediate navigation, thanks to browser caching, when navigating to already-viewed content. Therefore, we refetch only when the user clicks a link to navigate to a new site section.

4. Miscellany

There were a couple other options we considered and rejected from the start, but they’re worth mentioning here.

One was a TTL (time-to-live) algorithm, commonly used in caching applications. TTL algorithms expire data and evict from the cache a certain amount of time after they’re written or last updated. The arbitrary nature of TTL would mean that users would sometimes have fresh data and sometimes stale; it would be fresh more often than without any solution, but freshness would vary arbitrarily and would not result in much of an improvement on user experience.

The other was to write an algorithm that tracks the amount of time since a data model was last accessed, and refetch when it grows too old. While this sounded interesting at first, it has the same flaw as a standard TTL algorithm — freshness becomes arbitrary. It’s also more complex to implement, and might end up not being worth the complexity.

The doing (implementation)

So that was it! Refetch on navigate, all done. Right?….of course not. With the general strategy in place, the devil started sneaking around in all the details. Some of the highlights:

Exemptions

It proved to be not the best idea to evict on all navigation. For example, in Reboot we often preload photo metadata models on pages with lists of photos, in order to make navigation into the photo page snappy. The refetch setup therefore has an exemption config that allows us to easily retain models when navigating into, away from, or between specific site sections.

Child models

We often have parent-child associations between data models. For example, the data model for a photo has a reference to a data model for the author of the photo. When the photo model is refetched, the person model must be refetched as well. This means the function doing the eviction and refetching has to recurse through all child models.

Collections

An issue similar to child models above, but more complex, is the case of a model containing a list of other models. For example, the data model for a person’s photostream contains a list of photo models.

What made this particularly tricky is pagination and filtering — say you load the first 2 pages of your photostream, set your view filter to private, jump to page 5, switch the view to “Date taken”, and navigate away and back to your photostream…imagine the mess of different models with partially-loaded collections. Evicting one parent model, and its children, might evict photo models from the collection within another, without properly refetching. The solution here actually lay in the controller responsible for fetching pages: if a requested page of models is not already completely in-cache, a refetch will always happen to ensure we have all the data, in its freshest state.

Refetch only once per page view

Critical to the refetch-on-navigation strategy is to refetch only once per navigation. This was not too difficult, but essential to get right. We accomplish this by adding a flag when a model is initially fetched and upserted into the cache. When navigating to a new, non-exempt site section, all those flags are cleared, and any model requested by the new page will be refetched. When refetched, the model is again upserted into the cache and marked as fresh, until the next navigation.

But did it fresh?

Go on without me
 

With the thinking and the doing out of the way, it was time to push all this to production. Because these changes are essentially pulling the rug out from underneath the data layer on every navigation, we had to tread very carefully in order to prevent any negative impact to the end user experience.

We did very thorough manual and automated testing across all of Reboot. We left the feature turned on for staff users for a while, to be able to respond to any bug reports. Finally, the time came to test on Real People. There were three things we needed to keep an eye on: errors (of course), impact on page navigation timing, and API traffic. Since refetching implies more requests for data, we needed to be sure that we were keeping the user experience smooth and fast, and also that we weren’t blowing up our data centers.

 
All In
All in
 

In order to get a good read on these things, though, we had to go all in. Letting in just a small percentage of users would not give reliable numbers for timing or traffic impacts, due to the noise inherent in relatively small sample sizes. So, we did something unusual: we turned on refetching for all users for a short period of time. We flipped on refetching and kept an eagle eye on our stats for 2 hours, then reverted; then, we took a careful look at the aggregated data to see how the experiment went.

Surprisingly, the impact on both timing and traffic was relatively low. After some thought, we decided this is most likely because the changes disproportionately impact people on long sessions, say a Flickr tab open for hours or days. Most people don’t hang around that long; they come, they go. Also, the photo page represents north of 90% of our page views, and is exempt from refetching (see Exemptions above).

So where did we end up? A negligible bump in navigation timing and API traffic, and fresher data for all. Perhaps an anticlimactic resolution, but the story we’ve heard today outlines a serious consideration for anyone building an application with a data caching layer: keep in mind from the beginning how you plan to deal with stale data, but in a way that keeps all the other benefits of a single-page application.

#CCC is a breadcat
Busting through staleness. Yep.

Much Photos!

Introducing the New! Shiny! Photolist framework

photolist-sky
Blue skies. Mostly.

Here at Flickr, we have photos. Lots of photos. Like, billions and billions of photos. So, it’s pretty important for us to be able to show you more than one at once.

We have used what we call the “justified algorithm” to lay out photos for a while now, but as we move more and more pages onto our new-ish isomorphic node.js stack, we determined it was time to revisit the algorithm and create an updated implementation.

A few of us here in Frontend-landia got together to figure out all the things this new shiny should be able to do. With a lot of projects in full swing and on the near horizon, we came up with a pretty significant list, including but not limited to:
– Easy for developers to use
– Fit into any kind of container
– Support pagination (in both directions!) and infinite scroll
– Jank-free, butter-silky-baby-smooth scrolling
– Support layouts other than justified, like square thumbnails and grid layout with native aspect ratio

After some brainstorming, drawing of diagrams, and gummi bear consumption, we got to work building out the framework and the underlying algorithm.

Rejustification

photolist-sky
Drawing of diagrams

The basics of the justified algorithm aren’t too complex. The goal is for the layout module to accept a list of photo aspect ratios, and return a list of rectangles. A layout consists of a number of rows of items (photos), each with a target height and allowable height deviation above and below. This, along with the container width, gives us a minimum and maximum row aspect ratio.

Photolist: variable row height
Fig. 1: The justified algorithm: dimensions

We push each photo into a row; once the row is filled up, we move on to the next. It goes a little something like this:

  1. Iterate over each photo in the list to display
  2. Create a new row if there’s not currently an open row
  3. Attempt to add the photo to the current row at its native aspect ratio and at the target row height
  4. If the new row aspect ratio is less than the minimum row aspect ratio, continue adding photos until the aspect ratio is greater than the maximum aspect ratio
  5. Either keep or drop the last added photo, depending on which generates a row aspect ratio closer to the target row aspect ratio; adjust the row height as needed, and seal the row
  6. Repeat until all the photos have been laid out.

Photolist: row filling
Fig. 2: The justified algorithm: row filling

It’s Never That Easy…

The justified algorithm described above is the primary responsibility of the layout module. In practice, however, there are a number of other things the layout must handle to get good results for all use cases, and to communicate the results to other parts of the framework.

Diffs

One key feature of the layout module is how it organizes its results. To minimize the amount of processing required to update photos as the layout changes, the layout module returns pre-sorted diffs, each with a specific purpose:

  • new items, used to create new photos and put them in place
  • layout-changed items, used to resize/reposition existing photos
  • visibility-changed items, used to wake/sleep existing photos
  • widows and orphans (leading and trailing items) (read on!)

The container view can process only the parts of the layout response that are necessary, given the current state of the whole framework, to keep processing time down and keep performance up.

Widows and orphans


Annie the Musical,
Annie the Musical, by Eva Rinaldi

Some photolist pages on Flickr use infinite scrolling, and some display results one page at a time. Regardless of how a page shows its photos, it starts to feel messy when there is an incomplete row of photos hanging off the end of the page. If there is more content in the set, the last row should be full. However, since we fetch photos from the API in fixed batch sizes, things don’t always work out so nicely, leaving “leftovers” in the bottom row. Borrowing from typesetting terminology, we call these leftover photos orphans. (We can also paginate backwards; leftovers at the top are technically widows but we’ll just keep using the term orphans for simplicity.)

The layout notes these incomplete rows and hides them from the rest of the framework until the next page of content loads in. (This led to frequent and questionable metaphors about “orphan suppression” and “orphan rehydration.”) When orphans are to be hidden, the layout simply keeps them out of the diff. When the orphans are brought back in as the next page loads, the layout prepends them to the next diff. The container view is none the wiser.

This logic gets even more fun when you consider that it must perform in all of these use cases:

  • fixed page size (book-style) pagination
  • downward-scrolling infinite pagination
  • upward-scrolling infinite pagination (enter into an “infinite” content set somewhere other than the beginning); this requires right-to-left layout!

There’s also the case of the end of an “infinite” content set (scrolling down to the end or up to the beginning); in these cases, we still want the row to appear complete, and still must maintain the native aspect ratio of the photo. Therefore, we allow the row height to grow as much as it needs in this case only.

Bonus Round!

You might have noticed that Flickr is kind of a big site with, like, lots of photos. And we display photos in lots of different ways, with lots of different use cases. The photolist framework bends over backwards to support all of those, including:

  • forward and backward pagination
  • infinite scroll and fixed page-size pagination
  • specific aspect ratios (e.g. squares)
  • fixed number of rows
  • fast relayout (only a few milliseconds for thousands of photos)

Going into detail on each of those features is way beyond the scope of this blog post, but suffice it to say the framework is built to handle just about anything Flickr can throw at it. The one exception is the upcoming Camera Roll (coming soon to those of you who don’t yet have it!), which is Too Extreme for this framework, so we devised something special just for that page.

The whole enchilada


Mmm... enchiladas,
Mmm… enchiladas, by jeffreyww

The layout is at the heart of the photolist framework, but wait — there’s more! The main components of the framework are the layout (dissected above), the container view / controller, and the subviews (usually containing photos).

Photolist: the whole enchilada
Fig. 3: Relationship of view/controller, layout, and subviews, and changing subview states during downward scroll

The container view does a lot of fancy things, like:

  • loading in photos as you scroll down or paginate
  • triggering a relayout when the container size changes (i.e. when you resize your window)
  • matching up server-rendered HTML with clientside JavaScript objects (see isomorphic JavaScript, and an upcoming blog post about the Hermes stack at Flickr).

Its primary job, though, is to act as the conduit between the layout module and the individual subviews.

Every time a layout is processed or changes, it returns a “layout response” to the container view. The layout response contains a list of rectangles and wake/sleep flags (actually, a list of lists; see Diffs above); the container view relays that new information on to each individual subview to determine position and visibility. The container view doesn’t even need to know about the layout details — each subview adjusts itself to its layout data all on its own.

The subviews each have a decent amount of intelligence of their own, performing such tasks as:

  • choosing the most appropriate photo file size to fit the layout rectangle
  • adding/removing itself to/from the DOM as instructed by the layout to maintain good scroll performance
  • providing an annotation and interaction layer for titles, faves, comments, etc.

Coming soon to a webpage near you

The new photolist framework is certainly not a one-size-fits-all solution; it’s tailored for Flickr’s specific use cases. However, we tried to design and build it to be as broadly useful for Flickr as possible; as we continue to move parts of the site onto the new frontend stack and innovate new features, it’s critical to have solid components upon which we can build Flickr’s future. The layout algorithm is probably useful for many applications though, and we hope you gained some insight into how you might implement your own.

The photolist framework is already live in a number of places on the site, including the new Unified Search pages (currently in Beta), the Create / Wall art pages, the Group pool preview, and is coming soon to a number of other pages.

As always, if you’re interested in helping with that “and more” part, we’d love to have you! Stop by our jobs page and drop us a line.