Perceptual Image Compression at Flickr

Archie Russell, Peter Norby, Saeideh Bakhshi

At Flickr our users really care about image quality.  They also care a lot about how responsive our apps are.  Addressing both of these concerns simultaneously is challenging;  higher quality images have larger file sizes and are slower to transfer.   Slow transfers are especially noticeable on mobile devices.   Flickr had historically aimed for high quality at the expense of larger files, but in late 2014 we implemented a method to both maintain image quality and decrease the byte-size of the images we serve to users.   As image appearance is very important to our users,  we performed an extensive user test before rolling this change out.   Here’s how we did it.

Background:  JPEG Quality Settings

Fig 1.    JPEG settings vs file size for a test image.

JPEG compression has several tuneable knobs.   The q-value is the best known of these; it adjusts the level of spatial detail stored for fine details;  a higher q-value typically keeps more detail.    However,  as q-value gets very close to 100,  file size increases dramatically,  usually without improving image appearance.

If file size and app performance isn’t an issue,  dialing up q-value is an easy way to get really nice-looking images; this is what Flickr has done in the past.    And if appearance isn’t very important,  dialing down q-value is a viable option.    But if you want both,  you’re kind of stuck.   Additionally,  q-value isn’t one-size-fits-all,  some images look great at q-value 80 while others don’t.

Another commonly adjusted setting is chroma-subsampling,  which alters the amount of color information stored in a JPEG file.    With a setting of 4:4:4,  the two chroma (color) channels in a JPG have as much information as the luminance channel.   In an image with a setting of 4:2:0, each chroma channel has only a quarter as much information as in an a 4:4:4 image.

 q=96,  chroma=4:4:4 (125KB) q=70, chroma=4:4:4 (67KB)
q=96, chroma=4:2:0 (62KB)  q=70, chroma=4:2:0 (62KB)

Table 1:   JPEG stored at different quality and chroma levels.   The upper left image is saved at high quality and chroma level; notice the color and detail in the folds of the red flag.   The lower right image has the lowest quality;  notice artifacts along the right edges of the red flag.

Perceptual JPEG Compression

Ideally we’d have an algorithm which automatically tuned all JPEG parameters to make a file smaller, but which would limit perceptible changes to the image.  Technology exists that attempts to do this and can decrease image file size by 30-50%. This compression ratio is highly dependent on image content and dimensions.

compressed: 112KB non-compressed: 224KB

Fig 2. Compressed cropped JPEG is 50% smaller than not-compressed cropped JPEG, above, with no obvious defects.  Compression ratio is similar for a compressed 2048-pixel wide JPEG (475KB) of the entire scene and its corresponding not-compressed JPEG (897KB). 

We were pleased with perceptually compressed images in non-structured examinations.  The compressed images were smaller and nearly indistinguishable from their sources.   But we wanted to really quantify how well the technology worked before considering incorporating it into Flickr.  The standard computational tools for evaluating compression, such as SSIM, are fairly simplistic and don’t do a great job at modeling how a user sees things.  To really evaluate this technology had to use a better measure of perceptibility:  human minds.

The Gamified Taste Test

To test whether our image compression would impact user perception of image quality, we put together a “taste test.”  The taste test is constructed as a game with multiple rounds where users look at both compressed and uncompressed images.  Users accumulate points the longer they play, and get more points for doing well at the game.  We maintained a leaderboard to encourage participation and used only internal testers.The game’s test images came from a diverse collection of 250 images contributed by Flickr staff.  The images came from a variety of cameras and included a number of subjects from photographers with varying skill levels.

sampling of images used in taste test
Fig 3. A sampling of images used in our taste test.

In each round, our test code randomly select a test image, and present two variants of this image side by side.  50% of the time we present the user two identical images; the rest of the time we present one compressed image and one uncompressed image.  We ask the tester if the two images look the same or different and we’d expect a user choosing randomly OR a user unable to distinguish the two cases would answer correctly about half the time.  We randomly swap the location of the compressed images to compensate for user bias to the left or the right.  If testers choose correctly, they are presented with a second question: “Which image did you prefer, and why?”

two kittens in a video game
Fig 4. Screenshot of taste test.

Our test displays images simultaneously to prevent testers noticing a longer load time for the larger, non-compressed image.  The images are presented with either 320, 640, or 1600 pixels on their longest side.  The 320 & 640px images are shown for 12 seconds before being dimmed out.  The intent behind this detail is to represent how real users interact with our images.  The 1600px images stay on screen for 20 seconds, as we expect larger images to be viewed for longer periods of time by real users.   We award 100 points per round, regardless of whether a tester chose correctly and also award a bonus of 400 points when a tester correctly identifies whether images were identical or different.  We update the tester’s score every five tests so that the user perceives an increasing score without being rewarded immediately for any particular behavior.

Taste Test Outcome and Deployment

We ran our taste test for two weeks and analyzed our results.    Although we let users play as long as they liked,  we skipped the first result per user as a “warm-up” and considered only the subsequent ten results,  this limited the potential for users training themselves to spot compression artifacts.   We disregarded users that had fewer than eleven results.

images total results # labeled “identical” by tester % labeled “identical” by tester
two identical images 368 253 68.8%
one compressed, one non-compressed 352 238 67.6%

Table 2.   Taste test results.   Testers select “identical” at nearly the same rate, whether the input is identical or not.

When our testers were presented with two identical images, they thought the images were identical only 68.8% of the time(!), and when presented with a compressed image next to a non-compressed image,  our testers thought the images were identical slightly less often:  67.6% of the time.  This difference was small enough for us,  and our statisticians told us it was statistically insignificant.  Our image pairs were so similar that multiple testers thought all images were identical and reported that the test system was buggy. We inspected the images most often labeled different, and found no significant artifacts in the compressed versions.

So even in this side-by-side test,  perceptual image compression is just barely noticeable when images are presented side-by-side.  As the Flickr website wouldn’t ever show compressed and uncompressed images at the same time, and the use of compression had large benefits in storage footprint and site performance, we elected to go forward.

At the beginning of 2014 we silently rolled out perceptual-based compression on our image thumbnails (we don’t alter the “original” images uploaded by our users).  The slight changes to image appearance went unnoticed by users, but user interactions with Flickr became much faster,  especially for users with slow connections, while our storage footprint became much smaller.  This was a best-case scenario for us.

Evaluating perceptual compression was a considerable task,  but it gave the confidence we needed to apply this compression in production to our users.    This marked the first time Flickr had adjusted image settings in years, and, it was fun.
High Score List
Fig 5.  Taste test high score list

Epilogue

After eighteen months of perceptual compression at Flickr,  we adjusted our settings slightly to shrink images an additional 15%.   For our users on mobile devices,  15% fewer bytes per image makes for a much more responsive experience.We had run a taste test on this newer setting and users were were able to spot our compression slightly more often than with our original settings.   When presented a pair of identical images, our testers declared these images identical 65.2% of the time,  when presented with different images,  of our testers declared the images identical 62% of the time.   It wasn’t as imperceptible as our original approach, but, we decided it was close enough to roll out.

Boy were we wrong!   A few very vocal users spotted the compression and didn’t like it at all.    The Flickr Help Forum had a very lively thread which Petapixel picked up.  We beat our heads against the wall considered our options and came up with a middle path between our initial and follow-on approaches,  giving us smaller, faster-to-load files while still maintaining the appearance our users expect.

Through our use of perceptual compression,  combined with our use of on-the-fly resize and COS,  we’ve been able to decrease our storage footprint dramatically, while simultaneously improving user experience. It’s a win all around but we’re not done yet — we still have a few tricks up our sleeves.

Powering Flickr’s Magic view by fusing bulk and real-time compute

Try it for yourself!

You can try out Flickr’s Magic View on your own photos here, and you can download a working code sample of the simplified lambda architecture here: https://github.com/yahoo/simplified-lambda

Introduction

In this post we’re going to talk about how we came up with a novel revision of the Lambda Architecture for fusing large-scale bulk compute with streaming compute to power Flickr’s Magic View. We were able to create a responsive, real time database operating at a scale of tens of billions of records, with tens to hundreds of millions of records updated per day. We turned to Yahoo’s Hadoop stack to find a way to build this at the massive scale we needed.

Magic View

Figure 1. Magic View in action

Motivation: the Magic View

Flickr’s Magic View takes the hassle out of organizing your photos by applying our computer-vision technology to automatically recognize objects or styles in your photos and present them to you in the Camera Roll’s scrolling view. This all happens in real time as soon as a photo is uploaded, it is categorized and placed into the Magic View.

Aggregating computer vision tags

When a photo is uploaded, it is processed by a computer vision pipeline to generate a set of computer vision tags, which are text labels of the contents of the image. We already had an existing architecture for stream computation of tags on upload, but to implement the Magic View, we needed to maintain per-user reverse indexes and some aggregations of the tags. And we needed to make sure all the data was consistent if a photo was added, removed or updated these indexes and aggregations would have to be updated to reflect this. Finally, we needed to initialize the system with tags for 12 billion photos and videos and run periodic backfills (every time we improved our computer vision algorithms and to cover cases where the stream compute missed images).

The Problem

We initially computed a snapshot of the Magic View indexes and aggregations using map-reduce (via Apache Oozie and Apache Pig), and we were happy with the quick turnaround time (about 7 hours). We considered updating Magic View as a daily batch job, but soon realized this would not give our users the responsive, “live” experience we wanted. So, we built a streaming data layer using Apache Storm and were soon able to update the categories in Magic View in real-time.

The next time we needed to run a backfill, we explored using this streaming layer to load the data. Unfortunately, the overhead of the read-modify-write process was simply too much for a load of this size — after kicking off the process we estimated it would take 28 days this way — much longer than the seven hours we had achieved with a bulk load.

Twenty-eight days was a non-starter – we realized we needed a way to update our bulk aggregations independently of the real-time data streaming in. Solving this problem is how we arrived at our revision to Lambda Architecture. Before digging into the solution, let’s do a quick review of the Lambda Architecture.  If you’re already familiar with it, you can skip this next section.

The Lambda Architecture

We’ll start with Nathan Marz’s book ‘Big Data’, which proposes the database concept of  ‘Lambda Architecture.’ In his analysis, he states that a database query can be represented as a function – Query – which operates on all the data:

result = Query(all data)

In the Lambda architecture, a traditional database is replaced with both a real time and a bulk database. Then query function becomes a “combiner” function of independent queries to each database:

result = Combiner(Query(real time data) + Query(bulk data))

An example of a typical Lambda Architecture is shown in figure 2. It is powered by an append-only queue for its system of record, which is fed by a real time stream of events. Periodically, all the data in the queue is fed into a bulk computation which pre-processes the data to optimize it for queries, and stores these aggregations in a bulk compute database. The real time event stream drives a stream computer, which processes the incoming events into real time aggregations. A query then goes via a query combiner, which queries both the bulk and real time databases, computes the combination, and stores the result.

Typical Lambda Architecture

Figure 2. Typical Lambda Architecture

While relatively new, Lambda Architecture has enjoyed popularity and a number of concrete implementations have been built. Some significant examples are the distributed analytics platform druid, Twitter’s Summingbird, and FiloDB. These implementations conveniently abstract away the databases behind the query combiner.

A significant advantage with this style of architecture is robustness and fault-tolerance via eventual consistency. If a piece of data is skipped in the real time compute there is a guarantee that it will eventually appear in the bulk compute database.

Criticism of the Lambda Architecture has centred around the complicated nature of the combiner. The combiner incurs a developer and systems cost from the need to maintain two different databases. It can be challenging to make sure both systems give the same result. Merging the two queries can become complicated, and finally, more points of failure may be introduced.

The “Ah-ha” Moment

Back to the problem. The data access layer we used for streaming compute uses the atomic read-modify-write pattern to ensure we write consistent data, one record-at-a-time to Apache HBase (a BigTable-style, non-relational database). Again, since this pattern was so much slower in the backfill case we needed to figure out how to get both consistent updates for streaming and  fast loads of the full dataset. Since our bulk data was static, we realized that if we relaxed the consistency constraint we could just run a fast, streaming, write-only load of the bulk data, bringing the load time back down to hours instead of days.

But how could we get around the consistency requirements? We didn’t want a bulk load to clobber data being written from the real time compute process. The insight was that we could just write bulk and streaming data to different column families in the same HBase row. So we added the concept of real time columns and bulk columns in a single row. Basically, bulk loads write to one set of columns and real time writes go to a different set of columns. Since HBase columns are sparse and data is updated relatively slowly we don’t pay much in storage or IO.

We could  now simplify the equation back to:

result = Combiner(Query(data))

The two sets of columns are managed separately by the real time and bulk subsystems. At query time, we perform a single fetch using the HBase API to get both the bulk and real time data. A separate combiner process assembles the final result.

Implementation

 Magic View backend system overview

Figure 3. Magic View Architecture

Figure 3 shows an overview of the system and our enhanced Lambda architecture. For the purposes of this discussion, a convenient abstraction is to consider that each row in the HBase table represents the current state of a given photo. The combiner stage is abstracted into a single Java process, which collects data from HBase and runs transformations on the data and sends it to a Redis cache which is used by the serving layer for the site.

Consistency on read in HBase — the combiner

We have two sets of columns to go with each row in HBase: bulk and real time. The combiner determines the final value for each attribute at read. In the case where data exists for real time but not for bulk (or vice versa) then there is only one value to choose. In the case where they both exist we always choose the real time value. This keeps the combiner very simple and fast.

There is a trick though – whenever we do a backfill, we may need to repair the row since the backfill data may be newer than any real time data that is already present. It turns out this slows down the backfill from seven hours to about 14 — still far faster than loading with read-modify-write.

Production throughput

At scale, this architecture has been able to keep up very comfortably with production load. We can simultaneously run backfills to HBase and serve user information at the same time without impacting latency or the user experience.

User experience

An important measure for how the system works is how the viewer perceives it. The slowest part of the system is paging data from HBase into the serving cache; median time for above-the-fold latency – i.e. enough data is available to render the page – is around 10ms.

Future directions

Our experience has been very positive so far with Magic View and we’re looking at how we might enable users to browse their photos in other dimensions (location or color for example). Early tests have shown that building an OLAP or data cube in this architecture is certainly possible but it’s less clear that it will scale well.

Contributors: Peter Welch, Bhautik Joshi, Hugo Haas, Srinivasan Singanallur, Ayan Ray, Pierre Garrigues, Ben Firestone, Sai Madhavan, Tim Miller

Thanks to Nathan Marz for reviewing this post.

Flickr September 2014

Like this post? Have a love of online photography? Want to work with us? Flickr is hiring mobile, back-end and front-end engineers, in our San Francisco office. Find out more at flickr.com/jobs.

The Data Freshener

Hello Kitty car air freshener
So fresh

 

Change

You may have noticed some changes in Flickr a couple months back. Like, half the site changed. 95% even, by some metrics. Some say CHANGE IT BACK! while others welcome change. Whatever your thoughts, the changes are here, and they mean things. For example, they mean new visual design and better usability. They mean a faster site. Unfortunately, up until recently, they also meant more stale data. Yuck.

Change
Change
 

Why? What? Well…here’s the deal. We have a new-ish frontend stack we’ve been using for the past couple years now. It’s an isomorphic single-page application, runs on node.js, and is generally awesome. We call it Reboot.

hi there / i am the computer
Reboot
 

In the World of Reboot, we treat data with kid gloves. We <3 data. We never want to give it up, never want to let it down. Once we pull data from our APIs, we store the fetched data in your browser so that we don’t have to fetch it again the next time it’s needed. This means faster page loads and faster navigation, and less API traffic (and thus a more stable and scalable API). The data cached in your browser exists as long as the current Reboot session — until you refresh or leave Reboot for a non-Rebooted page.

However, this also meant that data could become stale. You change the date taken of your photo, someone else adds a comment, you navigate to a page with cached data…and you don’t see the changes. Wat? Yeah. So, this was not a huge problem until we moved lots of pages onto Reboot in the beginning of May. From that point forward, most Flickr user sessions have spent their entirety on Reboot, feeding off the same stale loaves of cached data.

bread
Staleness
 

The thinking (design / prototypes)

We considered a number of possibilities for freshening up data during a user session. A brief history of the strategies we sampled, and their results:

1. Refresh on update

Ice Tea
 

The first stab focused on updating data locally after it was changed by the user. Most of our simpler use cases already updated as expected, but some trickier cases with indirect relationships did not. For example, changing the date taken of a photo updated the data model for the photo, but deleting a photo did not necessarily ensure the photo was removed from all the cached albums, groups, and galleries to which it belonged. (Note that the photo was removed correctly from the backend, just not from the cached representation of those entities on the client.)

Cleaning up these relationships using change events between models helped, but didn’t solve all our problems. When someone outside of the local session (read: another user) changed data, it would not reflect in the current session. The only way to catch changes from outside the current session was to be more aggressive about evicting models.

2. Nuclear option

Atomic Bomb Test
 

The pendulum swung all the way in the other direction — instead of surgical removal of data models we knew to be out-of-date, what would happen if we removed all cached data on every navigation? This prototype was quick to build, and incredibly destructive. By doing this, all our cached data always remained as fresh as could be, but we essentially reverted to Web 1.0 — with the exception of the Reboot framework, everything was reloaded on every page.

Not surprisingly, this blew up API traffic (locally only! did not unleash that disaster at scale), and inflated page load times like a Jeff Koons sculpture. It did give us some baseline timing metrics we could point to as worst-case scenarios, however. The next step was to swing the pendulum back toward the middle — to a carefully-knitted solution that would preserve fast page loads and navigation, while ensuring the freshest data we could serve up.

3. Refetch on navigate

fetched
 

At this point, our challenge was to find a solution that would keep navigation fast, API traffic slim, and pick up all changes to session data, whether local or remote. We ended up with a solution we call “refetching”: evicting and requesting new data models as the model is needed by the application. But when?
We could refetch periodically or on a user action; we determined that the best time to trigger a refetch was on navigation — when the user navigates, cached models become eligible for refetching. Specifically, when the user navigates between sections of the site, refetching is triggered. This proved to be the happiest medium between speed and freshness.

A high-level outline of how the refetching strategy works:

  • The user loads a page; data are requested from the API, and models are cached. As new models are created, they’re marked as being fresh.
  • The user navigates to another site section (e.g. Photostream → Search); all freshness marks are removed from all models. They’re now all eligible for refetching.
  • As Reboot builds the new page, it requests data models from the cache. Since they no longer have their seal of freshness, they are refetched, and marked as fresh once retrieved and cached.

One important note — refetching is not triggered on browser back/forward navigation. Users expect near-immediate navigation, thanks to browser caching, when navigating to already-viewed content. Therefore, we refetch only when the user clicks a link to navigate to a new site section.

4. Miscellany

There were a couple other options we considered and rejected from the start, but they’re worth mentioning here.

One was a TTL (time-to-live) algorithm, commonly used in caching applications. TTL algorithms expire data and evict from the cache a certain amount of time after they’re written or last updated. The arbitrary nature of TTL would mean that users would sometimes have fresh data and sometimes stale; it would be fresh more often than without any solution, but freshness would vary arbitrarily and would not result in much of an improvement on user experience.

The other was to write an algorithm that tracks the amount of time since a data model was last accessed, and refetch when it grows too old. While this sounded interesting at first, it has the same flaw as a standard TTL algorithm — freshness becomes arbitrary. It’s also more complex to implement, and might end up not being worth the complexity.

The doing (implementation)

So that was it! Refetch on navigate, all done. Right?….of course not. With the general strategy in place, the devil started sneaking around in all the details. Some of the highlights:

Exemptions

It proved to be not the best idea to evict on all navigation. For example, in Reboot we often preload photo metadata models on pages with lists of photos, in order to make navigation into the photo page snappy. The refetch setup therefore has an exemption config that allows us to easily retain models when navigating into, away from, or between specific site sections.

Child models

We often have parent-child associations between data models. For example, the data model for a photo has a reference to a data model for the author of the photo. When the photo model is refetched, the person model must be refetched as well. This means the function doing the eviction and refetching has to recurse through all child models.

Collections

An issue similar to child models above, but more complex, is the case of a model containing a list of other models. For example, the data model for a person’s photostream contains a list of photo models.

What made this particularly tricky is pagination and filtering — say you load the first 2 pages of your photostream, set your view filter to private, jump to page 5, switch the view to “Date taken”, and navigate away and back to your photostream…imagine the mess of different models with partially-loaded collections. Evicting one parent model, and its children, might evict photo models from the collection within another, without properly refetching. The solution here actually lay in the controller responsible for fetching pages: if a requested page of models is not already completely in-cache, a refetch will always happen to ensure we have all the data, in its freshest state.

Refetch only once per page view

Critical to the refetch-on-navigation strategy is to refetch only once per navigation. This was not too difficult, but essential to get right. We accomplish this by adding a flag when a model is initially fetched and upserted into the cache. When navigating to a new, non-exempt site section, all those flags are cleared, and any model requested by the new page will be refetched. When refetched, the model is again upserted into the cache and marked as fresh, until the next navigation.

But did it fresh?

Go on without me
 

With the thinking and the doing out of the way, it was time to push all this to production. Because these changes are essentially pulling the rug out from underneath the data layer on every navigation, we had to tread very carefully in order to prevent any negative impact to the end user experience.

We did very thorough manual and automated testing across all of Reboot. We left the feature turned on for staff users for a while, to be able to respond to any bug reports. Finally, the time came to test on Real People. There were three things we needed to keep an eye on: errors (of course), impact on page navigation timing, and API traffic. Since refetching implies more requests for data, we needed to be sure that we were keeping the user experience smooth and fast, and also that we weren’t blowing up our data centers.

 
All In
All in
 

In order to get a good read on these things, though, we had to go all in. Letting in just a small percentage of users would not give reliable numbers for timing or traffic impacts, due to the noise inherent in relatively small sample sizes. So, we did something unusual: we turned on refetching for all users for a short period of time. We flipped on refetching and kept an eagle eye on our stats for 2 hours, then reverted; then, we took a careful look at the aggregated data to see how the experiment went.

Surprisingly, the impact on both timing and traffic was relatively low. After some thought, we decided this is most likely because the changes disproportionately impact people on long sessions, say a Flickr tab open for hours or days. Most people don’t hang around that long; they come, they go. Also, the photo page represents north of 90% of our page views, and is exempt from refetching (see Exemptions above).

So where did we end up? A negligible bump in navigation timing and API traffic, and fresher data for all. Perhaps an anticlimactic resolution, but the story we’ve heard today outlines a serious consideration for anyone building an application with a data caching layer: keep in mind from the beginning how you plan to deal with stale data, but in a way that keeps all the other benefits of a single-page application.

#CCC is a breadcat
Busting through staleness. Yep.