Safer Internet Day and Open Source Codes of Conduct

Last week the world celebrated Safer Internet Day, a day used to call upon stakeholders to join together to make the internet a safer and better place for all, and especially for children and young people. Here at Flickr, we believe in creating spaces on the internet that take into account the safety of all of our contributors, especially our youngest and most underrepresented. So, to celebrate that and to continue the work of making our spaces safer and more accessible to all, we have added a code of conduct to our most trafficked open source repositories on GitHub.

What’s/Why Open Source?


“100_0509” by Nick Quaranto is licensed under CC BY-SA 2.0

Open source is a method of development that allows software to be available publicly so that contributors can modify, add, and remove code as they see fit in order to create a more robust codebase colored with the ideas and innovations of many developers rather than just a few. At Flickr we believe that innovation happens when we have a diverse and widespread set of voices coming together to suggest changes. Open source allows us to harness the power of these voices to create the very best software we can. 

Flickr has 15 open source repositories, 4 of which are actively contributed to. Of those four, none had a formal code of conduct to govern contributions to the code base or interpersonal interactions between developers actively working on the code… until now!

Why a code of conduct?

Codes of conduct are extremely common and important in the open source community. Groups like Linux, Homebrew, Bootstrap, and Kubernetes all have codes of conduct governing the use of and contributions to their open source projects. Because open source allows such a diverse set of voices to express themselves, conflicts can arise and unfortunately not all come with the best of intent. 


“Bullying” by Senado Federal is licensed under CC by 2.0

Codes of conduct allow us to have a preconceived understanding of what interactions in our community are meant to look like and why we hold these expectations of members. Codes of conduct can range from what is expected of interpersonal interactions (e.g. Demonstrate kindness and empathy toward other developers in pull request reviews) to more generalized expectations (e.g. Focus on what is best for the community as a whole rather than individual desires or needs). Codes of conduct not only benefit the community in its entirety, but also allow us to focus on protecting the psychological safety of members of our community who are most at risk. We care about all of our members while also recognizing the need for specific and directed language to protect members of underrepresented groups. The best way to do this is to have a written code of conduct with specific, actionable steps used to govern the safety of the community. 

Why Contributor Covenant?

In order to protect underrepresented groups and to foster a strong and healthy open source community here at Flickr, we thought about whether it would be best to write our own code of conduct specifically tailored to what we value at Flickr or whether it would be better to find a code of conduct already in use that we could use to guide our own open source communities. We ended up finding a code of conduct already in use by quite a few well respected organizations that directly spoke to our most important operating principles.

Contributor Covenant is a code of conduct for participating in open source communities which explicitly outlines expectations in order to create a healthy open source culture. Contributor Covenant has been adopted by over a hundred thousand open source communities and projects since 2014 and is used by Linux, Babel, Bootstrap, Kubernetes,, Twilio, Homebrew-Cask, and Target to name a few. With such well-respected organizations turning to Contributor Covenant, it was something we thought we would be foolish not to consider. 

As we considered, we realized that Contributor Covenant had all of our values specified in a wonderful document that was only a little over a page long. Both accessible in its readability and shortness and robust enough to do the job of protecting underrepresented contributors on our open source repositories, we had found a perfect marriage of all of the things that we wanted in a code of conduct, while also allowing us to become part of a large scale community adopting a singular vision for a healthy, safe, and innovational open source community.

A Pluggable Solution for API Observability on our PHP System

When people think about tech and innovation, they often talk about the “next generation.”

Just use GraphQL and life will be easier, many will tell you.

The future of cloud-native is Lambda, claim others.

Unfortunately, most of the conversations don’t talk about the question that is most top-of-mind for me: what does the next generation of tools look like for legacy systems?

As a Senior Engineering Manager for Flickr’s backend team, here’s one of the major issues my team faces: we have a ton of code that engineers need to understand in order to safely and quickly ship changes. Flickr has built a product loved by millions of photographers for nearly two decades and we have some real history in our code base. You can imagine the amount of work it takes to maintain the stability of our large, complex, public-facing API—which impacts not just customers who use our API, but our own web, mobile, and desktop clients.

The difficulty of wrangling a legacy code base is what led us to be interested in Akita, an observability company going after the dream of “one-click” observability. Akita’s first product passively watches API traffic using packet capture (PCAP) to provide automated API monitoring, automatically infer the structure of API endpoints, and automatically detect potential issues and breaking changes. Akita’s goal is to make it possible for organizations like ours, with our hundreds of thousands of lines of legacy code, to understand system behavior in order to move quickly.

But there’s a catch: Akita’s first product, currently in beta, works only for representational state transfer (REST) APIs. Our API at Flickr, nearly twenty years old, coincides with the rise of REST. This blog post focuses on how I used Akita to introduce observability to our code base.

Moving fast with legacy systems

First, let me give some context on high-level responsibilities of backend engineering at Flickr. Since moving Flickr into the cloud two years ago we’ve had more time to focus on modernizing our services and improving our developer experience. This puts us in a much better position to build new features than before—but first, we need to streamline how we get things done, which is not nearly as simple as it sounds.

Today, we serve up around a billion photos daily from millions of photographers. Nearly every Flickr API request executes legacy code in some way—code that is less tested, less documented, and sometimes dangerous to mess with. A great deal of care has to be taken to avoid disruptions. And when new features need to interact with older features, this can get complex fast! On top of all that, we need to find ways to help our small but mighty team focus their limited time and attention while navigating the old and the new, without the luxury of handing this problem over to an internal tools team.

Our difficulty getting a handle on our legacy systems led us to become excited about using Akita for easy observability. Akita promised to tell us about our API interactions and potential issues with the API, all by passively watching API traffic. But there was, as I mentioned, a catch: Akita works only for REST APIs right now, and our API is… RESTish. Most notably, we never adopted the REST convention of using distinct URL paths for each service endpoint, and we rely heavily on passing parameters through the query string, or form-encoded in POSTs. This situation has historically made it hard for us to use other API tools as well.

Getting Akita to work for my REST-like format

Thankfully our PHP request handlers are plug and play so I quickly whipped up a new proof-of-concept handler showing that we could start getting visibility into our API endpoints and their behavior using Akita. This gave me the ability to generate Akita traces using curl and the Akita command line interface (CLI) tool out of the box, but only within my local dev environment.

A screen shot of the Akita web console. The shot depicts the detected aPI specification of a single Flickr API call.

Right away I spotted some things to improve, and more ideas came that afternoon. I wanted to put our `api_key` parameter into an Authorization header, and remove the `method` parameter since I’d used it in a fake service path. Also, our API returns a 200 HTTP Status on errors, including an element `stat` indicating failure. I wanted those to be HTTP 400s.

But I had a conundrum: Akita works best when observing production traffic. Real, production API requests at production load will really fill in the nooks and crannies of our API models. My progress showed it would be so worth it to go further, so I met with the Akita team and discussed using their Go-based plugin system to transform our live requests into a desirable format based on my proof-of-concept. It turns out that most of Akita’s tooling is open source and I could work on the plugin myself! This turned out to be the key to making Akita work with our RESTish format.

Fitting into the Go plugin format

Exciting news! I just needed to turn my prototype into something that I could run with the Akita agent every time.

The Akita CLI has a mechanism for dynamically loading plugins, which can operate on the captured and parsed data before it is sent to the Akita cloud. My transformations of the API format into a more REST-like format could be packaged that way.

I soon discovered that I was the first person to try building a third-party plugin. Akita told me that they used the plugin architecture internally to package a non-open-source plugin that infers data formats, but that is compiled into the client. 

My early attempts at working with the released CLI version resulted in nothing but discouraging error messages like:

fatal error: runtime: no plugin module data

I worked around this by compiling the open-source version of the Akita CLI myself and pointing the plugin build at the exact same version of the source code. An engineer at Akita reported the same problem and concluded that the plugin needed to be built at the same time as the program that will use it. Go’s idiosyncratic linking conventions seem to make it virtually impossible for such an external plugin to satisfy its dependencies against multiple versions of the base binary. Later, we learned the following from Russ Cox, confirming that our decision to abandon the external plugin approach was wise:

Screenshot of a Tweet from Russ Cox that responds to a quoted question asking what the Go Team's direction for Go Plugins is. Russ answers, "Kind of rudderless right now. Higher priority things are taking all our cycles, so mostly benign neglect for plugins. Sorry."

To make this process repeatable, we adopted a hybrid approach where I added the Flickr-specific transformations of the API as a plugin in a newly created Akita open source repository. (You can check out the code here!) Akita will compile that plugin in all their future CLI builds so there would be no problem with dynamic loading. I can enable the plugin for my traces with a command-line flag and use the most recent version of the CLI without recompiling my plugin to match. This is the same way Akita incorporates modules for type inference. Other users can incorporate contributions in a similar way.

Using Akita to move faster

Now that we have the plugin written, we’re moving toward integration with our production environment. Here’s an example of what we’re able to understand with Akita. Note that the response element has been detected as both datetime and string data types. We should fix that!Another screenshot of the Akita web console. Two entries are highlighted, showing instances of mixed data types detected for the same field.

Here’s what we’re integrating Akita to do:

  • Taking snapshots of our API endpoints. Having a large API footprint makes it all the more important for us to generate defacto specifications and curate the result, rather than try to hand-write specifications from scratch. Once we have a solid OpenAPI3 specification we can make tactical changes to ensure the API adheres to the spec without doing a full-on rewrite of the backend.
  • Identifying changes to our API endpoints. The ability to detect unexpected or off-spec responses will make it a lot easier for us to code from the client side, particularly the Android and iOS mobile apps. We expect to reduce defensive exception handling on the client side, making our mobile code easier to work with and less of a resource hog. 
  • Tracking our inter-service communication as we modernize our infrastructure. Observing the interaction between services is increasingly important as we use more and more microservices and refine our service oriented architecture. For example, having a high level view of impacted services during a production incident will expedite service recovery and get our users back to doing what they love.

While we currently have metrics, monitoring, and logging in place with AWS CloudWatch and Splunk, Akita is able to provide us the information we need in a structured, per-endpoint way, making it easier for our developers to understand what’s going on and focus their attention on what matters. Stay tuned for updates!

Thoughts on tools for legacy systems in general

I see our partnership with Akita as a key part of the beginning of our effort to innovate how to move fast with a legacy system. This problem is not unique to us: Facebook has built multiple type systems for multiple different dynamically typed languages to deal with it! But the fact that we can’t spin off dedicated teams to write compilers for PHP places its own set of constraints. And there are many companies that are in a similar boat: small or medium sized engineering teams of passionate, driven, smart people working on products they love and want you, their customer, to love, too.

I love working on these sorts of problems because they are among the hardest to solve. It takes a lot more than finding a new database or coming up with a faster algorithm; working with large legacy codebases presents challenges that seem intractable. In my experience, you need the right balance of organization, process, tooling, and grit. 

Successful companies eventually reach the point where addressing these things is critical and necessary or delivering value slows to a crawl. I’ve found Flickr to be a unique combination of legacy systems, wonderful engineering heritage, and forward-looking, motivated people. If you work somewhere that would benefit from improved production and development observability, you should check out what Akita is up to. And if you’re interested in working with us here, check out the Flickr jobs page!


Many thanks to Jean Yang, Mark Gritter, and the Akita team for their assistance with this post and our integration with their marvelous new product!

Flickr Engineering Team Vision & Guiding Principles

There’s a rich history of engineering innovation and excellence at Flickr. The team has been involved in the development of specs and open standards, been an early adopter of technologies like NodeJS, and successfully migrated from Yahoo data centers to AWS in less than a year! 

Through all the years, there has been a sense of vision and principles on the team, but nothing formally documented.  We were inspired by Artsy and Amazon to create a team vision and guiding principles, and share those with the team, job candidates, and the public.

We hope that this document evolves with the team, and look forward to discussing it with future coworkers!


Flickr Engineering Team Vision

Flickr Engineering exists to design, build, and maintain software that enables the global community of photography enthusiasts to find inspiration, connect, and share. We succeed by building a culture of innovation, being generous with providing and soliciting feedback, embracing and sharing our strengths, and delivering consistently, reliably, and predictably.


Flickr Engineering Guiding Principles

1. Psychological Safety

You and your coworkers are the most important element of the engineering organization. To learn, grow, and be productive as an engineer, you must feel safe at work. Everyone at Flickr Engineering, especially those in leadership positions, are responsible for fostering a psychologically safe work environment.

Ways to do that will include:

  • Admitting and discussing mistakes
  • Framing work as a learning experience
  • Ensuring communication and teamwork is inclusive and respectful
  • Growing a team comprised of individuals across various diverse backgrounds
  • Engaging in continuous feedback and praise to coworkers
  • Modeling open and respectful communication
  • Sharing knowledge and opportunities to help each other level up

Further Reading:



2. Incremental Revolution

Introduce new technologies slowly and incrementally. Avoid re-writes. Build tools to allow hybrids of different types of technology when possible. Sometimes you need to make a big leap, but aim to approach them incrementally.

Explore bleeding-edge technologies on projects with an end-date that can become safely classed “done.” These can be used to inform decisions on long-running projects. Run spike projects when trying to settle between technology trade-offs.

Examples include:

  • Developing or adapting code to macro-services 
  • Avoid creating more stacks to support, by not anticipating the scale of the work involved



3. Own Your Dependencies

Take the dependencies which fit your problem and make them better. If there’s no perfect match, take a 90% fit and contribute back to get it to 100%.

We use dependencies to save re-inventing, but it doesn’t mean our responsibility stops at installing it. Security patches, updates, roadmap changes are all vital to be aware of and tracked.

Our goal will be to feel like we can influence the design and execution of all the components in our apps. Aim to be a trusted contributor to the communities surrounding your work, communicate clearly and publicly, and be empathetic to the priorities of others.


  • Node modules for NodeJS projects



4. Done Means Done

Being responsible for your code extends beyond delivery date. Done being done means feeling confident that you’ve protected your changes with tests, ensured deployment works, and feel confident in your tools for measuring.

When something is done, it doesn’t mean that you’ll never need to go back to it, but that going back to it is a new project. It’s done.



5. Build for 10x

Technology choices should strive to be optimal while avoiding over-engineering. When designing systems or evaluating scalability and performance, we aim for today’s decisions to withstand 10x the traffic, data, or scale. Flickr is big and we can’t always anticipate the way a feature of a system will be used, especially as things evolve, but scale has always increased. This realistic horizon helps us balance the need to move quickly with the sometimes-competing need to invest in infrastructure and architecture. It also recognizes that solutions are expected to evolve and be replaced.



6. Appreciate What Came Before

We respect our predecessors and the decisions they made. We can’t always know the context, constraints, or reasons for a decision, so we’ll give them the benefit of the doubt.

We appreciate the value of working systems and the lessons they embody. We understand that many problems are not essentially new.

We learn together from mistakes, and appreciate it as an experience that helps us grow.




Flickr is excited to be joining SmugMug!

We’re looking forward to some interesting and challenging engineering projects in the next year, and would love to have more great people join the team!

We want to talk to people who are interested in working on an inclusive, diverse team, building large-scale systems that are backing a much-loved product.

You can learn more about open positions at:

Read our announcement blog post and our extended Q&A for more details.

~The Flickr Team

Introducing Similarity Search at Flickr

At Flickr, we understand that the value in our image corpus is only unlocked when our members can find photos and photographers that inspire them, so we strive to enable the discovery and appreciation of new photos.

To further that effort, today we are introducing similarity search on Flickr. If you hover over a photo on a search result page, you will reveal a “…” button that exposes a menu that gives you the option to search for photos similar to the photo you are currently viewing.

In many ways, photo search is very different from traditional web or text search. First, the goal of web search is usually to satisfy a particular information need, while with photo search the goal is often one of discovery; as such, it should be delightful as well as functional. We have taken this to heart throughout Flickr. For instance, our color search feature, which allows filtering by color scheme, and our style filters, which allow filtering by styles such as “minimalist” or “patterns,” encourage exploration. Second, in traditional web search, the goal is usually to match documents to a set of keywords in the query. That is, the query is in the same modality—text—as the documents being searched. Photo search usually matches across modalities: text to image. Text querying is a necessary feature of a photo search engine, but, as the saying goes, a picture is worth a thousand words. And beyond saving people the effort of so much typing, many visual concepts genuinely defy accurate description. Now, we’re giving our community a way to easily explore those visual concepts with the “…” button, a feature we call the similarity pivot.

The similarity pivot is a significant addition to the Flickr experience because it offers our community an entirely new way to explore and discover the billions of incredible photos and millions of incredible photographers on Flickr. It allows people to look for images of a particular style, it gives people a view into universal behaviors, and even when it “messes up,” it can force people to look at the unexpected commonalities and oddities of our visual world with a fresh perspective.

What is “similarity”?

To understand how an experience like this is powered, we first need to understand what we mean by “similarity.” There are many ways photos can be similar to one another. Consider some examples.

It is apparent that all of these groups of photos illustrate some notion of “similarity,” but each is different. Roughly, they are: similarity of color, similarity of texture, and similarity of semantic category. And there are many others that you might imagine as well.

What notion of similarity is best suited for a site like Flickr? Ideally, we’d like to be able to capture multiple types of similarity, but we decided early on that semantic similarity—similarity based on the semantic content of the photos—was vital to facilitate discovery on Flickr. This requires a deep understanding of image content for which we employ deep neural networks.

We have been using deep neural networks at Flickr for a while for various tasks such as object recognition, NSFW prediction, and even prediction of aesthetic quality. For these tasks, we train a neural network to map the raw pixels of a photo into a set of relevant tags, as illustrated below.

Internally, the neural network accomplishes this mapping incrementally by applying a series of transformations to the image, which can be thought of as a vector of numbers corresponding to the pixel intensities. Each transformation in the series produces another vector, which is in turn the input to the next transformation, until finally we have a vector that we specifically constrain to be a list of probabilities for each class we are trying to recognize in the image. To be able to go from raw pixels to a semantic label like “hot air balloon,” the network discards lots of information about the image, including information about  appearance, such as the color of the balloon, its relative position in the sky, etc. Instead, we can extract an internal vector in the network before the final output.

For common neural network architectures, this vector—which we call a “feature vector”—has many hundreds or thousands of dimensions. We can’t necessarily say with certainty that any one of these dimensions means something in particular as we could at the final network output, whose dimensions correspond to tag probabilities. But these vectors have an important property: when you compute the Euclidean distance between these vectors, images containing similar content will tend to have feature vectors closer together than images containing dissimilar content. You can think of this as a way that the network has learned to organize information present in the image so that it can output the required class prediction. This is exactly what we are looking for: Euclidian distance in this high-dimensional feature space is a measure of semantic similarity. The graphic below illustrates this idea: points in the neighborhood around the query image are semantically similar to the query image, whereas points in neighborhoods further away are not.

This measure of similarity is not perfect and cannot capture all possible notions of similarity—it will be constrained by the particular task the network was trained to perform, i.e., scene recognition. However, it is effective for our purposes, and, importantly, it contains information beyond merely the semantic content of the image, such as appearance, composition, and texture. Most importantly, it gives us a simple algorithm for finding visually similar photos: compute the distance in the feature space of a query image to each index image and return the images with lowest distance. Of course, there is much more work to do to make this idea work for billions of images.

Large-scale approximate nearest neighbor search

With an index as large as Flickr’s, computing distances exhaustively for each query is intractable. Additionally, storing a high-dimensional floating point feature vector for each of billions of images takes a large amount of disk space and poses even more difficulty if these features need to be in memory for fast ranking. To solve these two issues, we adopt a state-of-the-art approximate nearest neighbor algorithm called Locally Optimized Product Quantization (LOPQ).

To understand LOPQ, it is useful to first look at a simple strategy. Rather than ranking all vectors in the index, we can first filter a set of good candidates and only do expensive distance computations on them. For example, we can use an algorithm like k-means to cluster our index vectors, find the cluster to which each vector is assigned, and index the corresponding cluster id for each vector. At query time, we find the cluster that the query vector is assigned to and fetch the items that belong to the same cluster from the index. We can even expand this set if we like by fetching items from the next nearest cluster.

This idea will take us far, but not far enough for a billions-scale index. For example, with 1 billion photos, we need 1 million clusters so that each cluster contains an average of 1000 photos. At query time, we will have to compute the distance from the query to each of these 1 million cluster centroids in order to find the nearest clusters. This is quite a lot. We can do better, however, if we instead split our vectors in half by dimension and cluster each half separately. In this scheme, each vector will be assigned to a pair of cluster ids, one for each half of the vector. If we choose k = 1000 to cluster both halves, we have k2= 1000 * 1000 = 1e6 possible pairs. In other words, by clustering each half separately and assigning each item a pair of cluster ids, we can get the same granularity of partitioning (1 million clusters total) with only 2 * 1000 distance computations with half the number of dimensions for a total computational savings of 1000x. Conversely, for the same computational cost, we gain a factor of k more partitions of the data space, providing a much finer-grained index.

This idea of splitting vectors into subvectors and clustering each split separately is called product quantization. When we use this idea to index a dataset it is called the inverted multi-index, and it forms the basis for fast candidate retrieval in our similarity index. Typically the distribution of points over the clusters in a multi-index will be unbalanced as compared to a standard k-means index, but this unbalance is a fair trade for the much higher resolution partitioning that it buys us. In fact, a multi-index will only be balanced across clusters if the two halves of the vectors are perfectly statistically independent. This is not the case in most real world data, but some heuristic preprocessing—like PCA-ing and permuting the dimensions so that the cumulative per-dimension variance is approximately balanced between the halves—helps in many cases. And just like the simple k-means index, there is a fast algorithm for finding a ranked list of clusters to a query if we need to expand the candidate set.

After we have a set of candidates, we must rank them. We could store the full vector in the index and use it to compute the distance for each candidate item, but this would incur a large memory overhead (for example, 256 dimensional vectors of 4 byte floats would require 1Tb for 1 billion photos) as well as a computational overhead. LOPQ solves these issues by performing another product quantization, this time on the residuals of the data. The residual of a point is the difference vector between the point and its closest cluster centroid. Given a residual vector and the cluster indexes along with the corresponding centroids, we have enough information to reproduce the original vector exactly. Instead of storing the residuals, LOPQ product quantizes the residuals, usually with a higher number of splits, and stores only the cluster indexes in the index. For example, if we split the vector into 8 splits and each split is clustered with 256 centroids, we can store the compressed vector with only 8 bytes regardless of the number of dimensions to start (though certainly a higher number of dimensions will result in higher approximation error). With this lossy representation we can produce a reconstruction of a vector from the 8 byte codes: we simply take each quantization code, look up the corresponding centroid, and concatenate these 8 centroids together to produce a reconstruction. Likewise, we can approximate the distance from the query to an index vector by computing the distance between the query and the reconstruction. We can do this computation quickly for many candidate points by computing the squared difference of each split of the query to all of the centroids for that split. After computing this table, we can compute the squared difference for an index point by looking up the precomputed squared difference for each of the 8 indexes and summing them together to get the total squared difference. This caching trick allows us to quickly rank many candidates without resorting to distance computations in the original vector space.

LOPQ adds one final detail: for each cluster in the multi-index, LOPQ fits a local rotation to the residuals of the points that fall in that cluster. This rotation is simply a PCA that aligns the major directions of variation in the data to the axes followed by a permutation to heuristically balance the variance across the splits of the product quantization. Note that this is the exact preprocessing step that is usually performed at the top-level multi-index. It tends to make the approximate distance computations more accurate by mitigating errors introduced by assuming that each split of the vector in the production quantization is statistically independent from other splits. Additionally, since a rotation is fit for each cluster, they serve to fit the local data distribution better.

Below is a diagram from the LOPQ paper that illustrates the core ideas of LOPQ. K-means (a) is very effective at allocating cluster centroids, illustrated as red points, that target the distribution of the data, but it has other drawbacks at scale as discussed earlier. In the 2d example shown, we can imagine product quantizing the space with 2 splits, each with 1 dimension. Product Quantization (b) clusters each dimension independently and cluster centroids are specified by pairs of cluster indexes, one for each split. This is effectively a grid over the space. Since the splits are treated as if they were statistically independent, we will, unfortunately, get many clusters that are “wasted” by not targeting the data distribution. We can improve on this situation by rotating the data such that the main dimensions of variation are axis-aligned. This version, called Optimized Product Quantization (c), does a better job of making sure each centroid is useful. LOPQ (d) extends this idea by first coarsely clustering the data and then doing a separate instance of OPQ for each cluster, allowing highly targeted centroids while still reaping the benefits of product quantization in terms of scalability.

LOPQ is state-of-the-art for quantization methods, and you can find more information about the algorithm, as well as benchmarks, here. Additionally, we provide an open-source implementation in Python and Spark which you can apply to your own datasets. The algorithm produces a set of cluster indexes that can be queried efficiently in an inverted index, as described. We have also explored use cases that use these indexes as a hash for fast deduplication of images and large-scale clustering. These extended use cases are studied here.


We have described our system for large-scale visual similarity search at Flickr. Techniques for producing high-quality vector representations for images with deep learning are constantly improving, enabling new ways to search and explore large multimedia collections. These techniques are being applied in other domains as well to, for example, produce vector representations for text, video, and even molecules. Large-scale approximate nearest neighbor search has importance and potential application in these domains as well as many others. Though these techniques are in their infancy, we hope similarity search provides a useful new way to appreciate the amazing collection of images at Flickr and surface photos of interest that may have previously gone undiscovered. We are excited about the future of this technology at Flickr and beyond.


Yannis Kalantidis, Huy Nguyen, Stacey Svetlichnaya, Arel Cordero. Special thanks to the rest of the Computer Vision and Machine Learning team and the Vespa search team who manages Yahoo’s internal search engine.

A Year Without a Byte

One of the largest cost drivers in running a service like Flickr is storage. We’ve described multiple techniques to get this cost down over the years: use of COS, creating sizes dynamically on GPUs and perceptual compression. These projects have been very successful, but our storage cost is still significant.
At the beginning of 2016, we challenged ourselves to go further — to go a full year without needing new storage hardware. Using multiple techniques, we got there.

The Cost Story

A little back-of-the-envelope math shows storage costs are a real concern. On a very high-traffic day, Flickr users upload as many as twenty-five million photos. These photos require an average of 3.25 megabytes of storage each, totalling over 80 terabytes of data. Stored naively in a cloud service similar to S3, this day’s worth of data would cost over $30,000 per year, and continue to incur costs every year.

And a very large service will have over two hundred million active users. At a thousand images each, storage in a service similar to S3 would cost over $250 million per year (or $1.25 / user-year) plus network and other expenses. This compounds as new users sign up and existing users continue to take photos at an accelerating rate. Thankfully, our costs, and every large service’s costs, are different than storing naively at S3, but remain significant.

Cost per byte have decreased, but bytes per image from iPhone-type platforms have increased. Cost per image hasn’t changed significantly.

Storage costs do drop over time. For example, S3 costs dropped from $0.15 per gigabyte month in 2009 to $0.03 per gigabyte-month in 2014, and cloud storage vendors have added low-cost options for data that is infrequently accessed. NAS vendors have also delivered large price reductions.

Unfortunately, these lower costs per byte are counteracted by other forces. On iPhones, increasing camera resolution, burst mode and the addition of short animations (Live Photos) have increased bytes-per-image rapidly enough to keep storage cost per image roughly constant. And iPhone images are far from the largest.

In response to these costs, photo storage services have pursued a variety of product options. To name a few: storing lower quality images or re-compressing, charging users for their data usage, incorporating advertising, selling associated products such as prints, and tying storage to purchases of handsets.

There are also a number of engineering approaches to controlling storage costs. We sketched out a few and cover three that we implemented below: adjusting thresholds on our storage systems, rolling out existing savings approaches to more images, and deploying lossless JPG compression.

Adjusting Storage Thresholds

As we dug into the problem, we looked at our storage systems in detail. We discovered that our settings were based on assumptions about high write and delete loads that didn’t hold. Our storage is pretty static. Users only rarely delete or change images once uploaded. We also had two distinct areas of just-in-case space. 5% of our storage was reserved space for snapshots, useful for undoing accidental deletes or writes, and 8.5% was held free in reserve. This resulted in about 13% of our storage going unused. Trade lore states that disks should remain 10% free to avoid performance degradation, but we found 5% to be sufficient for our workload. So we combined our our two just-in-case areas into one and reduced our free space threshold to that level. This was our simplest approach to the problem (by far), but it resulted in a large gain. With a couple simple configuration changes, we freed up more than 8% of our storage.

Adjusting storage thresholds

Extending Existing Approaches

In our earlier posts, we have described dynamic generation of thumbnail sizes and perceptual compression. Combining the two approaches decreased thumbnail storage requirements by 65%, though we hadn’t applied these techniques to many of our images uploaded prior to 2014. One big reason for this: large-scale changes to older files are inherently risky, and require significant time and engineering work to do safely.

Because we were concerned that further rollout of dynamic thumbnail generation would place a heavy load on our resizing infrastructure, we targeted only thumbnails from less-popular images for deletes. Using this approach, we were able to handle our complete resize load with just four GPUs. The process put a heavy load on our storage systems; to minimize the impact we randomized our operations across volumes. The entire process took about four months, resulting in even more significant gains than our storage threshold adjustments.

Decreasing the number of thumbnail sizes

Lossless JPG Compression

Flickr has had a long-standing commitment to keeping uploaded images byte-for-byte intact. This has placed a floor on how much storage reduction we can do, but there are tools that can losslessly compress JPG images. Two well-known options are PackJPG and Lepton, from Dropbox. These tools work by decoding the JPG, then very carefully compressing it using a more efficient approach. This typically shrinks a JPG by about 22%. At Flickr’s scale, this is significant. The downside is that these re-compressors use a lot of CPU. PackJPG compresses at about 2MB/s on a single core, or about fifteen core-years for a single petabyte worth of JPGs. Lepton uses multiple cores and, at 15MB/s, is much faster than packJPG, but uses roughly the same amount of CPU time.

This CPU requirement also complicated on-demand serving. If we recompressed all the images on Flickr, we would need potentially thousands of cores to handle our decompress load. We considered putting some restrictions on access to compressed images, such as requiring users to login to access original images, but ultimately found that if we targeted only rarely accessed private images, decompressions would occur only infrequently. Additionally, restricting the maximum size of images we compressed limited our CPU time per decompress. We rolled this out as a component of our existing serving stack without requiring any additional CPUs, and with only minor impact to user experience.

Running our users’ original photos through lossless compression was probably our highest-risk approach. We can recreate thumbnails easily, but a corrupted source image cannot be recovered. Key to our approach was a re-compress-decompress-verify strategy: every recompressed image was decompressed and compared to its source before removing the uncompressed source image.

This is still a work-in-progress. We have compressed many images but to do our entire corpus is a lengthy process, and we had reached our zero-new-storage-gear goal by mid-year.

On The Drawing Board

We have several other ideas which we’ve investigated but haven’t implemented yet.

In our current storage model, we have originals and thumbnails available for every image, each stored in two datacenters. This model assumes that the images need to be viewable relatively quickly at any point in time. But private images belonging to accounts that have been inactive for more than a few months are unlikely to be accessed. We could “freeze” these images, dropping their thumbnails and recreate them when the dormant user returns. This “thaw” process would take under thirty seconds for a typical account. Additionally, for photos that are private (but not dormant), we could go to a single uncompressed copy of each thumbnail, storing a compressed copy in a second datacenter that would be decompressed as needed.

We might not even need two copies of each dormant original image available on disk. We’ve pencilled out a model where we place one copy on a slower, but underutilized, tape-based system while leaving the other on disk. This would decrease availability during an outage, but as these images belong to dormant users, the effect would be minimal and users would still see their thumbnails. The delicate piece here is the placement of data, as seeks on tape systems are prohibitively slow. Depending on the details of what constitutes a “dormant” photo these techniques could comfortably reduce storage used by over 25%.

We’ve also looked into de-duplication, but we found our duplicate rate is in the 3% range. Users do have many duplicates of their own images on their devices, but these are excluded by our upload tools.  We’ve also looked into using alternate image formats for our thumbnail storage.    WebP can be much more compact than ordinary JPG but our use of perceptual compression gets us close to WebP byte size and permits much faster resize.  The BPG project proposes a dramatically smaller, H.265 based encoding but has IP and other issues.

There are several similar optimizations available for videos. Although Flickr is primarily image-focused, videos are typically much larger than images and consume considerably more storage.


Optimization over several releases

Since 2013 we’ve optimized our usage of storage by nearly 50%.  Our latest efforts helped us get through 2016 without purchasing any additional storage,  and we still have a few more options available.

Peter Norby, Teja Komma, Shijo Joy and Bei Wu formed the core team for our zero-storage-budget project. Many others assisted the effort.

Personalized Group Recommendations on Flickr

There are two primary paradigms for the discovery of digital content. First is the search paradigm, in which the user is actively looking for specific content using search terms and filters (e.g., Google web search, Flickr image search, Yelp restaurant search, etc.). Second is a passive approach, in which the user browses content presented to them (e.g., NYTimes news, Flickr Explore, and Twitter trending topics). Personalization benefits both approaches by providing relevant content that is tailored to users’ tastes (e.g., Google News, Netflix homepage, LinkedIn job search, etc.). We believe personalization can improve the user experience at Flickr by guiding both new as well as more experienced members as they explore photography. Today, we’re excited to bring you personalized group recommendations.

Flickr Groups are great for bringing people together around a common theme, be it a style of photography, camera, place, event, topic, or just some fun. Community members join for several reasons—to consume photos, to get feedback, to play games, to get more views, or to start a discussion about photos, cameras, life or the universe. We see value in connecting people with appropriate groups based on their interests. Hence, we decided to start the personalization journey by providing contextually relevant and personalized content that is tuned to each person’s unique taste.

Of course, in order to respect users’ privacy, group recommendations only consider public photos and public groups. Additionally, recommendations are private to the user. In other words, nobody else sees what is recommended to an individual.

In this post we describe how we are improving Flickr’s group recommendations. In particular, we describe how we are replacing a curated, non-personalized, static list of groups with a dynamic group recommendation engine that automatically generates new results based on user interactions to provide personalized recommendations unique to each person. The algorithms and backend systems we are building are broad and applicable to other scenarios, such as photo recommendations, contact recommendations, content discovery, etc.


Figure: Personalized group recommendations


One challenge of recommendations is determining a user’s interests. These interests could be user-specified, explicit preferences or could be inferred implicitly from their actions, supported by user feedback. For example:

  • Explicit:
    • Ask users what topics interest them
    • Ask users why they joined a particular group
  • Implicit:
    • Infer user tastes from groups they join, photos they like, and users they follow
    • Infer why users joined a particular group based on their activity, interactions, and dwell time
  • Feedback:
    • Get feedback on recommended items when users perform actions such as “Join” or “Follow” or click “Not interested”

Another challenge of recommendations is figuring out group characteristics. I.e.: what type of group is it? What interests does it serve? What brings Flickr members to this group? We can infer this by analyzing group members, photos posted to the group, discussions and amount of activity in the group.

Once we have figured out user preferences and group characteristics, recommendations essentially becomes a matchmaking process. At a high-level, we want to support 3 use cases:

  • Use Case # 1: Given a group, return all groups that are “similar”
  • Use Case # 2: Given a user, return a list of recommended groups
  • Use Case # 3: Given a photo, return a list of groups that the photo could belong to

Collaborative Filtering

One approach to recommender systems is presenting similar content in the current context of actions. For example, Amazon’s “Customers who bought this item also bought” or LinkedIn’s “People also viewed.” Item-based collaborative filtering can be used for computing similar items.


Figure: Collaborative filtering in action

By Moshanin (Own work) [CC BY-SA 3.0] from Wikipedia

Intuitively, two groups are similar if they have the same content or same set of users. We observed that users often post the same photo to multiple groups. So, to begin, we compute group similarity based on a photo’s presence in multiple groups.  

Consider the following sample matrix M(Gi -> Pj) constructed from group photo pools, where 1 means a corresponding group (Gi) contains an image, and empty (0) means a group does not contain the image.


From this, we can compute M.M’ (M’s transpose), which gives us the number of common photos between every pair of groups (Gi, Gj):


We use modified cosine similarity to compute a similarity score between every pair of groups:


To make this calculation robust, we only consider groups that have a minimum of X photos and keep only strong relationships (i.e., groups that have at least Y common photos). Finally, we use the similarity scores to come up with the top k-nearest neighbors for each group.

We also compute group similarity based on group membership —i.e., by defining group-user relationship (Gi -> Uj) matrix. It is interesting to note that the results obtained from this relationship are very different compared to (Gi, Pj) matrix. The group-photo relationship tends to capture groups that are similar by content (e.g.,“macro photography”). On the other hand, the group-user relationship gives us groups that the same users have joined but are possibly about very different topics, thus providing us with a diversity of results. We can extend this approach by computing group similarity using other features and relationships (e.g., autotags of photos to cluster groups by themes, geotags of photos to cluster groups by place, frequency of discussion to cluster groups by interaction model, etc.).

Using this we can easily come up with a list of similar groups (Use Case # 1). We can either merge the results obtained by different similarity relationships into a single result set, or keep them separate to power features like “Other groups similar to this group” and “People who joined this group also joined.”

We can also use the same data for recommending groups to users (Use Case # 2). We can look at all the groups that the user has already joined and recommend groups similar to those.

To come up with a list of relevant groups for a photo (Use Case # 3), we can compute photo similarity either by using a similar approach as above or by using Flickr computer vision models for finding photos similar to the query photo. A simple approach would then be to recommend groups that these similar photos belong to.


Due to the massive scale (millions of users x 100k groups) of data, we used Yahoo’s Hadoop Stack to implement the collaborative filtering algorithm. We exploited sparsity of entity-item relationship matrices to come up with a more efficient model of computation and used several optimizations for computational efficiency. We only need to compute the similarity model once every 7 days, since signals change slowly.


Figure: Computational architecture

(All logos and icons are trademarks of respective entities)


Similarity scores and top k-nearest neighbors for each group are published to Redis for quick lookups needed by the serving layer. Recommendations for each user are computed in real-time when the user visits the groups page. Implementation of the serving layer takes care of a few aspects that are important from usability and performance point-of-view:

  • Freshness of results: Users hate to see the same results being offered even though they might be relevant. We have implemented a randomization scheme that returns fresh results every X hours, while making sure that results stay static over a user’s single session.
  • Diversity of results: Diversity of results in recommendations is very important since a user might not want to join a group that is very similar to a group he’s already involved in. We require a good threshold that balances similarity and diversity. To improve diversity further, we combine recommendations from different algorithms. We also cluster the user’s groups into diverse sets before computing recommendations.
  • Dynamic results: Users expect their interactions to have a quick effect on recommendations. We thus incorporate user interactions while making subsequent recommendations so that the system feels dynamic.
  • Performance: Recommendation results are cached so that API response is quick on subsequent visits.

Cold Start

The drawback to collaborative filtering is that it cannot offer recommendations to new users who do not have any associations. For these users, we plan to recommend groups from an algorithmically computed list of top/trending groups alongside manual curation. As users interact with the system by joining groups, the recommendations become more personalized.

Measuring Effectiveness

We use qualitative feedback from user studies and alpha group testing to understand user expectation and to guide initial feature design. However, for continued algorithmic improvements, we need an objective quantitative metric. Recommendation results by their very nature are subjective, so measuring effectiveness is tricky. The usual approach taken is to roll out to a random population of users and measure the outcome of interest for the test group as compared to the control group (ref: A/B testing).

We plan to employ this technique and measure user interaction and engagement to keep improving the recommendation algorithms. Additionally, we plan to measure explicit signals such as when users click “Not interested.” This feedback will also be used to fine-tune future recommendations for users.


Figure: Measuring user engagement

Future Directions

While we’re seeing good initial results, we’d like to continue improving the algorithms to provide better results to the Flickr community. Potential future directions can be classified broadly into 3 buckets: algorithmic improvements, new product use cases, and new recommendation applications.

If you’d like to help, we’re hiring. Check out our jobs page and get in touch.

Product Engineering: Mehul Patel, Chenfan (Frank) Sun,  Chinmay Kini

We Want You… and Your Teammates

14493569810_7ac064e3c4_oWe’re hiring here at Flickr and we got pretty excited the other week when we saw Stripe’s post: BYOT (Bring Your Own Team). The sum of the parts is greater than the whole and all that. Genius <big hat tip to them>.

In case you didn’t read Stripe’s post, here’s the gist: you’re a team player, you like to make an impact, focus on a tough problem, set a challenging goal, and see the fruits of your labor after blood, sweat, and tears (or, maybe just brainpower). But you’ve got the itch to collaborate, to talk an idea through, break it down, and parallelize tasks or simply to be around your mates through work and play. Turns out you already have your go-to group of colleagues, roommates, siblings, or buddies that push, inspire, and get the best out of you. Well, in that case we may want to hire all of you!

Like Stripe, we understand the importance of team dynamics. So if you’ve already got something good going on, we want in on it too. We love Stripe and are stoked for this initiative of theirs, but if Flickr tickles your fancy (and it does ours :) consider bringing that team of yours this way too, especially if you’ve got a penchant for mobile development. We’d love to chat!

Email us: jobs at

Team crop

Photos by: @Chris Martin and @Captain Eric Willis

Introducing yakbak: Record and playback HTTP interactions in NodeJS

Did you know that the new Front End of is one big Flickr API client? Writing a client for an existing API or service can be a lot of fun, but decoupling and testing that client can be quite tricky. There are many different approaches to taking the backing service out of the equation when it comes to writing tests for client code. Today we’ll discuss the pros and cons of some of these approaches, describe how the Flickr Front End team tests service-dependent libraries, and introduce you to our new NodeJS HTTP playback module: yakbak!

Scenario: Testing a Flickr API Client

Let’s jump into some code, shall we? Suppose we’re testing a (very, very simple) photo search API client:

Currently, this code will make an HTTP request to the Flickr API on every test run. This is less than desirable for several reasons:

  • UGC is unpredictable. In this test, we’re asserting that the response code is an HTTP 200, but obviously our client code needs to provide the response data to be useful. It’s impossible to write a meaningful and predictable test against live content.
  • Traffic is unpredictable. This photos search API call usually takes ~150ms for simple queries, but a more complex query or a call during peak traffic may take longer.
  • Downtime is unpredictable. Every service has downtime (the term is “four nines,” not “one hundred percent” for a reason), and if your service is down, your client tests will fail.
  • Networks are unpredictable. Have you ever tried coding on a plane? Enough said.

We want our test suite to be consistent, predictable, and fast. We’re also only trying to test our client code, not the API. Let’s take a look at some ways to replace the API with a control, allowing us to predictably test the client code.

Approach 1: Stub the HTTP client methods

We’re using superagent as our HTTP client, so we could use a mocking library like sinon to stub out superagent’s Request methods:

With these changes, we never actually make an HTTP request to the API during a test run. Now our test is predictable, controlled, and it runs crazy fast. However, this approach has some major drawbacks:

  • Tightly coupled with superagent. We’re all up in the client’s implementation details here, so if superagent ever changes their API, we’ll need to correct our tests to match. Likewise, if we ever want to use a different HTTP client, we’ll need to correct our tests as well.
  • Difficult to specify the full HTTP response. Here we’re only specifying the statusCode; what about when we need to specify the body or the headers? Talk about verbose.
  • Not necessarily accurate. We’re trusting the test author to provide a fake response that matches what the actual server would send back. What happens if the API changes the response schema? Some unhappy developer will have to manually update the tests to match reality (probably an intern, let’s be honest).

We’ve at least managed to replace the service with a control in our tests, but we can do (slightly) better.

Approach 2: Mock the NodeJS HTTP module

Every NodeJS HTTP client will eventually delegate to the standard NodeJS http module to perform the network request. This means we can intercept the request at a low level by using a tool like nock:

Great! We’re no longer stubbing out superagent and we can still control the HTTP response. This avoids the HTTP client coupling from the previous step, but still has many similar drawbacks:

  • We’re still completely implementation-dependent. If we want to pass a new query string parameter to our service, for example, we’ll also need to add it to the test so that nock will match the request.
  • It’s still laborious to specify the response headers, body, etc.
  • It’s still difficult to make sure the response body always matches reality.

At this point, it’s worth noting that none of these bullet points were an issue back when we were actually making the HTTP request. So, let’s do exactly that (once!).

Approach 3: Record and playback the HTTP interaction

The Ruby community created the excellent VCR gem for recording and replaying HTTP interactions during tests. Recorded HTTP requests exist as “tapes”, which are just files with some sort of format describing the interaction. The basic workflow goes like this:

  1. The client makes an actual HTTP request.
  2. VCR sits in front of the system’s HTTP library and intercepts the request.
  3. If VCR has a tape matching the request, it simply replays the response to the client.
  4. Otherwise, VCR lets the HTTP request through to the service, records the interaction to a new tape on disk and plays it back to the client.

Introducing yakbak

Today we’re open-sourcing yakbak, our take on recording and playing back HTTP interactions in NodeJS. Here’s what our tests look like with a yakbak proxy:

Here we’ve created a standard NodeJS http.Server with our proxy middleware. We’ve also configured our client to point to the proxy server instead of the origin service. Look, no implementation details!

yakbak tries to do things The Node Way™ wherever possible. For example, each yakbak “tape” is actually its own module that simply exports an http.Server handler, which allows us to do some really cool things. For example, it’s trivial to create a server that always responds a certain way. Since the tape’s hash is based solely on the incoming request, we can easily edit the response however we like. We’re also kicking around a handful of enhancements that should make yakbak an even more powerful development tool.

Thanks to yakbak, we’ve been writing fast, consistent, and reliable tests for our HTTP clients and applications. Want to give it a spin? Check it out today:

P.S. We’re hiring!

Do you love development tooling and helping keep teams on the latest and greatest technology? Or maybe you just want to help build the best home for your photos on the entire internet? We’re hiring Front End Ops and tons of other great positions. We’d love to hear from you!

Our Justified Layout Goes Open Source

We introduced the justified layout on late in 2011. Our community of photographers loved it for its ability to efficiently display many photos at their native aspect ratio with visually pleasing, consistent whitespace, so we quickly added it to the rest of the website.

Justified Example

It’s been through many iterations and optimizations. From back when we were primarily on the PHP stack to our lovely new JavaScript based isomorphic stack. Last year Eric Socolofsky did a great job explaining how the algorithm works and how it fits into a larger infrastructure for Flickr specifically.

In the years following its launch, we’ve had requests from our front end colleagues in other teams across Yahoo for a reusable package that does photo (or any rectangle) presentation like this, but it’s always been too tightly coupled to our stack to separate it out and hand it over. Until now! Today we’re publishing the justified-layout algorithm wrapped in an npm module for you to use on the server, or client, in your own projects.


npm install justified-layout --save

Or grab it directly from Github.

Using it

It’s really easy to use. No configuration is required. Just pass in an array of aspect ratios representing the photos/boxes you’d like to lay out:

var layoutGeometry = require('justified-layout')([1.33, 1, 0.65] [, config]);

If you only have dimensions and don’t want an extra step to convert them to aspect ratios, you can pass in an array of widths and heights like this:

What it returns

The geometry data for the layout items, in the same order they’re passed in.

This is the extent of what the module provides. There’s no rendering component. It’s up to you to use this data to render boxes the way you want. Use absolute positioning, background positions, canvas, generate a static image on the backend, whatever you like! There’s a very basic implementation used on the demo and docs page.


It’s highly likely the defaults don’t satisfy your requirements; they don’t even satisfy ours. There’s a full set of configuration options to customize the output just the way you want. My favorite is the fullWidthBreakoutRowCadence option that we use on album pages. All config options are documented on the docs and demo page.


  • Latest Chrome
  • Latest Safari
  • Latest Firefox
  • Latest Mobile Safari
  • IE 9+
  • Node 0.10+

The future

The justified layout algorithm is just one part of our photo list infrastructure. Following this, we’ll be open sourcing more modules for handling data, handling state, reverse layouts, appending and prepending items for pagination.

We welcome your feedback, issues and contributions on Github.

P.S. Open Source at Flickr

This is the first of quite a bit of code we have in the works for open source release. If working on open source projects appeals to you, we’re hiring!