Counting & Timing

Here at Flickr, we’re pretty nerdy. We like to measure stuff. We love measuring stuff. The more stuff we can measure, the better we understand how the different parts of the website work with each other. There are two types of measurement we especially like to do – counting and timing. These exciting activities help us to know what is happening when things break: if a page is taking a long time to load, where is that time being spent, and which tasks have we started doing more of?

Counting

Our best friend, when it comes to stats collection and display, is RRDTool, the Round-Robin Database. RRD is a magical database that uses a fixed amount of space for storing stats over time. It does this by storing data at progressively lower resolution as time goes by – high-resolution per-minute data for the last few hours, down to low-resolution per-day data for last year (the resolutions are actually configurable). Since you generally care less about detailed data from the past, this works really well. RRD typically works in two modes – you feed it a number every so often, and that number can either be a gauge (such as the current temperature) or a counter (such as the number of packets sent through a switch port). Gauges are easy; counters are trickier, but RRD is clever enough to work out how a counter changes over time. It then lets you graph these data points in interesting ways.
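
For the curious, here’s roughly what that looks like through the RRDs Perl bindings that ship with RRDTool – a minimal sketch, with the file name, data source and archive sizes invented for illustration:

	use RRDs;

	# One counter-type data source, sampled every 60s, kept at three resolutions.
	RRDs::create('cache_connects.rrd',
		'--step=60',
		'DS:connects:COUNTER:120:0:U',    # heartbeat 120s, no fixed maximum
		'RRA:AVERAGE:0.5:1:1440',         # per-minute averages for a day
		'RRA:AVERAGE:0.5:60:720',         # hourly averages for a month
		'RRA:AVERAGE:0.5:1440:365',       # daily averages for a year
	);
	die 'RRD create failed: ' . RRDs::error if RRDs::error;

	# Later, feed it the current value of a counter; RRD works the rate out itself.
	my $total_connects = 4242;            # e.g. read from wherever the running total lives
	RRDs::update('cache_connects.rrd', "N:$total_connects");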

Counter - Basic RRD

The problem with RRD is that you often want to count some kind of event that doesn’t have a natural counter associated with it. For instance, we might want to count how many times we connect to a certain cache server from a particular piece of code, so that we can graph it. We know when we connect to it in the code, but there’s no global number tracking that count. We need some central counter which we increment each time we connect, which we can then periodically feed into RRD. This quickly gets further complicated when the code runs on multiple machines – every one of Flickr’s web servers will run this code, but we care about how many times they connect to the cache server in total.

The easiest way to do this is to use a database: when we connect to the cache server, we also connect to the database and increment a counter. Some other job can then periodically connect to the database, read the counter and store it in RRD. Simple! However, connecting to the database and performing the write does not have zero cost. It takes some amount of time which may appear trivial at first, but doing it several hundred times a second (lots of the operations we want to track happen very often), multiplied by hundreds of different things we want to track, quickly adds up. It’s a little bit quantum theory-ish – the act of observing changes the observed reality. When we’re counting very fast operations, we can’t spend much time on the counting.
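
For concreteness, the naive version might look something like this (just a sketch – the DSN, table and counter name are invented):

	use DBI;

	# Every time we talk to the cache server, we also pay for a database round trip.
	my $dbh = DBI->connect('dbi:mysql:database=stats;host=statsdb', 'stats', 'secret')
		or die $DBI::errstr;
	$dbh->do('UPDATE counters SET value = value + 1 WHERE name = ?',
		undef, 'cache_connect');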

Luckily there’s an operation which is cheap enough to perform very often while allowing us to easily collect data from multiple servers at once: UDP. The User Datagram Protocol is TCP’s ugly kid brother. Unlike TCP, UDP does not guarantee delivery of packets, nor the order in which they’ll be received. In return, packets are a lot faster to send and use less bandwidth. When counting frequent stats, losing the odd packet doesn’t really hurt us, especially when we can also graph when we’ve been dropping packets (under Linux, try netstat -su and look for “packet receive errors”). Increasing the UDP buffer size (the kernel variable net.core.rmem_max) lets you queue more packets before processing them, if you find you can’t keep up.

In our client code, this is very easy to do. Every language has a different approach (of course!), but they’re all straightforward.

Perl

	my $sock = new IO::Socket::INET(
			PeerAddr	=> $server,
			PeerPort	=> $port,
			Proto		=> 'udp',
		) or die("Could not connect: $!");

	print $sock "$value\n";

	close $sock;

PHP

	$fp = fsockopen("udp://$server", $port, $errno, $errstr);
	if ($fp){
		fwrite($fp, "$value\n");
		fclose($fp);
	}

The trickier part is the server component. We need something that will receive packets, increment counters and periodically save the counter to RRD files. At Flickr we developed a small Perl daemon to do this, using an alarm to save the counters to RRD. The UDP packet simply contains the name of the counter to increment. We can then start to track interesting operations over time:
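
A stripped-down sketch of that idea looks roughly like the following – this is not the actual daemon (which is linked at the end of this post); the port number and file layout are invented, and each counter name is assumed to have a matching, pre-created RRD file:

	#!/usr/bin/perl
	# Sketch of a UDP counter daemon: not the real thing, just the shape of it.
	# Assumes each counter name has a matching RRD file, e.g. cache_connect.rrd.
	use strict;
	use warnings;
	use IO::Socket::INET;
	use RRDs;

	my $period = 60;                          # how often we flush to RRD
	my %counts;                               # counter name => running total

	my $sock = IO::Socket::INET->new(
		LocalPort => 9876,                    # invented port
		Proto     => 'udp',
	) or die "Could not bind: $!";

	$SIG{ALRM} = sub {
		# Feed the running totals to RRD; a COUNTER data source means
		# RRD turns the ever-growing totals into rates for us.
		for my $name (keys %counts) {
			RRDs::update("$name.rrd", "N:$counts{$name}");
			warn RRDs::error if RRDs::error;
		}
		alarm $period;
	};
	alarm $period;

	while (1) {
		my $packet;
		next unless defined $sock->recv($packet, 1024);   # interrupted by ALRM
		chomp $packet;
		$counts{$packet}++ if $packet =~ /^[\w:.-]+$/;    # packet is just the counter name
	}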

Counter - Simple

The next stage was to change our simple counters into ones which recorded state. We’re performing a certain operation 300 times a second, but how often is it failing? RRD allows you to store multiple counters in a single file and then graph them together in a variety of ways, so all we needed was a little modification to our collection daemon.
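
In RRD terms that just means creating the file with two data sources and feeding both values in a single update – again a sketch with invented names:

	use RRDs;

	# One file, two counters: successes and failures, graphed together later.
	RRDs::create('cache_connect_status.rrd',
		'--step=60',
		'DS:ok:COUNTER:120:0:U',
		'DS:fail:COUNTER:120:0:U',
		'RRA:AVERAGE:0.5:1:1440',
	);

	# In the daemon's flush handler, update both values in one go.
	my ($ok, $fail) = (29_950, 50);
	RRDs::update('cache_connect_status.rrd', "N:$ok:$fail");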

Counter - Failures

The red area on the left shows some failures. Some tasks have relatively few failures which are nevertheless still important to see. That’s easy enough too – we just produce two graphs with different Y-axis scales (something that RRD does auto-magically for us).

Counter - Small Failures

Timing

Counting the number of times we perform a certain task can tell us a lot, but often we’ll want to know how long that task took. The simplest way to do this is to perform a task periodically and graph the result. This is pretty simple and something that we use Ganglia for to good effect.

Timings - Simple

The problem with this is that it just tells us the time it took to perform our single test task. This is useful for looking at macro problems, but is useless against tasks that can take a variable amount of time. If a database server is processing one in ten tasks slowly, then that will appear on this graph as a spike (roughly) every ten samples – it’s impossible for us to tell the difference between a server that processes 10% of requests slowly and a server that has periodic slow spells. We could record an average time (by storing an accumulator along with the counter and dividing at the end of each period), and this gets us closer to useful data.

What we really need to know, however, is something like our standard deviation – both what our average is and where the bulk of our samples lie. This allows us to distinguish between a request that always takes 20ms and one that takes 10ms half of the time and 30ms half of the time. To achieve this, we changed our collection daemon to keep a list of timings. At the end of each sampling period (somewhere between 10s and 60s, depending on the thing we’re monitoring), we then sort these times, find the mean and the first and third quartiles. By storing each of these values (along with a min and max) into RRD every period, we can show the average timing along with the tolerance.
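
The per-period reduction itself is only a few lines – something like this sketch, which takes one period’s timings in milliseconds and returns the values we store:

	use strict;
	use warnings;
	use List::Util qw(sum min max);

	# Reduce one sampling period's timings to min, Q1, mean, Q3 and max.
	sub summarize {
		my @t    = sort { $a <=> $b } @_;
		my $mean = sum(@t) / @t;
		my $q1   = $t[ int( $#t * 0.25 ) ];   # crude quartile picks; fine for
		my $q3   = $t[ int( $#t * 0.75 ) ];   # the sample sizes we deal with
		return ( min(@t), $q1, $mean, $q3, max(@t) );
	}

	printf "min %dms, q1 %dms, mean %dms, q3 %dms, max %dms\n",
		summarize(190, 220, 250, 260, 340, 900);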

Timings - Quartiles

So our task takes 250ms on average, with 75% finishing within ~340ms and 25% finishing within ~190ms. The light green shaded area shows the lowest 25%. We don’t shade the uppermost 25%, since random spikes can make the Y-axis so large that the graph becomes difficult to read. We don’t have this problem with the lowest quartile, since nothing can be faster than zero (and we always show zero on the Y-axis to avoid scale bias).

Bringing it all together

The next step is to tie the counting together with the timing, so we can see how the rate of an operation affected the time it took to perform. By simply lining up graphs below each other, we can easily see relationships between these different measures:

Timings - Correlate Simple

Of course, we like to go a little measurement-crazy, so we tend to sample as many things as we can to look for patterns. This graph from our Offline Task System shows how often we run a particular task, how long it took to execute, and also how long it was queued before we ran it.

Timings - Correlate

The code for our timing daemon is available in the Flickr Code repository and through our trac install, so you can have a play with it yourself. It just requires RRDTool and some patience. Get graphing!

Kitten Tuesday, ASCII Edition

Flickr Asciified - Nihilogic

Jason over at Nihilogic has a post up called Getting your ASCII on with Flickr:

“You simply enter the search query of your choice and click the “Asciify” button. A request is then sent off to the Flickr server and any photos that match the query are returned. The image data is analyzed and turned into ASCII characters.”

Obviously I made a Kitten and you too can get flickrasciified over here.

Original Photo by pontman.

A Little About The Flickr Bike


A sum of its parts @ Yahoo! Video (for ppl using RSS readers)

Shanan’s been driving me insane riding the Flickr Bike around the office. If you don’t know about the Flickr Bike: in a nutshell, we have 20 purple bikes tricked out with Nokia N95s, geotagging photos and uploading them to Flickr (of course) as people cycle around. There’s a video (Flickr’s handlebar cameras) over on cnet, and a frankly awesome write-up on Lifehacker.

This being the Dev Blog, we’re interested in looking under the hood, to use an awful metaphor, and finding out how things work. Fortunately for us, Josh wrote up a blog post last week, including links to source code and all that good stuff, meaning I can simply point to that. Huzzah for the internet!

Read techy stuff over at ~~> Coding a Networked Bike

If you’re more of a moving-pictures-with-sounds than words person, then Lifehacker has 6 videos up that are worth a watch: The Making of the Flickr Bikes. The second of these is the one I included at the start of this blog post.

Oh and fyi, they’re Electra bikes painted #7B0099 (Pantone 2602 C).

5 Questions for Jim Bumgardner

We’ve been keeping a careful eye on our Sister Blog to see what they’re up to. Something that’s particularly caught our eye is "5 Questions", which asks the same 5 questions of people around the Flickrverse, with the last question being who we should ask next. And so, we hope, it goes on and on.

Banana-bub

This is our version, asking questions of those that develop, hack and fiddle with Flickr in new and interesting ways. Of course we couldn’t start with anyone else but KrazyDad (aka Jim Bumgardner).

Jim founded the Flickr Hacks group back in the day, a great place to hang out and ask questions if you want to learn how to bend Flickr to your will. In 2006 he also co-authored the Flickr Hacks book for O’Reilly, and happily for us he hasn’t stopped tinkering with Flickr yet.

So, without any further ado, 5 Questions for Jim Bumgardner:

1. What are you currently building that integrates with Flickr, or a past favorite that you think is cool, neat, popular and worth telling folks about? Or both.

Jim: It seems like I’m always building something that integrates Flickr. A recent favorite is this interactive mosaic that shows the most interesting photos of the week.

The photos are arranged to form a spiral, a form that appears quite frequently in my work.

www.coverpop.com/pop/flickr_interesting
Coverpop: Most Interesting Photos of the Week

I have a "cron job" which runs on one of my computers at home, which updates this mosaic every week, so the photos in it are always fresh. Incidentally, I prefer to call this process a "cron joy." Oh, nerd humor…

2. What are the best tricks or tips you’ve learned working with the Flickr API?

Jim: I think every Flickr hackr should have access to a powerful high-level graphics library. My library of choice is ImageMagick combined with the Perl programming language (it also works nicely with Ruby), but the GD library, which works with various languages, and PIL, for Python, are also good.

I not only use ImageMagick for building mosaics and graphs, but also for "under the hood" kinds of things, like measuring the average colors of photos for the Colr Pickr (see below).
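
(For the curious: measuring an average colour with Perl and ImageMagick can be as simple as squashing the photo down to a single pixel. A rough sketch, with an invented file name:)

	use Image::Magick;

	# Resizing to 1x1 averages every pixel into one; then read that pixel back out.
	my $img = Image::Magick->new;
	my $err = $img->Read('photo.jpg');
	die "$err" if "$err";
	$img->Resize(geometry => '1x1!');
	my ($r, $g, $b) = $img->GetPixel(x => 0, y => 0);   # normalised 0..1 values
	printf "average colour: #%02x%02x%02x\n", map { int($_ * 255 + 0.5) } $r, $g, $b;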

3. As a Flickr developer what would you like to see Flickr do more of and why?

Jim: One of the very first Flickr hacks I made was the Colr Pickr

www.krazydad.com/colrpickr/

…which allows photos to be selected by color. Since that appeared, I’ve worked on, and seen, some fancier variations on the concept that allow larger quantities of Flickr photos to be selected using multiple colors. But all these systems require that thousands or even millions of thumbnails be downloaded and analyzed for color. This is because Flickr does not supply "average color" information in its APIs, and so cannot provide the color search functionality that this data would enable.

flickr Colr Pickr

I would like to see Flickr provide, via its APIs, the three most common colors in each photo (using cluster analysis), and provide a way to search for photos which match one, two, or three colors. These parameters, similar to geocode searches, would need to be combined with some other search parameters, such as tags, to narrow the field down.

A feature like this would be a godsend to designers. I’ve got sample code for the color analysis, if anyone’s interested… :)

4. What excites you about Flickr and hacking? What do you think you’ll build next or would like someone else to build so you don’t have to?

Jim: One thing that excites me is the ability to access large quantities of photos that contain valuable metadata, such as the time the photo was taken, or the geocoded location. I used the ‘date taken’ data to construct this very cool graph of sunsets:

A year of sunsets

While most digital cameras store the time within photos these days, not enough of them automatically store the location. We have to rely on photographers adding the geocoded information manually, and sadly, not enough of them are geeky enough to do it. I’m looking forward to the day, a few years from now, when most of the new photos on Flickr will also contain geocoded information, as this will enable me to make apps which enable a kind of instant photo-journalism of heavily photographed events, such as rallies and parades. We’re seeing the beginnings of these kinds of apps now, but we’re barely scratching the surface.

5. Besides your own, what Flickr projects and hacks do you use on a regular basis? Who should we interview next?

mc-50 map of FlickrLand: flickr's social network

Jim: GustavoG has made some amazing graphs which exploit and illustrate the Flickr social network.

Dan: Thank you, Jim. Next up for our thrilling installment of 5 Questions, GustavoG.

Images from krazydad / jbum, earthhopper and GustavoG.

Flickr Digs Ganglia

A number of people have asked us: where did the pretty graphs from last week’s post come from?

My Day

The answer: Ganglia.

It takes a lot of hardware and software to make a site like Flickr run smoothly. The operations team is responsible for scaling up our monitoring platform to collect all of the metrics we need to make informed decisions about where and when to add new hosts and how urgently, understand how different types of hardware perform with similar real life workloads, determine bottlenecks, and more. Flickr uses Ganglia to collect, aggregate, and visualize those metrics.

So, what is Ganglia? Briefly, "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids"* originally developed at the University of California, Berkeley. Ganglia is typically run by administrators of High Performance Clusters, large groups of machines working together to complete tasks. While we have some machines organized into the traditional cluster configuration, for example for log crunching, we simply define a cluster under Ganglia as a group of machines that do similar things but don’t necessarily interact with one another. For example, all of our web servers in each site are one cluster and all caches another. Our databases are broken up into multiple clusters based on functionality. This way, boxes that should be doing the same kind of work can be easily compared against one another.

Once Ganglia is up and running you’ll see a number of system-level statistics reported by each node. You can also easily report custom metrics and have them appear on graphs along with the built-in metrics. Before the latest release (the 3.1.x line), this meant some code that calls gmetric plus an entry in the crontab to schedule execution of that code. Ganglia 3.1 offers an additional facility for injecting custom metrics that is easy to use and offers some additional power and flexibility over the gmetric + cron approach.
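
The old approach is still handy to know about: drop a tiny script like this into cron and gmond picks the metric up alongside the built-in ones (a sketch – the metric name and the way it’s computed are invented):

	#!/usr/bin/perl
	# Run from cron every minute: compute a number, hand it to gmetric.
	use strict;
	use warnings;

	chomp( my $established = `netstat -an | grep -c ESTABLISHED` );

	system( 'gmetric',
		"--name=established_connections",
		"--value=$established",
		"--type=uint32",
		"--units=connections",
	) == 0 or warn "gmetric failed: $?";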

For more information on how Ganglia works, see the excellent references on the Ganglia Wiki . The community is active and very helpful.

If you’re just getting started with Ganglia, here are some pointers to save you headaches later:

  • Don’t use the default multicast address 239.2.11.71 for any of your clusters. Sooner or later you will start gmond without a config file, it will join that first cluster you defined, and you will not be happy with how messy your summary graphs become.
  • Do put your RRDs on a ramdisk/tmpfs. Your disk(s) will thank you. Don’t forget to set up a job to sync those files to some persistent storage periodically – we do it every 10 minutes.
  • Do consider the frequency of data collection and how long you need to keep the data around (and at what resolution). Do you need to know what the 5 minute load average was on your hosts one year ago today (maybe in relation to other metrics)? Do you need to know how many uploads per second those hosts were doing then (certainly)? While individual metric stores can be resized to keep data around for various amounts of time, typically it’s easier to find the least common denominator – the largest amount of time you think you need this information for – and set that in gmetad.conf. The default gmetad config stores data as follows (the matching RRAs line is shown after this list):
    • 15 second average for about an hour
    • 6 minute average for about a day
    • 42 minute average for about a week
    • 2 hour 48 minute average for about a month
    • 1 day average for about a year
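
For reference, those defaults correspond to an RRAs line in gmetad.conf along these lines – check the config file shipped with your version before copying it:

	RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"

With gmetad’s 15-second step, those five archives work out to roughly the resolutions and retention periods listed above.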

Once you’re up and running with Ganglia you’ll have access to something like the graph from the last post, what we call a stacked graph. Stacked graphs are intended to provide a quick overview of the relative performance of each host in a cluster. The code has been submitted to the Ganglia project for all to use.

Check out http://ganglia.info for more information and stay tuned for more posts about how Flickr uses Ganglia.

What’s in a Resource?

“Flickr is incredibly lucky, because the core of what we do is focused around images.”

If you’ve ever heard me talk about Internationalizing Flickr, you’ve probably heard me say those words. And it’s true – more than almost any other website, we deal primarily with content which is language-agnostic (and, to a great extent, culturally agnostic).

No matter where we live or what language we speak, our reactions to, say, fluffy kittens have a remarkably similar range (being, approximately, “aww”, “ew” or “achoo!”…)

When we first began to define what Flickr’s international incarnation would look like, our primary concern was preserving this global, cross-cultural feeling – a sense that our members’ photos and our visitors come from all over the world, but that the images on Flickr can “speak” to anyone.

It’s for that reason that Flickr isn’t divided into national silos – there’s no Flickr France, Flickr Brazil or Flickr USA. Much like Highlander, there can be only one Flickr, and it happens to be accessible in a multi-lingual interface.

All well and good, so far.

But the biggest issue I wrestled with (and occasionally still do) was what to do with the site’s URLs. The structure we planned for the site required us to take a definite position on a philosophical issue – what, in actual fact, constitutes the “Resource” defined by the Uniform Resource Locator?

There are two possible schools of thought on this – one which would argue that the photo page http://www.flickr.com/photos/hitherto/257018778/ with a French interface is materially different to the same page when presented with an English interface. The French page, we might argue, should be named http://fr.flickr.com/photos/hitherto/257018778/, http://www.flickr.com/fr-FR/photos/hitherto/257018778/ or something similar.

On the other hand (and, in fact, the hand we eventually chose), the real “resource” here is the photo (and associated metadata) which the page points to – the interface is immaterial to the content we’re presenting. The big advantage of this approach, especially in a multi-lingual world, is that everyone gets the experience best suited to them.

A Korean friend, for example, can send me a link to a photo and I will see it on Flickr with the English interface familiar to me. She, meanwhile, can happily browse Flickr in Korean.

Things perhaps get murkier if we consider other areas of the site – the FAQs at http://flickr.com/help/faq/, for example. Here, all the content is provided by Flickr, and since all of it is in a particular visitor’s chosen language, the entire resource is different.

Even here, though, the principle can be made to apply. If an English-speaking German friend asks where they can get help on Flickr, I don’t have to know which language they prefer to view the site in; I can just point them to the FAQ page, and they will have access to the resource (“help with Flickr”) which they needed.

Now, admittedly, working for a large multi-national company puts me in contact with more than my fair share of people who speak multiple languages, so maybe this matters more to me than to most people. But as our society grows more connected and more mobile, I like to think that the incidences of these kinds of cross-cultural exchanges will only grow.

The biggest downside to our current URL approach comes when search engines try to index our content. Since we don’t have language-specific URLs (and search engine crawlers aren’t in the habit of changing their language settings and retaining the necessary cookies), everything which search engines index on Flickr comes from our English-language interface.

As it happens, depending on how smart the search engines are feeling, this isn’t too much of a problem – we do try to surface photo titles and descriptions so that the abstracts make sense. Still, the results returned by Yahoo! and Google for “Buzios” (a beach resort peninsula near Rio in Brazil) give some idea of the nature of the problem.

Occasionally, when I’m hit by a case of perfectionism, such things keep me awake at night. And I’m sure that someone, somewhere is wailing and gnashing their teeth over how “Flickr are doing URLs wrong”.

All in all, though, I think we made the right decision.

Flickr Engineers Do It Offline

It seems that using queuing systems in web apps is the new hotness. While the basic idea itself certainly isn’t new, its application to modern, large, scalable sites seems to be. At the very least, it’s something that deserves talking about — so here’s how Flickr does it, to the tune of 11 million tasks a day.

But first, a use case! Every time you upload a photo to Flickr, we need to tell three different classes of people about it: 1) You, 2) Your contacts, 3) Everyone else. In a basic application, we’d have a table of your photos and a table containing people who call you a contact. A simple join of these tables, every time someone checks the latest photos from their contacts, works fine. If you’re browsing around everyone’s photos or doing a search, you can just check the single photos table.

Obviously not going to fly here.

For scale, Flickr separates these three lookups into three different places. When you upload that photo, we immediately tell you about it, and get out of your way so you can go off and marvel at it. We don’t make you wait while we tell your contacts about it, insert it into the search system, notify third-party partners, etc. Instead, we insert several jobs into our queueing system to do these steps "later". In practice, every single one of these actions subsequently takes place within 15 seconds of your upload while you’re free to do something else.

Currently, we have ten job queues running in parallel. At creation, each job is randomly assigned a queue and inserted — usually at the end of the queue, but it is possible for certain high-priority tasks to jump to the front. The queues each have a master process running on dedicated hardware. The master is responsible for fetching jobs from the queue, marking them as in-process, giving them to local children to do the work, and marking them as complete when the child is finished. Each task has an ID, task type, arguments, status, create/update date, retry count, and priority. Tasks detect their own errors and either halt or put themselves back in the queue for retry later, when hopefully things have gotten better. Tasks also lock against the objects they’re running on (like accounts or groups) to make sure that other tasks don’t stomp on their data. Masters have an in-memory copy of the queue for speed, but it’s backed by mysql for persistent storage, so we never lose a task.
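
To make the shape of that concrete, here’s a toy illustration of a task record and a (grossly simplified) master loop. It’s a sketch only – the field names come from the description above, everything else is invented, and as the next paragraph notes, the real system isn’t even written in Perl:

	#!/usr/bin/perl
	# Toy illustration of the shape of a task record and a master loop.
	use strict;
	use warnings;

	# A dispatch table of task types to handlers.
	my %handlers = (
		notify_contacts => sub { print "tell contacts about photo $_[0]{photo_id}\n"; 1 },
		index_photo     => sub { print "add photo $_[0]{photo_id} to search\n";       1 },
	);

	# In-memory stand-in for the persistent queue; fields as described above.
	my @queue = map { +{
		id => $_, type => ( $_ % 2 ? 'notify_contacts' : 'index_photo' ),
		args => { photo_id => 678 }, status => 'queued',
		created => time(), updated => time(), retries => 0, priority => $_,
	} } 1 .. 4;

	# The master's loop: lowest priority number runs first; failures go back on
	# the queue with their retry count bumped.
	for my $job ( sort { $a->{priority} <=> $b->{priority} } @queue ) {
		$job->{status}  = 'running';
		my $ok          = eval { $handlers{ $job->{type} }->( $job->{args} ) };
		$job->{status}  = $ok ? 'done' : 'queued';
		$job->{retries}++ unless $ok;
		$job->{updated} = time();
	}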

There are plenty of popular, off-the-shelf messaging servers out there. We wrote our own, of course. This isn’t just because we think we’re hot shit; there are also practical reasons. The primary one was that we didn’t need yet another piece of complicated architecture to maintain. The other was that for consistency and maintainability, the queue needed to run on the exact same code as the rest of the site (yes, that means our queuing system is written in php). This makes bug fixes and hacking easier, since we only need to do them in one place. So every time we deploy, the masters detect that, tell all their children to cleanly finish up what they’re doing, and restart themselves to take advantage of the latest and greatest.

It’s not just boring old notifications, privacy changes, deletes, and denormalization calculations for our queue, no. The system also makes it easy for us to write backfills that automatically parallelize, detect errors, retry, and back off from overloaded servers. As an essential part of the move-data-to-another-cluster pattern and the omg-we’ve-been-writing-the-wrong-value-there bug fix, this is quickly becoming the most common use of our queue.

The Flickr engineering team is obsessed with making pages load as quickly as possible. To that end, we’re refactoring large amounts of our code to do only the essential work up front, and rely on our queueing system to do the rest.

Cal Henderson Tuesday

We’re taking a break from Kittens this week. Mainly because I’m on Paternity Leave (see Kitten Tuesday for details) and therefore don’t have all week at work to select a suitable kitten, and no-one else on the engineering team is kitten qualified enough.

So instead I give you our very own Cal Henderson (the fellow who has kept Flickr from crashing all these years) presenting at djangocon, on the subject of “Why I Hate Django”.

We don’t use Django here at Flickr (because Cal hates it of course) but we do do a lot of scaling, and not only does Cal give a bit of insight into scaling and how things work here at Flickr, but he also talks about kittens a bit, which is nice.

So if you have an hour to spare (and frankly if you work with the internets you probably do) this is worth a watch…

[Link for RSS readers: http://www.youtube.com/watch?v=i6Fr65PFqfk]