The Shape of Alpha


We have a lot of geotagged photos

Almost ninety million as I write this, and the number keeps growing, especially as nearly every new smartphone released to market has not only a camera but also the ability to capture location information along with each photo.

For every geotagged photo we store up to six Where On Earth (WOE) IDs. These are unique numeric identifiers that correspond to the hierarchy of places where a photo was taken: the neighbourhood, the town, the county, and so on up to the continent. This process is usually referred to as reverse-geocoding.
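
If you want to poke at reverse-geocoding from the outside, the places API exposes a similar lookup. Here is a rough Python sketch (the API key is a placeholder and error handling is omitted) that asks flickr.places.findByLatLon which place a given coordinate falls into:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_KEY = "YOUR_API_KEY"  # placeholder; get a real key at flickr.com/services/api/

def places_for_point(lat, lon):
    params = urllib.parse.urlencode({
        "method": "flickr.places.findByLatLon",
        "api_key": API_KEY,
        "lat": lat,
        "lon": lon,
    })
    url = "https://api.flickr.com/services/rest/?" + params
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    # The response contains <place> elements carrying woeid, place_type
    # (neighbourhood, locality, county, and so on) and name attributes.
    return [(p.get("woeid"), p.get("place_type"), p.get("name"))
            for p in root.iter("place")]

# Downtown Montreal, from the example later in this post
print(places_for_point(45.512, -73.554))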

Over time this got us wondering: If we plotted all the geotagged photos associated with a particular WOE ID, would we have enough data to generate a mostly accurate contour of that place? Not a perfect representation, perhaps, but something more fine-grained than a bounding box. It turns out we can.

So, starting today there are 150,000 (and counting) WOE IDs with proper (-ish) shape data, available via the Flickr API. What kind of shapes, you ask?

Continents:

 

Countries:

 

States and cities:

 

Even neighbourhoods:

 

Each one of those illustrations represents the boundaries of a particular place whose outline was generated using nothing but the latitudes and longitudes of the geotagged photos associated with that location’s WOE ID. No GIS information was harmed in the creation of these shapes.

How cool is that?!

How does it work?

The short version is: Scary and complicated maths. The longer version is: We are generating alpha shapes using the set of unique latitudes and longitudes associated with a WOE ID. The long version, to quote Tran Kai Frank Da and Mariette Yvinec, is:

“Imagine a huge mass of ice-cream making up the space … and containing the points as hard chocolate pieces. Using one of those sphere-formed ice-cream spoons we carve out all parts of the ice-cream block we can reach without bumping into chocolate pieces, thereby even carving out holes in the inside (eg. parts not reachable by simply moving the spoon from the outside). We will eventually end up with a (not necessarily convex) object bounded by caps, arcs and points. If we now straighten all round faces to triangles and line segments, we have an intuitive description of what is called the alpha shape…”

(There are also some useful illustrations of what that all means on Francois Belair’s Everything You Always Wanted to Know About Alpha Shapes But Were Afraid to Ask website.)
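
If you want to experiment with the idea yourself, here is a rough two-dimensional sketch of that ice-cream metaphor in Python, using scipy’s Delaunay triangulation. To be clear, this is not how our pipeline works internally (Clustr, described below, is built on CGAL); it simply keeps the triangles that a scoop of radius alpha cannot carve away and returns their boundary edges:

import numpy as np
from scipy.spatial import Delaunay

def alpha_shape_edges(points, alpha):
    """Boundary edges of a crude 2D alpha shape.

    points: (n, 2) array of x,y (e.g. longitude, latitude) coordinates
    alpha:  radius of the ice cream spoon, in the same units as the points
    """
    points = np.asarray(points, dtype=float)
    tri = Delaunay(points)
    edge_count = {}
    for ia, ib, ic in tri.simplices:
        a, b, c = points[ia], points[ib], points[ic]
        # Circumradius R = (side1 * side2 * side3) / (4 * area)
        la = np.linalg.norm(b - c)
        lb = np.linalg.norm(a - c)
        lc = np.linalg.norm(a - b)
        s = (la + lb + lc) / 2.0
        area = max(np.sqrt(max(s * (s - la) * (s - lb) * (s - lc), 0.0)), 1e-12)
        circum_r = (la * lb * lc) / (4.0 * area)
        # Keep the triangles the spoon cannot fit into (and so cannot carve out)
        if circum_r <= alpha:
            for edge in ((ia, ib), (ib, ic), (ia, ic)):
                key = tuple(sorted(edge))
                edge_count[key] = edge_count.get(key, 0) + 1
    # Edges that belong to exactly one surviving triangle form the outline
    return [edge for edge, n in edge_count.items() if n == 1]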

The community of authority and the authority of community

This is the important part: Many, if not most, of these shapes will look a little weird. Possibly even “wrong”. This is both okay and to be expected for a few reasons, at least.

  • Sometimes we just don’t have enough geotagged photos in a spot to make it possible to create a shapefile. Even when we do have enough points to create a shape, there may not be enough to create one that you’d recognize as the place where you live. We chose to publish those shapes anyway, both because they show what we do and don’t know about a place and to encourage users to help us fix mistakes.

    If the shape of the neighbourhood is incomplete or looks weird, why not consider organizing a photowalk around its edges and, when you get home, adding the photos to the map? The next time we generate a new shapefile for that neighbourhood it should look more like the place you recognize!

  • We did a bad job reverse geocoding photos for a particular spot and they’ve ended up associated with the wrong place. We’ve learned quite a lot about how to do a better job of it in the last two years, but there is a limit to how much human awareness and subtlety we can codify. Sometimes the data we have to try to work out what’s going on is just bad or out of date.

    Ultimately, that’s why we added the tools to help users correct their geotagged photos so that we can adjust things to their understanding of the world and begin to map facts on the ground rather than from on high.

  • We are not very sophisticated yet in how we choose the size of the alpha variable when we generate shapes. As far as we can tell no one else has done this sort of thing, so, like reverse-geocoding, we are learning as we go. For example, with the exception of continents and countries, we boil all other places down to a single contiguous shape. We do this by slowly cranking up the size of the ice cream scoop, which in turn can lead to a loss of fidelity.

    Does the shape of Florida, or of Italy, include the waters that lie between the mainland and the surrounding islands? That’s not usually how we imagine the territory a place occupies, but I think it probably does. On the other hand, including the ocean between California and Hawaii as part of the United States would be kind of dumb.

  • Are any of these the correct decisions? We’re not sure yet.

A concrete example to illustrate the point: here are two versions of the island of Montreal, each generated from the same set of coordinates. The version on the left used an alpha number (an ice cream spoon) large enough to ensure that only a single shape was created; the version on the right allowed for two shapes.

 

What’s going on, then? All those photos taken in St. Jean-sur-Richelieu (20 minutes south of Montreal) were added to the map back when we first added geotagging to the site, when the information we had about the province of Quebec was not as detailed as it is now. Ultimately, we decided to include the version on the left because, as Matt Jones recently said:

“The long here that Flickr represents back to me is becoming only more fascinating and precious as geolocation starts to help me understand how I identify and relate to place. The fact that Flickr’s mapping is now starting to relate location to me the best it can in human place terms is fascinating – they do a great job, but where it falls down it falls down gracefully, inviting corrections and perhaps starting conversation.”

As with any visualization of aggregate data, there are likely to be areas of contention. One of the reasons we’re excited to make this stuff available is that much of it simply isn’t available anywhere else, and the users and the developer community who make up Flickr have a gift for building magic on top of the API, so we’re doubly excited to see what people do with it.

That said, please remember that this is an aggregate view of things and should be treated more like the zeitgeist of a place than as capital-C Confirmed facts on the ground, or as our taking sides in any conflicts (real, imagined or otherwise) between friends and neighbours.

These are not maps you should use to guide your spaceship back to Earth but they’re probably good enough to explore the possibilities.

Clustr

UR DOTZ I HAS DEM

Meanwhile, back in the nuts-and-bolts department: The actual alpha shapes are generated by a program called Clustr, written for us by the fantabulous Schuyler Erle.

Clustr is a command-line application written in C++ that takes as its arguments the path to a file containing a list of points (the hard chocolate pieces) and an alpha parameter (the ice cream spoon) and generates a shapefile describing the contour (the alpha shape) of that list. Anecdotally, we’ve seen Clustr plow through a file with four million unique coordinates (representing the continental United States, Alaska and Hawaii) in under three minutes on some pretty modest hardware.

The shapedata for a WOE ID is available via the Flickr API using the flickr.places.getInfo method.

Not all places have shape data yet so the root <place> element contains a has_shapedata attribute for checking at a glance. Otherwise you can test for the presence of a <shapedata> element. It will look like this:

<place place_id="4hLQygSaBJ92" woeid="3534"
	latitude="45.512" longitude="-73.554"
	place_url="/Canada/Quebec/Montreal" place_type="locality"
	name="Montreal, Quebec, Canada"
	has_shapedata="1">
	
   <!-- all the usual places hierarchy elements -->

   <shapedata created="1223513357" alpha="0.012359619140625"
      count_points="34778" count_edges="52">
      <polylines>
         <polyline>
            45.427627563477,-73.589645385742 45.428966522217,-73.587898254395, etc...
         </polyline>
      </polylines>
   </shapedata>

</place>	
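
By way of illustration, here is a hedged Python sketch of fetching that response and pulling the polylines out as lists of (latitude, longitude) pairs; the API key is a placeholder and error handling is left out:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_KEY = "YOUR_API_KEY"  # placeholder

def shape_for_woeid(woe_id):
    params = urllib.parse.urlencode({
        "method": "flickr.places.getInfo",
        "api_key": API_KEY,
        "woe_id": woe_id,
    })
    url = "https://api.flickr.com/services/rest/?" + params
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())

    place = root.find("place")
    if place is None or place.get("has_shapedata") != "1":
        return None  # no shape has been generated for this place (yet)

    polylines = []
    for polyline in place.find("shapedata").find("polylines"):
        # Space-separated "lat,lon" pairs, as in the example above
        points = [tuple(float(v) for v in pair.split(","))
                  for pair in polyline.text.split()]
        polylines.append(points)
    return polylines

print(shape_for_woeid(3534))  # Montreal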

Sometime next week, we will also include links to a real live ESRI shapefile, the well-known and mostly-loved lingua franca of the GIS community, for each WOE ID. They were supposed to be included with this release but, because of a last-minute glitch, they need to be prettied up a little first. We think the inclusion of the polylines will be enough to keep people busy until then.

Update: The first round of (ESRI) shapefiles has been reprocessed and is now available via the API. Shapefiles are included as a link to a compressed file you can download separately. For example:

    <urls>
       <shapefile>
          http://farm4.static.flickr.com/9999/shapefiles/3534_20081020_S33KR3TSHAPE.tar.gz
       </shapefile>
    </urls>

Our plan is to generate new renderings on a fairly regular basis, something like every month or two, though we haven’t firmed up any of those details yet. We’ll post about them here or on the API mailing list as things are worked out.

But wait, there’s more!

Along with the shape data the source code for Clustr is available in the Flickr Code repository and through our trac install, distributed under the GPL (version 2).

Clustr has two major dependencies, not included with the source, that you will need to install yourself in order to use it. They are the Computational Geometry Algorithms Library (CGAL) and the Geospatial Data Abstraction Library (GDAL). Both are relatively straightforward to install on Linux and BSD flavoured operating systems; Windows and OS X are still a bit of a chore.

You probably won’t be able to download Clustr and simply plug it into your awesome web application today, but I am hopeful that in time the community will develop higher-level language (Perl, Python, Ruby, you name it…) bindings to make it easier and faster to write tools that build on the work we’ve done so far.

photo by Julian Bleecker

By the way, there is still a known-known bug in Clustr’s rendering of interior rings (the donut holes where there are no geotagged photos) in shapefiles. Specifically, the holes are rendered as actual polyline records. You can see an example of the problem in the screenshot of the shapefile for North America, above. We hope to have a proper patch in place by the time we make the ESRI files available next week. Since the problem only manifests itself for countries and continents, it seemed like a reasonable trade-off for a version 0.1 release.

Update: Clustr 0.2, with a fix for errant interior rings, has been checked in to the code.flickr.com SVN repository.

Finally, these are early days and this is very much a developer’s release so we look forward to your feedback and also hope you will be understanding as we learn our way around the gotchas and quirks that will no doubt pop up.

In other geostuffs

In other geostuffs, we have enabled OpenStreetMap tiles for two more cities: Baghdad and Kabul. George has written a fantastic post highlighting some of the photos we’ve found in both cities, so go and have a look.


Map data CC-BY-SA 2008 OpenStreetMap.org contributors

Enjoy!

KrazyDad » Sunrise, Sunset

KrazyDad, who kicked off our developer 5 Questions series the other week, has a post up looking at animated visualizations of sunrise and sunset.

“If you pay close attention to the differences between the two photos, particularly Florida and the Great Lakes, you’ll see that people prefer to photograph the sun sinking (or rising) behind a body of water. Hence eastern-facing coastlines tend to accumulate more sunrise photos, and western-facing coastlines tend to accumulate sunset photos.”

KrazyDad » Sunrise, Sunset

Counting & Timing

Here at Flickr, we’re pretty nerdy. We like to measure stuff. We love measuring stuff. The more stuff we can measure, the better we understand how the different parts of the website work with each other. There are two types of measurement we especially like to do – counting and timing. These exciting activities help us to know what’s happening when things break: if a page is taking a long time to load, where is that time being spent, and what tasks have we started to do more of?

Counting

Our best friend when it comes to stats collection and display is RRDTool, the Round-Robin Database. RRD is a magical database that uses a fixed amount of space for storing stats over time. It does this by storing data at progressively lower resolution as time goes by: high-resolution per-minute data for the last few hours, down to lower per-day resolution data for last year (the resolution is actually configurable). Since you generally care less about detailed data in the past, this works really well. RRD typically works in two modes: you feed it a number every so often, and that number can either be a gauge (such as the current temperature) or a counter (such as the number of packets sent through a switch port). Gauges are easy, but RRD is also clever enough to work out how counters change over time. It then lets you graph these data points in interesting ways.

Counter - Basic RRD
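
To make the gauge/counter distinction and the decreasing resolution concrete, here is a small illustrative sketch using the Python rrdtool bindings (assuming they are installed; the same DS/RRA strings work with the rrdtool command line). The archive sizes are examples, not the ones we actually use:

# Illustrative only: one counter data source, archived at decreasing
# resolution (per-minute for a day, per-hour for a month, per-day for a year).
import rrdtool

rrdtool.create(
    "cache_connects.rrd",
    "--step", "60",                    # expect an update every 60 seconds
    "DS:connects:COUNTER:120:0:U",     # a counter, unlike a GAUGE such as temperature
    "RRA:AVERAGE:0.5:1:1440",          # 1-minute averages for a day
    "RRA:AVERAGE:0.5:60:720",          # 1-hour averages for a month
    "RRA:AVERAGE:0.5:1440:365",        # 1-day averages for a year
)

# Feed it the current value of the ever-increasing counter; RRD works out the
# per-second rate from the deltas between updates.
rrdtool.update("cache_connects.rrd", "N:%d" % 123456)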

The problem with RRD is that you often want to count some kind of event that doesn’t have a natural counter associated with it. For instance, we might want to count how many times we connect to a certain cache server from a particular piece of code, so that we can graph it. We know when we connect to it in the code, but there’s no global number tracking that count. We need some central counter which we increment each time we connect, and which we can then periodically feed into RRD. This quickly gets further complicated when there are multiple machines running the code – every one of Flickr’s web servers will run it, but we care about how many times they connect to the cache server in total.

The easiest way to do this is to use a database. When we connect to the cache server, connect to the database and increment a counter. Some other job can then periodically connect to the database, read the counter and store it in RRD. Simple! However, the operation of connecting to the database and performing the write does not have zero cost. It takes some amount of time which may appear trivial at first, but doing that several hundred times a second (lots of the operations we want to track happen very often), multiplied by hundreds of different things we want to track, quickly adds up. It’s a little bit quantum theory-ish – the act of observing changes the observed reality. When we’re counting very fast operations, we can’t spend much time on the counting.

Luckily there’s an operation we can perform that’s cheap enough to do very often, while still allowing us to easily collect data from multiple servers at once: UDP. The User Datagram Protocol is TCP’s ugly kid brother. Unlike TCP, UDP does not guarantee delivery of packets, nor the order in which they’ll be received. In return, packets are a lot faster to send and use less bandwidth. When counting frequent stats, losing the odd packet doesn’t really hurt us, especially when we can also graph when we’ve been dropping packets (under Linux, try netstat -su and look for “packet receive errors”). Changing the UDP buffer size (kernel variable net.core.rmem_max) allows you to queue more packets before processing them, if you find you can’t keep up.

In our client code, this is very easy to do. Every language has a different approach (of course!), but they’re all straightforward.

Perl

	my $sock = new IO::Socket::INET(
			PeerAddr	=> $server,
			PeerPort	=> $port,
			Proto		=> 'udp',
		) or die("Could not connect: $!");

	print $sock "$value\n";

	close $sock;

PHP

	$fp = fsockopen("udp://$server", $port, $errno, $errstr);
	if ($fp){
		fwrite($fp, "$value\n");
		fclose($fp);
	}

The trickier part is the server component. We need something that will receive packets, increment counters and periodically save the counter to RRD files. At Flickr we developed a small Perl daemon to do this, using an alarm to save the counters to RRD. The UDP packet simply contains the name of the counter to increment. We can then start to track interesting operations over time:

Counter - Simple
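
Our daemon is written in Perl, but the overall shape is simple enough to sketch. Here is a toy Python version of the same idea; the port number is arbitrary and flush() is a stub where the real daemon would write into RRD files:

# Toy version of the collection daemon described above: each UDP packet is a
# counter name; every PERIOD seconds the totals are handed to flush(), where a
# real daemon would call rrdtool.update() on the relevant RRD files.
import socket
import time
from collections import Counter

HOST, PORT, PERIOD = "0.0.0.0", 8125, 60  # illustrative values
counters = Counter()

def flush(snapshot):
    # Stub: the real daemon writes these totals into RRD here
    for name, count in snapshot.items():
        print(time.strftime("%H:%M:%S"), name, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind((HOST, PORT))
sock.settimeout(1.0)  # wake up regularly so we notice the period ending

deadline = time.time() + PERIOD
while True:
    try:
        data, _addr = sock.recvfrom(1024)
        counters[data.decode().strip()] += 1
    except socket.timeout:
        pass
    if time.time() >= deadline:
        flush(counters)
        counters.clear()
        deadline += PERIOD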

The next stage was to change our simple counters into ones which recorded state. We’re performing a certain operation 300 times a second, but how often is it failing? RRD allows you to store multiple counters in a single file and then graph them together in a variety of ways, so all we needed was a little modification to our collection daemon.

Counter - Failures

The red area on the left shows some failures. Some tasks have relatively few failures, but ones which are still important to see. That’s easy enough too: we just produce two graphs with different Y-axis scales (something that RRD does auto-magically for us).

Counter - Small Failures

Timing

Counting the number of times we perform a certain task can tell us a lot, but often we’ll want to know how long that task took. The simplest way to do this is to perform a task periodically and graph the result. This is pretty simple and something that we use Ganglia for to good effect.

Timings - Simple

The problem with this is that it just tells us the time it took to perform our single test task. This is useful for looking at macro problems, but is useless against tasks that can take a variable amount of time. If a database server is processing one in ten tasks slowly, then that will appear on this graph as a spike (roughly) every ten samples – it’s impossible for us to tell the difference between a server that processes 10% of requests slowly and a server that has periodic slow periods. We could take an average time (by storing an accumulator along with the counter, and dividing at the end of each period) and this gets us closer to useful data.

What we really need to know, however, is something like our standard deviation – both what our average is and where the bulk of our samples lie. This allows us to distinguish between a request that always takes 20ms and one that takes 10ms half of the time and 30ms half of the time. To achieve this, we changed our collection daemon to keep a list of timings. At the end of each sampling period (somewhere between 10s and 60s, depending on the thing we’re monitoring), we then sort these times, find the mean and the first and third quartiles. By storing each of these values (along with a min and max) into RRD every period, we can show the average timing along with the tolerance.

Timings - Quartiles

So our task takes 250ms on average, with 75% finishing within ~340ms and 25% finishing within ~190ms. The light green shaded area shows the lowest 25%. We don’t shade the topmost 25%, since random spikes can cause the Y-axis to become so large that it makes the graph difficult to read. We don’t have this problem with the lowest quartile, since nothing can be faster than zero (and we always show zero on the Y-axis to avoid scale bias).
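
The per-period summary described above boils down to something like the following minimal sketch (the real daemon stores these five values into RRD at the end of each period):

# Minimal sketch of the per-period timing summary: given the timings collected
# during one sampling period, produce the values that get stored in RRD.
def summarize(timings_ms):
    ordered = sorted(timings_ms)
    n = len(ordered)
    return {
        "min": ordered[0],
        "q1": ordered[n // 4],        # first quartile (lower 25% boundary)
        "mean": sum(ordered) / n,
        "q3": ordered[(3 * n) // 4],  # third quartile (upper 25% boundary)
        "max": ordered[-1],
    }

# e.g. one period's worth of samples
print(summarize([190, 210, 240, 250, 260, 280, 340, 360]))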

Bringing it all together

The next step is to tie together the counting with the timing, to allow us to see how the rate of an operation affected the time it took to perform. By simply lining up graphs below each other, we can easily see relationships between these different measures:

Timings - Correlate Simple

Of course, we like to go a little measurement-crazy, so we tend to sample as many things as we can to look for patterns. This graph from our Offline Task System shows how often we run a particular task and how long it took to execute, but also how long it was queued before we ran it.

Timings - Correlate

The code for our timing daemon is available in the Flickr Code repository and through our trac install, so you can have a play with it yourself. It just requires RRDTool and some patience. Get graphing!

Lessons Learned while Building an iPhone Site

The Explore Page in the iPhone site

A few weeks ago we released a version of the Flickr site tailored specifically for the iPhone. Developing this site was very different from any other project I’ve worked on; there seems to be a new set of frontend rules for developing high-end mobile sites. A lot of the current best practices get thrown out the window in the quest for minimum page weight and fastest load times over slow cellular connections.

Here are a few of the lessons we learned (sometimes painfully) while developing this site.

1. Don’t Use a JavaScript Library or CSS Framework

This was one of the hardest things for me to come to terms with. I’m a huge fan of libraries, especially YUI, mostly because they allow me to spend my time creating new stuff instead of working around crazy browser quirks. But these libraries walk a fine line; by definition, they must work across a wide array of browsers and offer enough features to make them worth using. This means they potentially contain a lot of code that you don’t care about and won’t use. This code is dead weight to your site.

With such a high percentage of normal web users on broadband connections, we’ve gotten cavalier about what we can include in our pages. 250 KB of JavaScript or more isn’t uncommon for a large site these days. But for sites that are meant to be viewed over slow cellular connections like EDGE, 250 KB is an impossible amount of data. The only way to get the size of your JavaScript down is to selectively pull code out of libraries, and include only what you use. This means you can rip out code meant only for browsers that you won’t support (modular libraries like the new YUI 3.0 allow you to only include the code you use, preventing this problem somewhat).

The same goes for CSS. Frameworks make development faster and your final product more robust, but they, like the JavaScript libraries, include code for situations you won’t have to deal with. Every line in your CSS must be custom; each property must be scrutinized to ensure it’s needed.

2. Load Page Fragments Instead of Full Pages

Loading fragments saves 92.2% of the page size

When navigating through a site, most of what changes from page to page is the actual content; the JavaScript, CSS, header and footer stay mostly the same. We can use this to our advantage by only loading the part of each page that changes. We did this by hijacking all links of the page: when a link is clicked, we intercept the event, fetch the page fragment using Ajax, and insert the HTML into a new div. This has several benefits:

  • Since you control the entire life cycle of the page fetch, you can display loading indicators or a wireframe version of the page while new pages load
  • All pages that have been fetched will exist within the DOM; clicking the back button (or clicking on a link for a page that has already been fetched) results in an instantaneous page load
  • The page fragments are extremely small; ours are about 800 bytes (gzipped) on average

Using this system complicates your code a bit. You need JavaScript to handle the hijacking, the page fragment insertion, and the address bar hash changes (which allow the back and forward buttons to work normally). You also need your backend to recognize requests made with Ajax, and to only send the page content instead of the full HTML document. And lastly, if you want normal URLs to work, and each of your pages to be bookmarkable, you will need even more JavaScript.

Despite these downsides, the benefits can’t be ignored. The extra JavaScript code is a one-time cost, but the extra page content that we would have downloaded is saved for every page load. We found it was worth the complication and additional JS in order to dramatically reduce the time it took to load each page.

3. Don’t Build for Just One Device

It’s really tempting to build the site for just the iPhone: you can use modern CSS (including things like CSS3 selectors and transformations), you don’t have to hack around annoying browser quirks, and testing is extremely easy. But any single device, even one as ubiquitous as the iPhone, has a limited share of the mobile market, especially internationally. Rarely can you justify the cost of creating a one-off site for a very small number of your users.

Luckily the current generation of high-end mobile browsers is excellent in terms of support for modern features. Many phones use a WebKit derivative, including the iPhone, and Symbian and Android phones. Other phones either come with or can use Opera Mobile or the new mobile version of Firefox (called Fennec). For the most part, very few changes are needed in order to support these browsers.

Most of the differences lie in layout. It’s important to structure your pages around a grid that can expand as a percentage of the page width. This allows your layouts to work on many different screen sizes and orientations. The iPhone, for example, allows both landscape and portrait viewing styles, which have vastly different layout requirements. By using percentages, you can have the content fill the screen regardless of orientation. Another option is to detect viewport width and height, and use JavaScript to dynamically adjust classes based on those measurements (but we found this was overkill; CSS can handle most situations on its own).

4. Optimize Everything

The browsers on mobile devices operate under much stricter constraints than their desktop cousins. Slower CPUs, smaller amounts of memory, and smaller hard drives mean that less data can be cached. On the iPhone, for instance, only files smaller than 25 KB are cached. This puts very specific limits on the size of your files. For a large site like Flickr, 25 KB worth of JavaScript and CSS barely scratches the surface. To put our files under the limit, we ran everything through the YUI Compressor using the most aggressive settings. We ran all images through compression tools as well (we like pngout and Smushit), reducing each image file by an average of 40%. We also made heavy use of sprites, where possible.

In the end, we were able to go from 90+ second load times over EDGE to less than 7 seconds for an empty-cache experience. Using page fragments, we are able to load and display new pages in under a second (though the images in those pages take longer to load). These are not trivial gains, and they make the difference between a good mobile experience and one that is so awful the user gives up halfway through the page load.

5. Tell the User What is Happening

Once we hijacked all click actions in order to load page fragments, it wasn’t always clear to the user whether anything was happening when they clicked on a link. This is especially true on touch devices, where it is difficult to know if the device even detected your action. To combat this problem, we added loading indicators to every link. These tell the user that something is happening, and reassure them that their action was detected. They also make the pages seem to load much faster, since something happens right away; if the indicators weren’t there, it would seem like nothing was happening for a few seconds, and then the page would load suddenly. In our testing, these indicators were the difference between a UI that seems snappy and responsive, and one that seemed slow and inconsistent.

Loading indicators

One Easy Option

The iUI framework implements a lot of these practices for you, and might be a good place to start in developing any mobile site (though keep in mind it was developed specifically for the iPhone). We found it especially useful in the early stages of development, though eventually we pulled it out and wrote custom code to run the site.

Kitten Tuesday, ASCII Edition

Flickr Asciified - Nihilogic

Jason over at Nihilogic has a post up called Getting your ASCII on with Flickr

“You simply enter the search query of your choice and click the “Asciify” button. A request is then sent off to the Flickr server and any photos that match the query are returned. The image data is analyzed and turned into ASCII characters.”

Obviously I made a Kitten and you too can get flickrasciified over here.

Original photo by pontman.

A Little About The Flickr Bike


A sum of its parts @ Yahoo! Video (for ppl using RSS readers)

Shanan’s been driving me insane riding the Flickr Bike around the office. If you don’t know about the Flickr Bike: in a nutshell, we have 20 purple bikes tricked out with Nokia N95s geotagging photos and uploading them to Flickr (of course) as people cycle around. There’s a video (Flickr’s handlebar cameras) over on CNET, and a frankly awesome write-up on Lifehacker.

This being the Dev Blog, we’re interested in looking under the hood, to use an awful metaphor, and finding out how things work. Fortunately for us Josh wrote up a blog post last week, including links to source code and all that good stuff, meaning I can simply point to that. Huzzah for the internet!

Read techy stuff over at ~~> Coding a Networked Bike

If you’re more of a moving-pictures-with-sounds than words person, then Lifehacker has 6 videos up that are worth a watch: The Making of the Flickr Bikes. The second of these is the one I included at the start of this blog post.

Oh and fyi, they’re Electra bikes painted #7B0099 (Pantone 2602 C).

5 Questions for Jim Bumgardner

We’ve been keeping a careful eye on our Sister Blog to see what they’re up to. Something that’s particularly caught our eye is "5 Questions", asking the same 5 questions of the Flickrverse, with the last question being who we should ask next. And so, we hope, it goes on and on.

Banana-bub

This is our version, asking questions of those that develop, hack and fiddle with Flickr in new and interesting ways. Of course we couldn’t start with anyone else but KrazyDad (aka Jim Bumgardner).

Jim founded the Flickr Hacks group back in the day, a great place to hang out and ask questions if you want to learn how to bend Flickr to your will. In 2006 he also coauthored the Flickr Hacks book for O’Reilly and, happily for us, he hasn’t stopped tinkering with Flickr yet.

So, without any further ado, 5 Questions for Jim Bumgardner:

1. What are you currently building that integrates with Flickr, or a past favorite that you think is cool, neat, popular and worth telling folks about? Or both.

Jim: It seems like I’m always building something that integrates Flickr. A recent favorite is this interactive mosaic that shows the most interesting photos of the week.

The photos are arranged to form a spiral, a form that appears quite frequently in my work.

www.coverpop.com/pop/flickr_interesting
Coverpop: Most Interesting Photos of the Week

I have a "cron job" which runs on one of my computers at home, which updates this mosaic every week, so the photos in it are always fresh. Incidentally, I prefer to call this process a "cron joy." Oh, nerd humor…

2. What are the best tricks or tips you’ve learned working with the Flickr API?

Jim: I think every Flickr hackr should have access to a powerful high-level graphics library. My library of choice is ImageMagick combined with the Perl programming language (it also works nicely with Ruby), but the GD library, which works with various languages, and PIL, for Python, are also good.

I not only use ImageMagick for building mosaics and graphs, but also for "under the hood" kinds of things, like measuring the average colors of photos for the Colr Pickr (see below).

3. As a Flickr developer what would you like to see Flickr do more of and why?

Jim: One of the very first Flickr hacks I made was the Colr Pickr

www.krazydad.com/colrpickr/

…which allows photos to be selected by color. Since that appeared, I’ve worked on, and seen, some fancier variations on the concept that allow larger quantities of Flickr photos to be selected using multiple colors. But all these systems require that thousands or even millions of thumbnails be downloaded and analyzed for color. This is because Flickr does not supply "average color" information in its APIs, and so cannot provide the color search functionality that this data would enable.

flickr Colr Pickr

I would like to see Flickr provide, via its APIs, the three most common colors in each photo (using cluster analysis), and provide a way to search for photos which match one, two, or three colors. These parameters, similar to geocode searches, would need to be combined with some other search parameters, such as tags, to narrow the field down.

A feature like this would be a godsend to designers. I’ve got sample code for the color analysis, if anyone’s interested… :)

4. What excites you about Flickr and hacking? What do you think you’ll build next or would like someone else to build so you don’t have to?

Jim: One thing that excites me is the ability to access large quantities of photos that contain valuable metadata, such as the time the photo was taken, or the geocoded location. I used the ‘date taken’ data to construct this very cool graph of sunsets:

A year of sunsets

While most digital cameras store the time within photos these days, not enough of them automatically store the location. We have to rely on photographers adding the geocoded information manually, and sadly, not enough of them are geeky enough to do it. I’m looking forward to the day, a few years from now, when most of the new photos on Flickr will also contain geocoded information, as this will allow me to make apps that enable a kind of instant photojournalism of heavily photographed events, such as rallies and parades. We’re seeing the beginnings of these kinds of apps now, but we’re barely scratching the surface.

5. Besides your own, what Flickr projects and hacks do you use on a regular basis? Who should we interview next?

mc-50 map of FlickrLand: flickr's social network

Jim: GustavoG has made some amazing graphs which exploit and illustrate the Flickr social network.

Dan: Thank you, Jim. Next up for our thrilling installment of 5 Questions: GustavoG.

Images from krazydad / jbum, earthhopper and GustavoG.

Flickr Digs Ganglia

A number of people have asked us: where did the pretty graphs from last week’s post come from?

My Day

The answer: Ganglia.

It takes a lot of hardware and software to make a site like Flickr run smoothly. The operations team is responsible for scaling up our monitoring platform to collect all of the metrics we need to make informed decisions about where and when to add new hosts and how urgently, understand how different types of hardware perform with similar real life workloads, determine bottlenecks, and more. Flickr uses Ganglia to collect, aggregate, and visualize those metrics.

So, what is Ganglia? Briefly, "Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids"* originally developed at the University of California, Berkeley. Ganglia is typically run by administrators of High Performance Clusters, large groups of machines working together to complete tasks. While we have some machines organized into the traditional cluster configuration, for example for log crunching, we simply define a cluster under Ganglia as a group of machines that do similar things but don’t necessarily interact with one another. For example, all of our web servers in each site are one cluster and all caches another. Our databases are broken up into multiple clusters based on functionality. This way, boxes that should be doing the same kind of work can be easily compared against one another.

Once Ganglia is up and running you’ll see a number of system-level statistics reported by each node. You can also easily report custom metrics and have them appear on graphs along with the built-in metrics. Before the latest release (the 3.1.x line), this meant some code that calls gmetric and an entry in the crontab to schedule the execution of that code. Ganglia 3.1 offers an additional facility for injecting custom metrics that is easy to use and offers some additional power and flexibility over the gmetric + cron approach.
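
For the pre-3.1 route, the cron-driven piece can be tiny. Here is a hedged Python sketch that shells out to the standard gmetric tool; the metric name and the source of its value are made up for illustration:

# Sketch of the gmetric-plus-cron approach for a custom metric. Assumes the
# stock gmetric binary is on the PATH and gmond is already running locally.
import subprocess

def report_uploads_per_second(value):
    subprocess.check_call([
        "gmetric",
        "--name", "uploads_per_sec",   # hypothetical metric name
        "--value", str(value),
        "--type", "float",
        "--units", "uploads/sec",
    ])

if __name__ == "__main__":
    report_uploads_per_second(42.0)  # run from cron every minute or so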

For more information on how Ganglia works, see the excellent references on the Ganglia Wiki. The community is active and very helpful.

If you’re just getting started with Ganglia, here are some pointers to save you headaches later:

  • Don’t use the default multicast address 239.2.11.71 for any of your clusters. At some point you will start gmond without a config file, it will join that first cluster you defined, and you will not be happy when your summary graphs get messy.
  • Do put your RRDs on a ramdisk/tmpfs. Your disk(s) will thank you. Don’t forget to set up a job to sync those files to some persistent storage periodically – we do it every 10 minutes.
  • Do consider the frequency of data collection and how long you need to keep the data around (and at what resolution). Do you need to know what the 5-minute load average was on your hosts one year ago today (maybe, in relation to other metrics)? Do you need to know how many uploads per second those hosts were doing then (certainly)? While individual metric stores can be resized to keep data around for various amounts of time, typically it’s easier to find the least common denominator – the largest amount of time you think you need this information for – and set that in gmetad.conf (there’s a snippet showing how that looks just after this list). The default gmetad config stores data as:
    • 15 second average for about an hour
    • 6 minute average for about a day
    • 42 minute average for about a week
    • 2 hour 48 minute average for about a month
    • 1 day average for about a year
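
In gmetad.conf that retention is expressed as a set of RRA definitions. With the stock 15-second step they look something like the following (this is roughly what the shipped example config contains, so double-check it against your own version before copying):

    # Roughly an hour, a day, a week, a month and a year of data at a 15s step
    RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"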

Once you’re up and running with Ganglia you’ll have access to something like the graph from the last post, what we call a stacked graph. Stacked graphs are intended to provide a quick overview of the relative performance of each host in a cluster. The code has been submitted to the Ganglia project for all to use.

Check out http://ganglia.info for more information and stay tuned for more posts about how Flickr uses Ganglia.

What’s in a Resource?

“Flickr is incredibly lucky, because the core of what we do is focused around images.”

If you’ve ever heard me talk about Internationalizing Flickr, you’ve probably heard me say those words. And it’s true – more than almost any other website, we deal primarily with content which is language-agnostic (and, to a great extent, culturally agnostic).

No matter where we live or what language we speak, our reactions to, say, fluffy kittens have a remarkably similar range (being, approximately, “aww”, “ew” or “achoo!”…)

When we first began to define what Flickr’s international incarnation would look like, our primary concern was preserving this global, cross-cultural feeling – a sense that our members’ photos and our visitors come from all over the world, but that the images on Flickr can “speak” to anyone.

It’s for that reason that Flickr isn’t divided into national silos – there’s no Flickr France, Flickr Brazil or Flickr USA. Much like Highlander, there can be only one Flickr, and it happens to be accessible in a multi-lingual interface.

All well and good, so far.

But the biggest issue I wrestled with (and occasionally still do) was what to do with the site’s URLs. The structure we planned for the site required us to take a definite position on a philosophical issue – what, in actual fact, constitutes the “Resource” defined by the Uniform Resource Locator?

There are two possible schools of thought on this – one which would argue that the photo page http://www.flickr.com/photos/hitherto/257018778/ with a French interface is materially different to the same page when presented with an English interface. The French page, we might argue, should be named http://fr.flickr.com/photos/hitherto/257018778/, http://www.flickr.com/fr-FR/photos/hitherto/257018778/ or something similar.

On the other hand (and, in fact, the hand we eventually chose), the real “resource” here is the photo (and associated metadata) which the page points to – the interface is immaterial to the content we’re presenting. The big advantage of this approach, especially in a multi-lingual world, is that everyone gets the experience best suited to them.

A Korean friend, for example, can send me a link to a photo and I will see it on Flickr with the English interface familiar to me. She, meanwhile, can happily browse Flickr in Korean.

Things perhaps get murkier if we consider other areas of the site – the FAQs at http://flickr.com/help/faq/, for example. Here, all the content is provided by Flickr, and since all of it is in a particular visitor’s chosen language, the entire resource is different.

Even here, though, the principle can be made to apply. If an English-speaking German friend asks where they can get help on Flickr, I don’t have to know which language they prefer to view the site in; I can just point them to the FAQ page, and they will have access to the resource (“help with Flickr”) which they needed.

Now, admittedly, working for a large multi-national company puts me in contact with more than my fair share of people who speak multiple languages, so maybe this matters more to me than to most people. But as our society grows more connected and more mobile, I like to think that the incidences of these kinds of cross-cultural exchanges will only grow.

The biggest downside to our current URL approach comes when search engines try to index our content. Since we don’t have language-specific URLs (and search engine crawlers aren’t in the habit of changing their language settings and retaining the necessary cookies), everything which search engines index on Flickr comes from our English-language interface.

As it happens, depending on how smart the search engines are feeling, this isn’t too much of a problem – we do try to surface photo titles and descriptions so that the abstracts make sense. Still, the results returned by Yahoo! and Google for “Buzios” (a beach resort peninsula near Rio in Brazil) give some idea of the nature of the problem.

Occasionally, when I’m hit by a case of perfectionism, such things keep me awake at night. And I’m sure that someone, somewhere is wailing and gnashing their teeth over how “Flickr are doing URLs wrong”.

All in all, though, I think we made the right decision.