5 Questions for Simon Willison


Simon Willison kindly took time out of his llama-spotting, Python-wrangling (Django co-creating), MP-expenses-crowdsourcing day to answer a few questions about his and his co-conspirator Natalie Downe’s latest journalistic foray.

1. What are you currently building that integrates with Flickr, or a past favorite that you think is cool, neat, popular and worth telling folks about? Or both.

Simon: Our main project at the moment is WildlifeNearYou.com. It started life as a holiday hacking project with twelve geeks on a Napoleonic sea fort in the Channel Islands, halfway between England and France, with no internet connection (see devfort.com for background info). Natalie and I continued to work on it after we returned to civilisation, and we finally put it live last month. 170 people have imported more than 6,500 photos from Flickr in just the past few weeks, so people seem to like it!

The site’s principal purpose is to help people see wildlife – both in the wild and in places like nature reserves or zoos. We ask people to report wildlife they have seen by adding trips, sightings and photos – they get to build up their own profile showing everything they’ve spotted, and we get to build a search engine over the top that can answer queries like “llamas near Brighton” or “otters near San Francisco”.

In addition to that core functionality, we have a couple of fun extra features based around people’s photos. Our users can import their pictures from Flickr, and use WildlifeNearYou’s species database (actually sourced from Freebase.com) to tag those photos. We then push the tags back to their Flickr account in the form of text tags and machine tags – if they tell us the location (e.g. London Zoo) we’ll geotag their photo on Flickr as well.
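
As a rough sketch (illustrative only, not WildlifeNearYou’s actual code), pushing a tag back uses Flickr’s flickr.photos.addTags method from an authenticated call; the species values and function name here are made up:

import requests

API = "http://api.flickr.com/services/rest/"

def push_tags(api_key, auth_token, photo_id, species_name, species_code):
    # A plain text tag plus a wlny: machine tag,
    # e.g. "giraffe wlny:species=s2f"
    tags = "%s wlny:species=%s" % (species_name, species_code)
    resp = requests.post(API, data={
        "method": "flickr.photos.addTags",
        "api_key": api_key,
        "auth_token": auth_token,  # needs write permission
        "photo_id": photo_id,
        "tags": tags,
        "format": "json",
        "nojsoncallback": 1,
    })
    resp.raise_for_status()
    return resp.json()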

If they don’t know what the animal in the photograph is, other users of the site can suggest a species. If the owner of the photograph agrees with the suggestion the species will be added to their list of sightings and the correct tags applied (and pushed through to Flickr).

Finally, we have a really fun crowdsourcing system for identifying and rewarding the best photos. If you go to http://www.wildlifenearyou.com/best/ we’ll show you two photos of the same species and ask you to pick the best one. Once we’ve collected a few opinions, we award gold, silver and bronze medals to the top three photographs for each species. The best photo is then used as the thumbnail for that species all over our site. Here are our best giraffe photos, as voted by the community:

http://www.wildlifenearyou.com/animals/giraffe/photos/best/

2. What are the best tricks or tips you’ve learned working with the Flickr API?

Simon: It really is amazing how much benefit we got out of pushing machine tags over to Flickr. One feature we got for free was slideshows – once you have a machine tag, it’s easy to compose a URL which will present all of the photos tagged with that machine tag as a slideshow on Flickr. Here’s one for all of our photos of London Zoo:

http://www.flickr.com/photos/tags/wlny:place%3Dp1/show/
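
Building that URL is just string formatting plus escaping the = in the machine tag. A quick sketch (the helper name is ours):

from urllib.parse import quote

def slideshow_url(machine_tag):
    # "wlny:place=p1" -> "http://www.flickr.com/photos/tags/wlny:place%3Dp1/show/"
    return ("http://www.flickr.com/photos/tags/%s/show/"
            % quote(machine_tag, safe=":"))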

Even better, our site can now be queried using the Flickr API! Here’s a fun example: finding all of the Red Pandas that have been spotted in Europe. WildlifeNearYou doesn’t yet have a concept of Europe, but Flickr’s API can search using Yahoo! WhereOnEarth IDs. We can find the WOEID for Europe using the GeoPlanet API:

http://where.yahooapis.com/v1/places.q('europe')?appid=[yourappidhere]

The WOEID for Europe is 24865675. Next we need the WildlifeNearYou identifier for red pandas, so we can figure out the correct machine tag to search for. We re-use the codes from our custom URL shortener for our machine tags, so we can find that ID by looking for the “Short URL” link on http://www.wildlifenearyou.com/animals/red-panda/ (at the bottom of the right-hand column). The short URL is http://wlny.eu/s2f, which means the machine tag we need is wlny:species=s2f

Armed with the WOEID and the machine tag, we can compose a Flickr search API call:

http://api.flickr.com/services/rest/?method=flickr.photos.search&machine_tags=wlny:species=s2f&woe_id=24865675&api_key=…

That gives us back a list of photos of Red Pandas taken in places that are within Europe. Add &extras=geo,url_s,tags to get back the tags, latitude/longitude and photo URL at the same time. The wlny: machine tags that come back can be used to link back to the place, species and trip pages – for example, wlny:place=p6p means the photo was taken at the place linked to by http://wlny.eu/p6p
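
Scripted end to end, the query might look something like this in Python (an illustrative sketch using the values from the example above, not WildlifeNearYou’s own code):

import requests

API = "http://api.flickr.com/services/rest/"

def red_pandas_in_europe(api_key):
    # flickr.photos.search filtered by a wlny machine tag and a WOEID
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "machine_tags": "wlny:species=s2f",  # red panda
        "woe_id": "24865675",                # Europe
        "extras": "geo,url_s,tags",          # lat/lon, thumbnail URL, tags
        "format": "json",
        "nojsoncallback": 1,
    }
    resp = requests.get(API, params=params)
    resp.raise_for_status()
    return resp.json()["photos"]["photo"]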

This is pretty powerful stuff, and it’s all a natural consequence of writing machine tags back to Flickr.

(editor’s note: or even a slideshow of European red pandas)

3. As a Flickr developer what would you like to see Flickr do more of and why?

Simon: One thing that would make my life an enormous amount easier would be a Flickr-hosted photo-picking application. For WildlifeNearYou I had to build a full interface for selecting photos from scratch, with options to search your photos, browse your sets, browse photos by place and so forth. The Flickr API makes this pretty easy to do from a back-end code perspective, but designing and implementing a pleasant front end is a pretty major job.

I’ve wanted to implement simple photo picking from Flickr on various other projects, but have been put off by the effort involved. WildlifeNearYou is the first time I’ve actually taken the challenge on properly.

What I’d love to see instead is an OAuth-style flow for selecting photos. I’d like to (for example) redirect my users to somewhere like http://www.flickr.com/pickr/?return_to=http://www.wildlifenearyou.com/selected/ and have Flickr present them with a full UI for searching and selecting from their photos. Once they had selected some photos, Flickr could redirect them back to http://www.wildlifenearyou.com/selected/?photo_ids=4303651932,4282571384,4282571396 and my application would know which photos they had selected.

This would make integrating “pick a photo / some photos from Flickr” into any application much, much easier.
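
To be clear, the /pickr/ endpoint is hypothetical – Flickr offers no such flow today – but the application’s side of the dance would be only a few lines. A sketch:

from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical endpoint: Flickr does not actually offer this.
PICKR = "http://www.flickr.com/pickr/"
RETURN_TO = "http://www.wildlifenearyou.com/selected/"

def picker_redirect_url():
    # Where we'd send the user to pick photos on Flickr
    return PICKR + "?" + urlencode({"return_to": RETURN_TO})

def photo_ids_from_callback(url):
    # Parse ?photo_ids=4303651932,4282571384,... from the redirect back
    qs = parse_qs(urlparse(url).query)
    return qs.get("photo_ids", [""])[0].split(",")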

On a less exotic note, we have to do quite a few bulk operations against Flickr and having a bulk version of the flickr.photos.getInfo call would make this a lot faster – just the ability to pass up to 10 photo IDs at a time would reduce the number of HTTP calls we have to make by a huge amount.

4. What excites you about Flickr and hacking? What do you think you’ll build next or would like someone else to build so you don’t have to?

Simon: We’re going to be working on WildlifeNearYou for quite a while – I certainly don’t expect to get tired of looking at people’s photos of wildlife for a long time. We have a bunch of improvements planned for the “best photo” feature – we want to start showing your medals on your profile, and maybe have a league table for the best photographers based on who has won the most “best picture of X” awards. Once we’ve improved our species and location data we should be able to break that down in to best photographer for a certain area, or even for a category of animal (best owl photographer is sure to be hotly contested).

We also want to streamline our “add trip” flow based on Flickr metadata. If you import a bunch of geotagged photos, we can guess that they were probably taken on a trip to London Zoo on the 5th of February, based on the location and date information from the pictures.

As for other projects… I’d love it if someone else would build the general purpose photo picker idea above – it doesn’t necessarily have to be Flickr, anyone could provide it as a service.

5. Besides your own, what Flickr projects and hacks do you use on a regular basis? Who should we interview next?

Simon: Matthew Somerville’s work is always interesting, and his current project, Theatricalia, is a single-handed attempt to create an IMDb for theatre productions. Naturally, he’s pulling in photos from Flickr based on his own machine tags.

Ticket Servers: Distributed Unique Primary Keys on the Cheap

This is the first post in the Using, Abusing and Scaling MySQL at Flickr series.

Ticket servers aren’t inherently interesting, but they’re an important building block at Flickr. They are core to topics we’ll be talking about later, like sharding and master-master. Ticket servers give us globally (Flickr-wide) unique integers to serve as primary keys in our distributed setup.

Why?

Sharding (aka data partitioning) is how we scale Flickr’s datastore. Instead of storing all our data on one really big database, we have lots of databases, each with some of the data, and we spread the load between them. Sometimes we need to migrate data between databases, so we need our primary keys to be globally unique. Additionally, our MySQL shards are built as master-master replica pairs for resiliency, which means we need to be able to guarantee uniqueness within a shard in order to avoid key collisions. We’d love to go on using MySQL auto-incrementing columns for primary keys like everyone else, but MySQL can’t guarantee uniqueness across physical and logical databases.

GUIDs?

Given the need for globally unique IDs, the obvious question is: why not use GUIDs? Mostly because GUIDs are big (128 bits, versus 64 for a BIGINT), and they index badly in MySQL. One of the ways we keep MySQL fast is that we index everything we want to query on, and we only query on indexes, so index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. Additionally, ticket servers give us sequentiality, which has some really nice properties, including making reporting and debugging more straightforward and enabling some caching hacks.

Consistent Hashing?

Some projects, like Amazon’s Dynamo, provide a consistent hashing ring on top of the datastore to handle the GUID/sharding issue. This is better suited to write-cheap environments (e.g. LSM trees), while MySQL is optimized for fast random reads.

Centralizing Auto-Increments

If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.

Of course at 60+ photos a second that table is going to get pretty big. We can get rid of all the extra data about the photo, and just have the ID in the centralized database. Even then the table gets unmanageably big quickly. And there are comments, and favorites, and group postings, and tags, and so on, and those all need IDs too.

REPLACE INTO

A little over a decade ago MySQL shipped with a non-standard extension to the ANSI SQL spec, “REPLACE INTO”. Later “INSERT ON DUPLICATE KEY UPDATE” came along and solved the original problem much better. However REPLACE INTO is still supported.

REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.

This allows us to atomically update a single row in a database in place, and get a new auto-incremented primary ID.

Putting It All Together

A Flickr ticket server is a dedicated database server, with a single database on it, and in that database there are tables like Tickets32 for 32-bit IDs, and Tickets64 for 64-bit IDs.

The Tickets64 schema looks like:

CREATE TABLE `Tickets64` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=InnoDB

SELECT * FROM Tickets64 returns a single row that looks something like:

+-------------------+------+
| id                | stub |
+-------------------+------+
| 72157623227190423 | a    |
+-------------------+------+

When I need a new globally unique 64-bit ID I issue the following SQL:

REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
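
From application code that’s one round trip per ID. Here’s a minimal sketch using Python and pymysql (Flickr’s stack is PHP, so this is purely illustrative; the connection details are placeholders):

import pymysql

# Placeholders: point this at your ticket server.
conn = pymysql.connect(host="ticketserver1", user="flickr",
                       password="...", database="tickets",
                       autocommit=True)

def next_id_64():
    # REPLACE deletes the old row and inserts a new one, bumping the
    # auto_increment counter; lastrowid is the freshly minted ticket.
    with conn.cursor() as cur:
        cur.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
        return cur.lastrowid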

SPOFs

You really, really don’t want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume, replicating between the boxes would be problematic, and locking would kill the performance of the site. Instead we divide responsibility between the two boxes by splitting the ID space down the middle, evens and odds, using:

TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2

We round-robin between the two servers to load balance and deal with downtime. The sides do drift a bit out of sync – I think we have a few hundred thousand more odd-numbered objects than even-numbered objects at the moment – but this hurts no one.
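
A sketch of that client-side round robin, with the obvious failover behaviour (again illustrative Python; the hostnames are made up):

import itertools
import pymysql

# One server hands out odd IDs, the other even, so either
# alone still produces globally unique tickets.
SERVERS = itertools.cycle(["ticketserver1", "ticketserver2"])

def next_id_64_ha():
    # Try each server once; if one is down, the other still works.
    for _ in range(2):
        host = next(SERVERS)
        try:
            conn = pymysql.connect(host=host, user="flickr",
                                   password="...", database="tickets",
                                   autocommit=True)
            try:
                with conn.cursor() as cur:
                    cur.execute("REPLACE INTO Tickets64 (stub) VALUES ('a')")
                    return cur.lastrowid
            finally:
                conn.close()
        except pymysql.MySQLError:
            continue  # fail over to the other box
    raise RuntimeError("both ticket servers are down")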

More Sequences

We actually have more tables than just Tickets32 and Tickets64 on the ticket servers. We have sequences for Photos, for Accounts, for OfflineTasks, for Groups, etc. OfflineTasks get their own sequence because we burn through so many of them that we don’t want to unnecessarily run up the counts on other things. Groups and Accounts get their own sequences because we get comparatively so few of them. Photos have their own sequence that we made sure to sync to our old auto-increment table when we cut over, because it’s nice to know how many photos have been uploaded, and we use the ID as a shorthand for keeping track.

So There’s That

It’s not particularly elegant, but it works shockingly well for us – it’s been in production since Friday the 13th of January, 2006 – and it’s a great example of the Flickr engineering “dumbest possible thing that will work” design principle.

More soon.


Using, Abusing and Scaling MySQL at Flickr

I like “NoSQL”. But at Flickr, MySQL is our hammer, and we use it for nearly everything. It’s our federated data store, our key-value store, and our document store. We’ve built an event queue and a job server on top of it, plus a stats feature and a data warehouse.

We’ve spent the last several years abusing, twisting, and generally mis-using MySQL in ways that could only be called “post relational”. Our founding architect is famously in print saying, “Normalization is for sissies.”

So while it’s great to see folks going back to basics – instead of assuming a complex and historically dictated series of interfaces, assuming just disks, RAM, data, and a problem to solve – I think it’s also worth looking a bit harder at what you can do with MySQL. Because frankly, MySQL brings some difficult-to-beat advantages.

  • it is a very well-known component. When you’re scaling a complex app, everything that can go wrong will. Anything that cuts down on your debugging time is gold. All of MySQL’s flags and stats can be a bit overwhelming at times, but they’ve accumulated over time to solve real problems.

  • it’s pretty darn fast and stable. Speed is usually one of the key appeals of the new NoSQL architectures, but MySQL isn’t exactly slow (if you’re doing it right). I’ve seen two large, commercial “NoSQL” services flounder, stall and eventually get rewritten on top of MySQL (and you’ve used services backed by both of them).

Over the next bit I’ll be writing a series of blog posts looking into how Flickr scales MySQL to do all sorts of things it really wasn’t intended for. I can’t promise you these are the best techniques; they are merely our techniques. There are others, but these are ours. They’re in production, and they work. I was tempted to call the series “YesSQL”, but that really doesn’t capture the spirit, so instead I’m calling it “Using and Abusing MySQL”.

And the first article is on ticket servers.