A Chinese puzzle: Unicode and EXIF metadata parsing


At Flickr, there are quite a few photos, and you can browse the site in eight different languages, including Korean and Chinese. Common metadata such as the title, description and tags can be pre-populated from information contained in the image itself, using what is commonly called EXIF information. So, it makes sense to implement this with proper handling of language and, above all, of alphabets and character encodings.


Well, what made sense did not make so much sense anymore once we looked at the existing specifications. Here is how we coped with them.

The standards

To begin with, there is not just EXIF. Metadata can actually be written within a picture file in at least three different formats:

  • EXIF itself
  • IPTC-IIM headers
  • XMP, by Adobe

Of course, these formats are neither mutually exclusive nor completely redundant, and this is where things can get tricky.

But let’s step back a moment to describe the specifics of these formats, not in detail but with regard to our need, which is to extract information in a reliable way, independently of how, when and where the image was created.

EXIF is the oldest format

Hence it comes with the most limitations, yet it is the most widespread, as is often the case in our industry. Thus, we need to deal with it, even though it is radically flawed from an internationalization standpoint: text fields are not stored in a Unicode encoding and, most of the time, there is no indication of the character set used.

IPTC-IIM

In its later versions, IPTC-IIM added optional (Grr!!!) support for a field indicating the encoding of all the string properties in the container.
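
As a hedged sketch (not the client’s actual code), honouring that field might look like the snippet below; the decode_iptc_string helper and its inputs are hypothetical, and the escape sequence is the ISO 2022 value the IPTC-IIM spec uses to declare UTF-8 (the field appears as Iptc.Envelope.CharacterSet in Exiv2’s naming):

```python
UTF8_ESCAPE = b"\x1b%G"  # ISO 2022 escape sequence declaring UTF-8 (IPTC record 1:90)

def decode_iptc_string(raw, declared_charset):
    """raw: bytes of an IPTC string field; declared_charset: the 1:90 value, or None."""
    if declared_charset == UTF8_ESCAPE:
        return raw.decode("utf-8", errors="replace")
    # Field absent or unrecognised: fall back to the guessing heuristic
    # we describe below for EXIF.
    return None
```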

XMP

At last, a format whose text is stored in UTF-8. On a side note, the XML-based openness of this format does not make things easier, for each application can come up with its own set of metadata. Nevertheless, from an internationalization and structural standpoint, this format is adequate for the modern world: finally!

So, now that we know what hides behind each of these standards we can start tackling our problem.

A solution?

Rely on existing libraries, when they perform well

For the Flickr desktop client (Windows, Mac), we are using the Exiv2 image metadata library, which helps with the initial reconciliation between all of these fields (especially with EXIF and IPTC-IIM contained within XMP).

The “guesstimation” of character set from EXIF

We first scan the string to see whether all bytes are in the range 0 to 127. If so, we treat the string as ASCII. If not, we check whether the string is valid UTF-8; if so, we treat it as UTF-8. Checking UTF-8 validity is not a bullet-proof test, but statistically it is better than any other scenario. As a last resort, we pick a “reasonable” fallback encoding.

For the desktop application, we use hints from the user’s system. On Windows, the GetLocaleInfo function provides the user’s default ANSI code page (LOCALE_IDEFAULTANSICODEPAGE), which can be used to choose an appropriate locale for string conversion. On Mac OS X, CFStringGetSystemEncoding is our ally. There is no point in using the character set of our own application, which we control: it is not linked to the character set of the application that “wrote” the EXIF.
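
For illustration, here is a minimal Python sketch of that heuristic; the desktop client relies on the platform APIs mentioned above, so locale.getpreferredencoding merely stands in for them here:

```python
import locale

def guess_exif_text(raw):
    """Best-effort decoding of an EXIF text field (raw bytes, no declared encoding)."""
    # 1. Every byte in 0-127: treat it as plain ASCII.
    if all(b < 128 for b in raw):
        return raw.decode("ascii")
    # 2. Valid UTF-8: treat it as UTF-8. Not bullet-proof, but statistically
    #    better than any other guess.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 3. Last resort: a "reasonable" fallback taken from the user's system,
    #    standing in for LOCALE_IDEFAULTANSICODEPAGE on Windows or
    #    CFStringGetSystemEncoding on Mac OS X.
    return raw.decode(locale.getpreferredencoding(False), errors="replace")
```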

Consolidation: the easy case

The workflow followed by the images we handle is unknown: all three formats may be filled in, but not necessarily by the same application. So we need a mechanism to consolidate the data. The easy case is a single-valued field such as the title or the description, where we follow the obvious quality hierarchy of the formats. To extract the description, for instance, we first look for the XMP.dc.description field, then XMP.photoshop.Headline (to support the extensibility mentioned earlier as a side note), then IPTC.Application2.Caption and finally Exif.Image.ImageDescription. We keep the first value found and ignore the others. There is only one title and one description per image: we might as well take the one we are most sure about.
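
A sketch of this first-found-wins logic, assuming a hypothetical fields dict of already-decoded strings keyed by the names above:

```python
DESCRIPTION_PRIORITY = [
    "XMP.dc.description",
    "XMP.photoshop.Headline",
    "IPTC.Application2.Caption",
    "Exif.Image.ImageDescription",
]

def consolidate_description(fields):
    """fields: hypothetical dict mapping field names to already-decoded strings."""
    for key in DESCRIPTION_PRIORITY:
        value = fields.get(key)
        if value:
            return value  # first (most trusted) hit wins; the rest are ignored
    return None
```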

Consolidation: it becomes even trickier for tags.

Since the final result is no longer a single value (we deal with a list of tags, not just one tag), we cannot ignore the “EXIF” tags as easily as in the title and description case. Fortunately, IPTC keywords are supposed to be mapped to XMP (dc:subject). We can therefore compare the number of keywords that would be extracted from EXIF with the number that would be extracted from XMP. If they are equal, we plainly ignore the EXIF. If they differ, we match each guesstimated EXIF keyword against the XMP keywords to avoid duplicates.
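
A sketch of that keyword consolidation, with hypothetical inputs (the guess-decoded EXIF/IPTC keywords on one side, the UTF-8 dc:subject values on the other):

```python
def consolidate_tags(exif_keywords, xmp_keywords):
    """exif_keywords: guess-decoded EXIF/IPTC keywords; xmp_keywords: UTF-8 dc:subject values."""
    # Same count: assume the IPTC keywords were mirrored into dc:subject
    # and trust the XMP copy outright.
    if len(exif_keywords) == len(xmp_keywords):
        return list(xmp_keywords)
    # Otherwise merge, matching each guesstimated keyword against the
    # XMP list to avoid duplicates.
    tags = list(xmp_keywords)
    for keyword in exif_keywords:
        if keyword not in tags:
            tags.append(keyword)
    return tags
```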
 
All in all, quite an interesting problem where, by design, the final solution is an approximation whose results depend on the context. Computers: an exact science?

For more information, see the main reference: http://www.metadataworkinggroup.com/

Photos by Groume, Raïssa Bandou, seanmcgrath, tcp909, aftab., 姒儿喵喵 and Laughing Squid

Language Detection: A Witch’s Brew?

photo by Arbron

Software developers sometimes have a tendency to cling to bits of ancient lore, even after they’ve long since become irrelevant. Globalization, and specifically language detection, raises an interesting case in point.

These days, language detection is really simple in most cases – just use the “Accept-Language” HTTP header, which pretty much every client on the planet passes to you.

“But wait!” I hear you cry. “I thought the Accept-Language header wasn’t to be trusted?”

“Ancient and outdated lore, my friend” I will reply. “Ancient and outdated lore.”

It’s true that the Accept-Language header has a troubled history. Because of this, many developers regard it the way medieval villagers might have regarded a woman with a warty nose and a pet cat – it should be shunned, avoided and possibly burned at the stake.

In the early days of the web, Accept-Language really couldn’t be trusted – many browsers made no real effort to send a meaningful value, often sending nothing (or worse, defaulting to requesting English) unless the user worked their way through a complicated configuration process buried 3 panes back in a labyrinthine dialog box.

I’m rarely comfortable being categorical when talking about software engineering – there are usually too many variables, and too many specific circumstances for blanket assertions to be of use. But I can state with certainty that the “Accept-Language” header works these days, and any case where it doesn’t can (and will) be considered an error on the part of the browser vendor.

In two and a half years of running as an international site, we’ve only ever had one case where it didn’t work. Helio, a cellphone company, had a browser which was custom-built for them in Korea, and had its “Accept-Language” header hard-coded to always request Korean, something which led to much confusion for the Flickr users amongst their American customers.

When we alerted Helio to the problem, however, they were highly responsive – the next release of their software returned the correct value.
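
As for the mechanics, acting on the header takes only a few lines. Here is a rough Python sketch, not Flickr’s actual code: the SUPPORTED set and DEFAULT fallback are hypothetical placeholders for your own configuration, and the locale list is merely illustrative.

```python
SUPPORTED = {"en-us", "fr-fr", "de-de", "es-es", "it-it", "ko-kr", "pt-br", "zh-hk"}
DEFAULT = "en-us"

def pick_language(accept_language):
    """Parse an Accept-Language header and return the best supported locale."""
    weighted = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, params = piece.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        weighted.append((q, lang.strip().lower()))
    # Highest quality value first; an exact match wins, then a primary-subtag match.
    for _, lang in sorted(weighted, key=lambda pair: pair[0], reverse=True):
        if lang in SUPPORTED:
            return lang
        primary = lang.split("-")[0]
        for candidate in sorted(SUPPORTED):
            if candidate.split("-")[0] == primary:
                return candidate
    return DEFAULT
```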

Location != Language

Perhaps driven by their superstitious fear of Accept-Language, many developers fall back on other means of determining their visitors’ preferred language. Top of the list is IP-detection – pinpointing the visitor’s current location using a lookup database. This isn’t necessarily a terrible solution, but it’s prone to a couple of problems.

Firstly, even the best IP location databases contain mistakes, or outdated information where netblocks have been reassigned over time.

Secondly, if you rely on IP detection alone to choose a language, travelers will often be highly confused when they reach their destination, connect to your site and are greeted with a language they don’t speak. This can be especially disconcerting if they’ve been using the site regularly for years. The error is easily avoided, simply by setting a cookie the first time you pick a language for a user. With a cookie, you can guarantee a consistent experience, regardless of where in the world a laptop happens to fly.

None of this is to say that IP detection doesn’t have its place. It’s a useful fallback if you come across a user-agent that doesn’t send Accept-Language, or you have a specific case where you believe you can’t trust the header. But in general it should be just that – a fallback.
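
Putting the pieces together, here is a hedged sketch of that order of precedence, reusing SUPPORTED, DEFAULT and pick_language() from the sketch above; the Flask-style request object and the language_for_ip() helper are hypothetical stand-ins.

```python
def detect_language(request):
    # 1. A previously set cookie wins: the experience stays consistent
    #    no matter where in the world the laptop happens to fly.
    lang = request.cookies.get("lang")
    if lang in SUPPORTED:
        return lang
    # 2. Otherwise, trust what the browser asks for.
    header = request.headers.get("Accept-Language")
    if header:
        return pick_language(header)
    # 3. IP-based lookup only as a last resort.
    return language_for_ip(request.remote_addr) or DEFAULT
```

Whichever branch fires, write the result back as a cookie in the response so the next visit goes straight to step 1.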

Always an option…

To guard against any possibility that you’ll detect the wrong language for a user, it doesn’t hurt to always provide language-switching as an option on every page. You’ll see that we do that on Flickr (with the exception of Organizr)…

You’ll notice that we always render the language in its native name, so that people can find their preferred language, even if they understand nothing else on the page.
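
The switcher itself is little more than a mapping from locale code to the language’s own name for itself; the subset below is illustrative rather than Flickr’s exact list.

```python
# Always label each language in its own script, so a lost visitor can still find it.
NATIVE_NAMES = {
    "de-de": "Deutsch",
    "en-us": "English",
    "es-es": "Español",
    "fr-fr": "Français",
    "it-it": "Italiano",
    "ko-kr": "한국어",
    "pt-br": "Português",
    "zh-hk": "繁體中文",
}
```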

Rights and Wrongs

“But!” some of you will cry. “We are a site that deals with media content. There are rights issues and legal constraints – we have to send people in France to our French site”…

…which nicely brings me to the last point of this post. Many global sites will find themselves dealing with various jurisdictional concerns. There’s no reason, however, that the “legal” logic in your code needs to be tied to the interface presentation.

Whilst certain legal texts (Terms and Conditions, etc.) can provide interesting challenges, if you keep presentation as an entirely separate consideration from jurisdiction, it’s much easier to provide everyone with the best possible experience, regardless of where they are and what language they speak.

In Summary

All the above can be reduced to a very simple set of points to cut-out-and-keep for the next time you’re asked to create a Global version of a website.

  • Use Accept-Language – it just works
  • Use IP detection as a fallback for language, or a separate test to determine jurisdiction
  • Cookie your visitors’ language preferences, so you are consistent
  • Provide a simple way to switch language, everywhere
  • Treat language and compliance issues as two entirely separate problems

…at least, that’s how we’ve been doing it on Flickr for two and a half years, and it’s worked out pretty well for us so far.