A Chinese puzzle: Unicode and EXIF metadata parsing

At flickr, there are quite a few photos and you can browse the site in 8 different languages, including Korean and Chinese. Common metadata such as title, description and tags can be pre-populated based on information contained in the image itself, using what is commonly called EXIF information. So, it makes sense to implement this with respect to language and, above all, alphabet/character encodings.

Well, what made sense did not make so much sense anymore when using existing specifications. Here is how we coped with them.

The standards

To begin with, there is not just EXIF. Metadata can actually be written within a picture file in at least 3 different formats:
– EXIF itself.
– IPTC-IIM headers.
– XMP by Adobe.

Of course, these formats are neither mutually exclusive nor completely redundant and this is where this can get tricky.

But let’s step back a moment to describe the specifics of these formats, not in details, but with regards to our need, which is to extract information in a reliable way, independently from how/when/where the image was created.

The EXIF is the oldest form

Hence, with the most limitations yet the most widespread, as this is often the case in our industry. Thus, we need to deal with it, even though it is radically flawed from an internationalization stand point: text fields are not stored using UTF and most of the time there is no indication of the character set encoding.

IPTC-IIM

In its later versions, it added the optional (Grr!!!) support for a field indicating the encoding of all the string properties in the container.

XMP

At last, text from XMP format is stored in UTF-8. On a side note, the XML based openness of this format is not making things easier, for each application can come up with its own set of metadata. Nevertheless, from an internationalization and structural standpoint, this format is modernly adequate: finally!

So, now that we know what hides behind each of these standards we can start tackling our problem.

A solution ?

Rely on existing libraries, when performant

For flickr Desktop client (Windows, Mac), we are using Exiv2 Image metadata library, which helps the initial reconciliation between all fields (especially with EXIF and IPTC-IIM contained within XMP).

The “guesstimation” of character set from EXIF

We first scan the string to see if all bytes are in the range: 0 to 127. If so, we treat the string as ASCII. If not, we scan the string to see if it is consistent with valid UTF-8. If so, we treat the string as UTF-8. Checking against UTF8 validity is not a bullet-proof test. But statistically, this is better than any other scenario. At the last resort, we pick a “reasonable” fallback encoding. For the desktop application, we use hints from the user system. On windows, the GetLocaleInfo function provides with the user default ANSI code page (LOCALE_IDEFAULTANSICODEPAGE), which can be used to specify an appropriate locale for string conversion methods. On Mac OS X, CFStringGetSystemEncoding is our ally. In our case, there is no point to use the characterset of our own application, which we control and is not linked to the characterset of the application that “wrote” the EXIF.

Consolidation: the easy case

The workflow followed by the image we handle is unknown. We can have all 3 formats filled-in but not necessarily by the same application. So, we need a mechanism to consolidate the data. The easy case is for single field such as title and description. We follow the obvious quality hierarchy of the different formats. To extract the description for instance, we first look for the XMP.dc.description field, then XMP.photoshop.Headline (to support the extensibility mentioned before as a side note), then IPTC.Application2.Caption and finally the Exif.Image.ImageDescription. We only keep the first data found and ignore the others. There is only one title and one description per image: might as well take the one we are sure about.

Consolidation: it becomes even trickier for tags.

The singleness of the final result disappearing (we deal with a list of tags, not just one single tag), we cannot ignore the “EXIF” tags as easily as for the title and the description case. Fortunately, IPTC Keywords are supposed to be mapped to XMP (dc:subject). Therefore we can take into account the number of keywords that would be extracted from EXIF and the number that would be extracted from XMP. If those equal, we plainly ignore the EXIF. If they don’t, we try to match each guestimation of EXIF keyword against the XMP keywords to avoid duplicates.

All in one, quite an interesting issue where, per design, the final solution is going to be an approximation with different results depending on the context. Computers: an exact science?

For more information and main reference: http://www.metadataworkinggroup.com/

Photos by Groume, Raïssa Bandou, seanmcgrath, tcp909, aftab., 姒儿喵喵 and Laughing Squid