Benjamin Esham

Unicode’s encoding of national flags is just crazy enough to work

Version 6.0 of the Unicode standard, released in October 2010, added support for emoji. Aside from the classics like 😃 (SMILING FACE WITH OPEN MOUTH), 👍 (THUMBS UP SIGN), and 💩 (PILE OF POO),1 the standard also included several national flags like these:

🇺🇸 🇩🇪 🇬🇧 🇯🇵 🇮🇹

In fact, the standard included every national flag, and in a way that won’t require the standard to be changed when new countries come into being. How did the Unicode Consortium pull this off?

What they did is both crazy and genius. Instead of assigning a codepoint to each flag, which is the obvious way to do it (and the way the rest of the emoji are encoded), the standard defines twenty-six “regional indicator symbols”, from U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A to U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z. In order to include a country’s flag in your text, you first look up the country’s two-letter ISO 3166-1 code and then write the two regional indicator symbols corresponding to those letters. A font with support for that flag treats the two-codepoint sequence as a ligature, replacing the combination with a single pictogram.
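To make the lookup-and-combine step concrete, here is a minimal Python sketch. The flag helper is my own illustration, not anything defined by Unicode or the standard library; it just maps an ISO 3166-1 alpha-2 code onto the corresponding pair of regional indicator symbols.

    # A sketch, not library code: map an ISO 3166-1 alpha-2 code to the pair
    # of regional indicator symbols that a flag-aware font will ligate.
    def flag(country_code):
        # REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6 and uppercase 'A' is
        # U+0041, so every letter is shifted by the same constant offset.
        offset = 0x1F1E6 - ord("A")
        return "".join(chr(ord(letter) + offset) for letter in country_code.upper())

    print(flag("us"))  # 🇺🇸 (rendered as a flag only if your font supports it)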

Let’s take the United States as an example. Its ISO 3166-1 two-letter code is “us”, so we need to use the codepoints U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U and U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S. Combining these gives a symbol that renders in your browser as 🇺🇸.
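In Python, for instance, you can spell out those two codepoints as escape sequences and inspect the resulting string; this is just a sketch, and the variable name is mine.

    # The United States flag, written as explicit codepoint escapes.
    us_flag = "\U0001F1FA\U0001F1F8"       # REGIONAL INDICATOR SYMBOLS U and S
    print(us_flag)                         # 🇺🇸
    print([hex(ord(c)) for c in us_flag])  # ['0x1f1fa', '0x1f1f8']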

Note well, though, that it’s entirely up to the font designer to decide which flags will be supported. As of this writing, Canada’s flag, which would be encoded as U+1F1E8 U+1F1E6 (for “ca”), is not included in any font available on my computer. (In fact, only ten countries’ flags are available: Japan, South Korea, Germany, China, the United States, France, Spain, Italy, Russia, and the United Kingdom.) Trying to include an unsupported flag in your text gives you some ugly placeholder instead, like “🇨🇦” for Canada.

This encoding scheme seems a little wacky at first, but it lets the Unicode Consortium completely avoid the issue of who gets to be a country and who doesn’t. Some ISO committee is responsible for assigning the two-letter codes, and the type foundry is responsible for drawing the flag and actually including it in the font. If your brand-new nation — or Canada, I guess — doesn’t get its own twee icon, that’s not Unicode’s fault.

For developers, this encoding scheme is yet another reminder that bytes in a string and glyphs on screen are two completely different animals. Take our United States example again. It uses two codepoints, U+1F1FA U+1F1F8, which in UTF-8 would be encoded as

F0 9F 87 BA F0 9F 87 B8

Eight bytes for something that’s rendered as a single character!
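A short, self-contained Python check (again, only a sketch) shows the same arithmetic: two codepoints, eight UTF-8 bytes, and just one glyph once a flag-aware font ligates them.

    # Two codepoints, eight UTF-8 bytes, one glyph on screen.
    us_flag = "\U0001F1FA\U0001F1F8"
    utf8 = us_flag.encode("utf-8")
    print(" ".join(f"{b:02X}" for b in utf8))  # F0 9F 87 BA F0 9F 87 B8
    print(len(utf8))                           # 8 bytes
    print(len(us_flag))                        # 2 codepoints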

  1. Unicode 6.0 also specified that every blog article dealing with emoji must use the pile of poo as an example. ↩︎