With Unicode adding more and more useless emoji, and seemly doing little else, it’s time to ask an important question: what the fuck is the Unicode Consortium supposed to be doing anyway?
It’s time to dust off Howard Oakley’s excellent blog post Why we can’t keep stringing along with Unicode, and think about the Normalization problem for file names and the Glyph Variation problem of CJK font sets. These problems fit together surprisingly well. My take is the problems must be tackled together as one thing to find a solution. Let’s take a look at the essential points that Oakley makes:
Unicode is one of the foundations of digital culture. Without it, the loss of world languages would have accelerated greatly, and humankind would have become the poorer. But if the effect of Unicode is to turn a tower of Babel into a confusion of encodings, it has surely failed to provide a sound encoding system for language.
Neither is normalisation an answer. To perform normalisation sufficient to ensure that users are extremely unlikely to confuse any characters with different codes, a great many string operations would need to go through an even more laborious normalisation process than is performed patchily at present.
Pretending that the problem isn’t significant, or will just quietly go away, is also not an answer, unless you work in a purely English linguistic environment. With increasing use of Unicode around the world, and increasing global use of electronic devices like computers, these problems can only grow in scale…
Having grown the Unicode standard from just over seven thousand characters in twenty-four scripts, in Unicode 1.0.0 of 1991, to more than an eighth of a million characters in 135 scripts now (Unicode 9.0), it is time for the Unicode Consortium to map indistiguishable characters to the same encodings, so that each visually distinguishable character is represented by one, and only one, encoding.
The Normalization Problem and the Gylph Variation Problem
As Oakley explains earlier in the post: the problem for file system naming boils down to the fact that Unicode represents many visually-identical characters using different encodings. Older file systems like HFS+ used Normalization to resolve the problem, but it is incomplete and inefficient. Modern file systems like APFS avoid Normalization to improve performance.
Glyph variations are the other side of the coin. Instead of identical looking characters using different encodings, we have different looking characters that are variations of the same ‘glyph’. They have the same encoding but they have to be distinguished as variation 1, 2, 3, etc. of the parent glyph. Because this is CJK problem, western software developers traditionally see it as a separate problem for the OpenType partners to solve and not worth considering.
Put another way there needs to be an unambiguous 1-to-1 mapping and an unambiguous 1-1/1-2/1-3-to-1 mapping. I say the problems are two sides of the same coin and must be solved together. Unicode has done a good job of mapping things but it is way past time for Unicode to evolve beyond that and tackle bigger things: lose the western centric problem solving worldview (i.e. let’s fix western encoding issues first and deal with CJK issues later), and start solving problems from a truly globally viewpoint.