phil@dler

Unicode: Addressing the ith character

Overview

I think about unicode a lot. I will be the first to concede that it has issues. Boy does it ever have issues, and I will talk about them at length to anyone who happens to ask me. That article is coming, but it is not this article.

But I also think it’s the least bad alternative we’ve had so far. And I think the oftentimes religious way in which I see many software engineers swear that unicode is the One True Way speaks very strongly to how bad things were prior to the widespread use of unicode.

I should clarify for the pedants out there that yes, when I refer to unicode, strictly speaking I am referring to UTF-8, UTF-16 and UTF-32.

An exchange I have seen a lot when I have been falling down the rabbit hole on the subject often goes something like this:

Programmer 1: “I don’t like unicode, I cannot find the ith character quickly”

Programmer 2: “Unicode is the One True Way, you don’t really need to randomly access strings like that.”

Now, however right Programmer 2 is, this conversational tack really gets under my skin as a piece of presumptive rhetoric. Moreover, it doesn’t actually help anyone learn how to use unicode in their particular situation in a way that might actually let them access the elusive ith character.

This is only more annoying if you include it as the response to this request in your programming language’s how-to documentation. Mentioning no names, of course.

Accessing the ith code point

By now, I assume that if you’re interested in this subject matter, you’re probably well informed enough about the unicode encodings to know why you can’t simply index into an arbitrary string of utf8 bytes in (low level) programming languages and expect that something like:

char x = string[i];

will result in char x containing the ith character.

What is true is that with utf32, if you have an array of those, you can do something like:

uint32 x = u32_string[i];

and expect x to hold the ith code point. (As an aside: I actually started the first draft of this article with a whole fancy scheme using a vector of 8-bit chars to permit the direct indexing of a code point, before twigging that this was a net loss of memory space when compared with simply using utf-32.)
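For concreteness, here is a minimal sketch of unpacking a utf-8 string into just such an array of code points, so that out[i] really is the ith code point. The function names and the ‘assume well-formed input’ shortcut are mine, not anything unicode prescribes:

#include <stddef.h>
#include <stdint.h>

/* Decode the utf-8 sequence at p into *cp and return its length in bytes
 * (1-4). Deliberately minimal: assumes the input is well formed, and maps
 * anything it does not recognise to U+FFFD. */
size_t utf8_next(const unsigned char* p, uint32_t* cp)
{
        if (p[0] < 0x80)           { *cp = p[0]; return 1; }
        if ((p[0] & 0xE0) == 0xC0) { *cp = ((uint32_t)(p[0] & 0x1F) << 6)  |  (p[1] & 0x3F); return 2; }
        if ((p[0] & 0xF0) == 0xE0) { *cp = ((uint32_t)(p[0] & 0x0F) << 12) | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F); return 3; }
        if ((p[0] & 0xF8) == 0xF0) { *cp = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
                                           ((uint32_t)(p[2] & 0x3F) << 6)  |  (p[3] & 0x3F); return 4; }
        *cp = 0xFFFD;   /* replacement character for malformed input */
        return 1;
}

/* Unpack a whole utf-8 string of nbytes into out[]. Returns the number of
 * code points written, or (size_t)-1 if out[] is too small. */
size_t utf8_to_codepoints(const char* s, size_t nbytes, uint32_t* out, size_t out_cap)
{
        size_t i = 0, n = 0;
        while (i < nbytes) {
                uint32_t cp;
                size_t len = utf8_next((const unsigned char*)s + i, &cp);
                if (n == out_cap) return (size_t)-1;
                out[n++] = cp;
                i += len;
        }
        return n;
}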

Note that this is not the same as the ith character. And this is where unicode becomes utterly maddening to work with, because a character is a distinct concept from a codepoint. It is also very loosely defined.

There are two major classes of codepoints that cause headaches. The first is codepoints that don’t have a glyph behaviour of their own; examples include the left-to-right and right-to-left directional markers, and the ‘combining grapheme joiner’ (whose role is, incidentally, the opposite of what one might presuppose based on the name). The second is the combining codepoints, such as the combining diacritics, which attach themselves to the glyph of a neighbouring codepoint rather than standing alone.

Nevertheless, we’ve managed to get as far as addressing the ith code point. Now what about the ith character?

The ith Character

The problem with the ith character is really that there isn’t a good definition of what we mean by character. We could argue for a character being a glyph - but there are edge cases even there. Hebrew vowels appear as diacritics when written out in full; are they full glyphs in their own right? Or is the glyph the combination of the consonant and the vowel diacritic? And is that a character?

This logic leads to surprising places in European languages very quickly: the ess-zett in German (ß) capitalises in unicode to SS. We possibly ought to consider quite seriously whether that SS is a single character, rather than a combination of two S characters.

This is to say nought of the ‘ough’ group of characters in English, which presents with many different sounds in different words; is each sound of ‘ough’ a different character with an abnormally long glyph?

Unicode remains somewhat irksomely aloof to these questions, and doesn’t really answer any of them. Or indeed, many of the other questions that keep me awake when I’m thinking about unicode.

The ith Glyph

The closest I have gotten, though, is the ith glyph.

Given that we know which characters are combining characters, or have no glyph, it should be possible (provided we have enough memory for a character table) to define an array of elements thus:


struct glyph_ref {
        char* start;                               /* first byte of this glyph in the underlying utf-8 string */
        size_t len;                                /* number of bytes the glyph occupies */
        (some_type)* local_property_flags;         /* placeholder: properties intrinsic to the glyph */
        (some_second_type)* global_property_flags; /* placeholder: properties conferred by surrounding code points */
};

struct glyph_ref* glyph_string;                    /* one entry per glyph, in order */

This then gives us an array in which we can directly reference, and indeed reorder, glyphs without shuffling the innards of the underlying byte string.
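To make that concrete, here is a rough sketch of how such an array might be populated, reusing utf8_next() from the earlier sketch. Here extends_glyph() is a toy stand-in for whatever character table the program actually keeps, and covers only a couple of well-known ranges:

#include <stddef.h>
#include <stdint.h>

size_t utf8_next(const unsigned char* p, uint32_t* cp);   /* from the earlier sketch */

/* Toy classifier: treats only U+0300..U+036F (the combining diacritical marks)
 * and the LRM/RLM/CGJ code points as "extends the previous glyph". A real
 * character table would cover far more ranges than this. */
static int extends_glyph(uint32_t cp)
{
        return (cp >= 0x0300 && cp <= 0x036F)    /* combining diacriticals */
            || cp == 0x200E || cp == 0x200F      /* left-to-right / right-to-left marks */
            || cp == 0x034F;                     /* combining grapheme joiner */
}

/* Walk the utf-8 string, starting a new glyph_ref for each glyph-bearing code
 * point and folding combining or glyph-less code points into the previous one.
 * Returns the number of glyphs, or (size_t)-1 if out[] is too small. */
size_t build_glyph_string(const char* s, size_t nbytes,
                          struct glyph_ref* out, size_t out_cap)
{
        size_t i = 0, n = 0;
        while (i < nbytes) {
                uint32_t cp;
                size_t len = utf8_next((const unsigned char*)s + i, &cp);
                if (n > 0 && extends_glyph(cp)) {
                        out[n - 1].len += len;             /* attach to the previous glyph */
                } else {
                        if (n == out_cap) return (size_t)-1;
                        out[n].start = (char*)(s + i);
                        out[n].len = len;
                        out[n].local_property_flags = NULL;    /* to be filled in from the character table */
                        out[n].global_property_flags = NULL;
                        n++;
                }
                i += len;
        }
        return n;        /* out[i] now refers to the ith glyph */
}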

A counter-argument I hear a lot from programmers is that having that character table in RAM is inelegant. At the moment, the highest code point in use is 0x10FFFF. Assuming you absolutely insist on having a character table in which each character’s data is addressable by its code point ID, then you are confessedly getting on for >100MB without breaking too much of a sweat. But still, if that fast look-up is that vital for your program, it is within reach of most modern machines with room to spare.
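As a back-of-the-envelope check (the per-entry size here is an assumption of mine, not a figure from the standard):

#include <stddef.h>

/* 0x10FFFF + 1 = 0x110000 = 1,114,112 possible code points.
 * At an assumed ~96 bytes of property data per entry:
 *     1,114,112 * 96 = 106,954,752 bytes, i.e. roughly 102 MiB. */
enum { MAX_CODE_POINT = 0x10FFFF, ASSUMED_BYTES_PER_ENTRY = 96 };
static const size_t full_table_bytes =
        (size_t)(MAX_CODE_POINT + 1) * ASSUMED_BYTES_PER_ENTRY;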

For the sake of brevity, I’m not going to discuss comparison, because that gets into normalisation. Suffice it to say that, because of the presence of both combining diacritics and code points representing precomposed glyphs, there is more than one way to place an accent above an e in unicode (among so many other things), and so you have to normalise the strings as part of your input sequence to be able to perform such comparisons.
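To illustrate just that one wrinkle (this is only the two byte sequences, not a normalisation routine):

#include <stdio.h>
#include <string.h>

/* Both of these display as "é":
 *   precomposed:  U+00E9         -> 0xC3 0xA9 in utf-8
 *   decomposed:   U+0065 U+0301  -> 0x65 0xCC 0x81 in utf-8
 * A naive byte comparison treats them as different strings, hence the need
 * to normalise before comparing. */
int main(void)
{
        const char* precomposed = "\xC3\xA9";
        const char* decomposed  = "e\xCC\x81";
        printf("byte-equal? %s\n",
               strcmp(precomposed, decomposed) == 0 ? "yes" : "no");   /* prints "no" */
        return 0;
}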

Due consideration should also be given to the fact that some code points (such as the left-to-right and right-to-left markers) confer properties on code points before and after them in the utf-encoded string; this is why I have placed the somewhat unconvincing (some_type) placeholders for the representation of those properties which have been conferred on the glyphs, or are intrinsic to them. The fact that I haven’t specified exactly how those flags are stored leads me neatly to the next section.

I’m Working in or am a Constrained System

No, that heading is not a typo; I say this as a person who is often writing code solo. I don’t have the mental resources to cope with the whole of unicode; it is at once both too large and too complex for me to get right (although, it currently consumes surprisingly little of its 32-bit address space). Moreover, my programs rarely need to be able to handle all of the potential edge cases, and where they do, I am frequently better off simply refusing to handle strings which represent them than I am handling them properly. Even large and wealthy organisations like Apple have struggled with these problems.

This applies doubly to those brave souls who work in constrained systems such as embedded hardware, and so don’t have the luxury of (easily hundreds of) megabytes of RAM to unpack utf32 encoded strings into, or the resources to store large property tables of which characters count as combining diacritics.

But here’s the rub, and it’s the logic that underpins this whole post; you really don’t have to handle all of unicode, and no one can make you. More exactingly, you don’t ultimately have to use unicode internally to your program at all, so long as it can communicate properly with the systems that provide and consume its data.

We only have to permit the use of the ranges which are pertinent to the programs we are running. And if you think I am committing some kind of cardinal sin with that sentence, consider that we have been defining acceptable inputs and syntaxes for programs for about as long as they have existed. Why then, would limiting the space of unicode that our program is going to tolerate be any different?

I’ll actually take this a step further and say that for some uses, we absolutely should be limiting ourselves to subsets of unicode; there have already been a number of security threats identified from typographically similar codepoints which are encoded differently being used for credential stealing and similar attacks.

Thus, if you want to internally represent characters using one byte and only use 255 hand-chosen characters from the unicode table, that is your prerogative. Likewise if you want to constrain your program to only those characters or glyphs that can be represented by the 16-bit subset of unicode. Provided your software does something sensible when presented with data that it declines to process, and can map its desired inputs in and out to be communicable with the rest of the world, I don’t see the problem.
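For what it’s worth, here is a minimal sketch of that one-byte approach; the chosen code points are purely illustrative and the names are mine:

#include <stddef.h>
#include <stdint.h>

#define INTERNAL_INVALID 0xFF   /* sentinel for "we decline to process this" */

/* The hand-chosen subset: each accepted code point maps to its index. */
static const uint32_t accepted[] = {
        ' ', '.', ',',
        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
        'A', 'B', 'C',          /* ...and so on, for whatever the program needs */
        0x00E9,                 /* é */
        0x00FC,                 /* ü */
};

/* Map a code point to its one-byte internal code, or INTERNAL_INVALID if it
 * falls outside the subset this program has agreed to handle. */
static uint8_t to_internal(uint32_t cp)
{
        for (size_t i = 0; i < sizeof accepted / sizeof accepted[0]; i++)
                if (accepted[i] == cp)
                        return (uint8_t)i;
        return INTERNAL_INVALID;
}

/* Map back out to a code point when communicating with the outside world. */
static uint32_t from_internal(uint8_t b)
{
        return (b < sizeof accepted / sizeof accepted[0]) ? accepted[b] : 0xFFFD;
}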