So if we look at the spec, ECMAScript, the standardized version of JavaScript, defines how characters should be interpreted as either UCS-2 or UTF-16. UCS-2 is a two-byte universal character set, whereas UTF-16 is a 16-bit Unicode transformation format. These are two contrasting systems: UCS-2 is always two bytes per character, while UTF-16 uses surrogate pairs, meaning two 16-bit code units, for characters that don't fit in a single unit. Because of that contrast, we get a lot of really weird and interesting JavaScript bugs when dealing with character encoding.
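To make the surrogate pair point concrete, here is a minimal sketch. The emoji and the variable names are just illustrative choices, not something from the spec; the only thing it assumes is an engine that supports ES2015 string features such as `codePointAt`.

```javascript
// U+1F600 (grinning face) sits outside the 16-bit range, so UTF-16
// has to store it as a surrogate pair: two 16-bit code units.
const emoji = '\u{1F600}';

// .length counts 16-bit code units, not visible characters.
const emojiLength = emoji.length;

// The string iterator walks code points, so spreading gives one entry.
const emojiCodePoints = [...emoji].length;

// charCodeAt reads one code unit at a time: the two surrogate halves.
const highSurrogate = emoji.charCodeAt(0).toString(16);
const lowSurrogate = emoji.charCodeAt(1).toString(16);

// codePointAt reads the whole code point starting at that index.
const codePoint = emoji.codePointAt(0).toString(16);

console.log(emojiLength, emojiCodePoints);           // 2 1
console.log(highSurrogate, lowSurrogate, codePoint); // d83d de00 1f600
```

So anything that counts or indexes by `.length` is counting code units, which is where a lot of these bugs start.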
So let's talk about some of the common bugs that we see. The first one is with combining marks. A combining mark is where we take a Unicode code point, here the Latin small letter A, and then add a combining mark code point after it, here the combining ring above. That's how we get the Danish character that has an A with a little circle over it: the A and the little circle are two separate code points that combine to make one visible character. So if we take these two separate code points and console.log them, we can see that what's returned is the A with a little circle above it. Great, but because these are really two separate code points, we can run into some problems.
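The A with the ring can be sketched like this. The variable names are made up for illustration, and the comparison with the single precomposed code point U+00E5 is an extra step added here to show where the trouble comes from.

```javascript
const letterA = '\u0061';   // LATIN SMALL LETTER A
const ringAbove = '\u030A'; // COMBINING RING ABOVE

// Two code points that render as one visible character.
const aWithRing = letterA + ringAbove;
console.log(aWithRing);

// The string still contains two code units.
const aWithRingLength = aWithRing.length;

// Unicode also has a single precomposed code point for the same letter.
const precomposed = '\u00E5'; // LATIN SMALL LETTER A WITH RING ABOVE

// They look identical on screen, but a plain comparison says they differ.
const looksTheSame = aWithRing === precomposed;

console.log(aWithRingLength, precomposed.length, looksTheSame); // 2 1 false
```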
So similar to that A with a circle above it, we have the accented E in café, which we see in Spanish. If we define a variable drink as C-A-F-E followed by a combining mark for the accent, and then console.log that variable, we can see café is returned. But if we check drink.length, the length is five. And if we split that string into an array, we can see that the final index of the array is that combining mark on its own. So, as I'm sure you can imagine, a lot of confusion arises when doing string manipulation with combining marks if you forget that these code point sequences exist. So how can we handle that? We can use the normalize method that's built into JavaScript. String.prototype.normalize returns the Unicode normalization form of the string, which is really handy, straight out of JavaScript. If we take that same variable drink, café with the combining mark, and normalize it, the length is now returned as four, and the final index of the array when split is the E with the accent above it. We don't get that extra index with the combining mark. So this is really great to keep in mind to avoid localization bugs when doing string manipulation.
But it's also really important from the human perspective. I saw a tweet recently that says: found on a US government website, everyone, this is what systemic racism is, it's when folks are excluded. This person, Raquel Velez, has a Spanish last name with an accent over the E, but she can't enter it in the last name field of a US government form, because the combining mark in Unicode is creating encoding issues. So by fixing simple string encoding issues, we can help solve these problems on the human level.
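Going back to the café example, here is roughly what that looks like in code. This is a sketch with illustrative variable names; it relies on `normalize()` defaulting to the NFC form when called with no argument, which composes the E and the accent into one code point.

```javascript
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT.
const drink = 'cafe\u0301';
console.log(drink);

const rawLength = drink.length;  // 5: the accent is its own code unit
const rawParts = drink.split('');
const rawLast = rawParts[rawParts.length - 1]; // the bare combining mark

// normalize() with no argument uses NFC, which composes e + accent.
const normalized = drink.normalize();

const normalizedLength = normalized.length; // 4
const normalizedParts = normalized.split('');
const normalizedLast = normalizedParts[normalizedParts.length - 1];

console.log(rawLength, normalizedLength);   // 5 4
console.log(normalizedLast === '\u00E9');   // true
```

One way to apply this is to normalize user input once, at the point where it enters the system, so every later length check or comparison sees the same form.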
So let's move on to tip number two: wrap all strings in an object for translation. From a really high-level, kind of pseudocode example, we can start with a hard-coded string of hello world. That hard-coded string is what not to do. What we can do instead is take this string and wrap it in an internationalization object, where we have a list of resources for each of the different locales. So here we can see English has an object with a hello message key, and the value for that key is hello world.
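A rough sketch of that idea in plain JavaScript follows. Everything here is hypothetical: the `resources` shape, the `helloMessage` key name, the `t` helper, and the Spanish entry are assumptions made for illustration, not the API of any particular internationalization library.

```javascript
// What not to do: a hard-coded, untranslatable string.
const hardCoded = 'Hello world';

// Hypothetical resource table: one object of messages per locale.
const resources = {
  en: { translation: { helloMessage: 'Hello world' } },
  es: { translation: { helloMessage: 'Hola mundo' } }, // assumed extra locale
};

// Hypothetical lookup helper: falls back to English for an unknown
// locale, and to the key itself when a message is missing.
function t(key, locale = 'en') {
  const table = resources[locale] ?? resources.en;
  return table.translation[key] ?? key;
}

console.log(t('helloMessage'));       // Hello world
console.log(t('helloMessage', 'es')); // Hola mundo
```

The design choice is that the code only ever refers to the key, so adding a locale means adding a resource entry rather than touching every place the string is used.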