A code point is a number assigned to a single character. Here's an example of a code point: U+0041, an atomic unit of information. In a code point, 0041 is a hexadecimal number, and the prefix U+ stands for Unicode, because each code point number is given a meaning by the Unicode standard. The Unicode character set maps each abstract character in the world to a unique number.
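To make the U+0041 example concrete, here is a minimal Python sketch. It assumes Python 3 and its standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the U+XXXX label and the Unicode name for one character."""
    code_point = ord(ch)  # character -> code point (an integer)
    # Pad to at least four hex digits, following the U+ convention.
    return f"U+{code_point:04X} {unicodedata.name(ch)}"


print(describe("A"))   # U+0041 LATIN CAPITAL LETTER A
print(chr(0x0041))     # code point -> character: A
```

`ord` and `chr` are inverses of each other, which mirrors the one-number-per-character idea.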
So we look up U+0041 in Unicode and we get LATIN CAPITAL LETTER A. The Unicode code space currently has room for over a million code points (1,114,112 in total, from U+0000 to U+10FFFF), although only a portion of them are assigned to characters so far. Each character maps one-to-one to a code point, which ensures there's no collision between the alphabets of different languages. The third point to keep in mind is a plane. Long story short, Unicode divides those million-plus code points into 17 planes, or groupings, of 65,536 code points each. These planes are represented here.
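As a rough sketch of how the planes fall out of the numbers, assuming Python 3: each plane holds 0x10000 code points, so the plane index is just the code point shifted right by 16 bits. The names `PLANE_SIZE`, `MAX_CODE_POINT`, and `plane_of` are illustrative, not part of any library.

```python
PLANE_SIZE = 0x10000         # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF    # last code point in the Unicode code space


def plane_of(ch: str) -> int:
    """Return the plane number (0 through 16) that a character lives in."""
    return ord(ch) >> 16     # same as ord(ch) // PLANE_SIZE


# 1,114,112 code points split into planes of 65,536 gives 17 planes.
print((MAX_CODE_POINT + 1) // PLANE_SIZE)  # 17
print(plane_of("A"))                       # 0
print(plane_of("\U0001F600"))              # 1
print(plane_of(chr(MAX_CODE_POINT)))       # 16
```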
The first plane, plane 0, is the Basic Multilingual Plane, also known as the BMP. It's the unification of the prior character sets, so it includes ASCII as well as Chinese, Japanese, and Korean characters. This is what the BMP looks like, and I think it's fascinating to see the breakdown of the different scripts it includes. You can see the East Asian scripts, and the Chinese, Japanese, and Korean characters take up a lot of the code points. Code points in the BMP are written with four hexadecimal digits. Outside the BMP, code points in planes 1 through 15 take five hexadecimal digits, and those in plane 16 take six. Everything outside the BMP is known as the supplementary planes, or informally the astral planes.
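The digit counts can be checked with a short Python sketch. It assumes Python 3, and the helper names `label` and `is_bmp` are hypothetical. Padding to a minimum of four hex digits gives four digits inside the BMP, five in planes 1 through 15, and six in plane 16.

```python
def label(ch: str) -> str:
    """Format a character's code point in U+ notation (minimum 4 hex digits)."""
    return f"U+{ord(ch):04X}"


def is_bmp(ch: str) -> bool:
    """True if the character is in plane 0, the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


for ch in ["A", "\u4e2d", "\U0001F600", "\U0010FFFF"]:
    print(label(ch), "BMP" if is_bmp(ch) else "supplementary")
# U+0041 BMP
# U+4E2D BMP
# U+1F600 supplementary
# U+10FFFF supplementary
```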
So how does this relate to the meta tag with UTF-8 in the head that we're all used to seeing? We know about code points, which identify abstract characters: U+0041 is A. And then there are code units, the physical bits, because computers, at the memory level, don't use code points or abstract characters. They need a physical way to represent Unicode code points. So computers translate Unicode code points into physical bits using a character encoding. Unicode defines a popular, openly specified family of character encodings called the Unicode Transformation Formats, or UTF, which does this job for us. The popular UTF encodings are UTF-8, UTF-16, and UTF-32. UTF is really cool. It's reversible, and the conversions between all of the encodings are algorithmic, so you can convert from one to another without losing information.
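Here is a minimal sketch of that translation step, assuming Python 3 and its built-in codecs. The big-endian variants (`utf-16-be`, `utf-32-be`) are used only so that no byte order mark is added to the output; the helper name `encodings` is made up for illustration.

```python
def encodings(ch: str) -> dict:
    """Encode one character with each UTF and return the bytes as hex strings."""
    return {name: ch.encode(name).hex()
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


print(encodings("A"))
# {'utf-8': '41', 'utf-16-be': '0041', 'utf-32-be': '00000041'}
print(encodings("\U0001F600"))
# {'utf-8': 'f09f9880', 'utf-16-be': 'd83dde00', 'utf-32-be': '0001f600'}

# Reversible: decode UTF-8 bytes back to code points, then re-encode as UTF-16.
utf8_bytes = "A\U0001F600".encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")
print(utf16_bytes.decode("utf-16-be") == "A\U0001F600")  # True
```

The same code point ends up as different bytes under each encoding, but decoding always recovers the original code point.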