Emoji Encoding, � Unicode, & Internationalization

Why does '👩🏿‍🎤'.length = 7? Is JavaScript UTF-8 or UTF-16? What happens under the hood when you set <meta charset="UTF-8">? Have you ever wondered how emoji and languages with complex scripts are encoded to work correctly across browsers and devices - for billions of people around the world? Or how new emoji are introduced and approved? Have you ever seen one of these “special” glyph characters (□ or �) and wanted to know why they appear and how to avoid them in the future? Let’s talk about Unicode encoding in JavaScript and across the World Wide Web! We’ll go over best practices and common pitfalls, and provide resources to learn more - even where to go if you want to submit a new emoji proposal! :)

This talk was presented at JSNation Live 2020.

FAQ

UTF-8 is commonly used in web development for encoding web pages. It ensures that text appears correctly across different platforms and devices, handling various character sets efficiently.

Unicode assigns a unique code point to every character, no matter the platform, program, or language, ensuring consistent representation across different systems. It supports over a million code points which cover a comprehensive range of characters and symbols.
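As a small illustration of code points, the following sketch prints the code point of a few characters in the conventional U+XXXX notation. The helper name `toCodePointLabel` is made up for this example; `codePointAt` is the standard JavaScript string method.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toCodePointLabel(char) {
  const codePoint = char.codePointAt(0);
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// A, é, €, and the grinning face emoji, written as escapes to avoid ambiguity.
const sampleChars = ['A', '\u00E9', '\u20AC', '\u{1F600}'];
const codePointLabels = sampleChars.map(toCodePointLabel);
console.log(codePointLabels); // [ 'U+0041', 'U+00E9', 'U+20AC', 'U+1F600' ]

// The Unicode code space runs from U+0000 to U+10FFFF.
const codeSpaceSize = 0x10FFFF + 1;
console.log(codeSpaceSize); // 1114112
```

The code space size is why Unicode is described as supporting over a million code points; only part of that space is currently assigned.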

The meta charset UTF-8 line in HTML specifies that the character encoding for the webpage is UTF-8. This is crucial for correctly displaying text that includes special characters and symbols from multiple languages.

In JavaScript, emojis might be encoded as multiple code units, especially those outside the Basic Multilingual Plane (BMP). This can make the length of a string containing emojis appear longer than the number of visible characters.

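Developers can use JavaScript's string normalization features, like the normalize() method, to convert strings to a consistent Unicode normalization form, so that canonically equivalent sequences (for example, an accented letter stored as one code point or as a base letter plus a combining mark) compare as equal. Normalization does not change how many code units an emoji occupies; counting what users perceive as characters requires iterating by code point or by grapheme cluster.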
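A minimal sketch of normalize() in action, assuming the common case of an accented letter that can be stored two ways. The variable names are illustrative only.

```javascript
// The same visible letter "é", stored two different ways.
const composedE = '\u00E9';     // one code point: LATIN SMALL LETTER E WITH ACUTE
const decomposedE = 'e\u0301';  // two code points: "e" + COMBINING ACUTE ACCENT

// Without normalization the strings differ, even though they look identical.
const equalBefore = composedE === decomposedE;
console.log(equalBefore); // false

// NFC composes where possible, so both forms become the same sequence.
const equalAfter = composedE.normalize('NFC') === decomposedE.normalize('NFC');
console.log(equalAfter); // true

// The decomposed form shrinks from 2 code units to 1 under NFC.
console.log(decomposedE.length, decomposedE.normalize('NFC').length); // 2 1
```

Normalizing both sides before comparing or storing user input is one way to avoid bugs where visually identical strings fail an equality check.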

ASCII was initially designed to encode 128 English characters, which was insufficient for languages with diacritics and other symbols. This limitation led to the development of extended ASCII and eventually Unicode for broader language support.

The zero-width joiner (ZWJ) is used in emoji sequences to combine multiple emojis into a single glyph. For instance, family emojis are created by combining individual human emojis with a ZWJ, resulting in a single, composite representation.
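The joining described above can be sketched by building a family emoji from its parts. This assumes the man, woman, girl, boy sequence; whether it renders as one glyph depends on the font and platform.

```javascript
// U+200D is the zero-width joiner (ZWJ).
const ZWJ = '\u200D';

// Man, woman, girl, boy as individual emoji code points.
const familyMembers = ['\u{1F468}', '\u{1F469}', '\u{1F467}', '\u{1F466}'];

// Joining them with ZWJ produces the sequence that platforms may draw as one family glyph.
const joinedFamily = familyMembers.join(ZWJ);
console.log(joinedFamily);

// 4 emoji x 2 UTF-16 code units each, plus 3 joiners.
console.log(joinedFamily.length); // 11

// Splitting on the joiner recovers the individual people.
const recoveredMembers = joinedFamily.split(ZWJ);
console.log(recoveredMembers.length); // 4
```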

Surrogate pairs are used in UTF-16 encoding to represent characters outside the Basic Multilingual Plane. These characters require more than 16 bits and are encoded using two 16-bit code units.
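To make the surrogate pair arithmetic concrete, here is a sketch that computes the two 16-bit code units for a code point above U+FFFF. The function name `toSurrogatePair` is an assumption for illustration; the arithmetic follows the UTF-16 definition.

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only code points above U+FFFF need a surrogate pair');
  }
  const offset = codePoint - 0x10000;        // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);      // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

const [highSurrogate, lowSurrogate] = toSurrogatePair(0x1F600); // grinning face
console.log(highSurrogate.toString(16), lowSurrogate.toString(16)); // d83d de00

// JavaScript strings expose exactly these code units.
const grinningFace = '\u{1F600}';
console.log(grinningFace.charCodeAt(0) === highSurrogate); // true
console.log(grinningFace.charCodeAt(1) === lowSurrogate);  // true
```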

UTF-8 is recommended for the web due to its efficiency in encoding a vast range of characters using 1 to 4 bytes, compatibility across different systems, and its ability to handle any Unicode character, making it ideal for international environments.
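The 1 to 4 byte range can be checked with the standard TextEncoder API, which always encodes to UTF-8. The sample characters below are chosen for illustration, one per byte length.

```javascript
// TextEncoder always produces UTF-8 bytes.
const utf8Encoder = new TextEncoder();

// A (ASCII), é (Latin-1 range), € (rest of the BMP), grinning face (outside the BMP).
const utf8Samples = ['A', '\u00E9', '\u20AC', '\u{1F600}'];

const utf8ByteCounts = utf8Samples.map((sample) => utf8Encoder.encode(sample).length);
console.log(utf8ByteCounts); // [ 1, 2, 3, 4 ]

// The euro sign as hex bytes.
const euroBytes = Array.from(utf8Encoder.encode('\u20AC'), (byte) =>
  byte.toString(16).padStart(2, '0')
);
console.log(euroBytes.join(' ')); // e2 82 ac
```

Plain ASCII text stays at one byte per character, which is part of why UTF-8 is compatible with older ASCII-based systems.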

Naomi Meyer
34 min
18 Jun, 2021

Video Summary and Transcription

The video discusses the importance of understanding how emojis are encoded, with a focus on UTF-8. It highlights that UTF-8 is recommended for web use due to its compatibility with various systems and efficient handling of different character sets. The talk delves into how Unicode handles different languages and symbols, assigning a unique code point to each character, which ensures consistent representation across platforms. JavaScript's String.prototype.normalize method is suggested to help avoid localization bugs. The history of encoding is also covered, from ASCII to the introduction of Unicode in 1991, which aimed to resolve issues with non-English characters. The video emphasizes the role of the zero-width joiner in creating compound emojis and discusses the challenges of string length in JavaScript when dealing with emojis. The talk concludes with insights into the Unicode Consortium and how new emojis can be proposed.

1. Introduction to UTF-8 Encoding and Emojis

Short description:

I'm Naomi Meyer, a software development engineer at Adobe, and today I'll be talking about the UTF-8 encoding and how it relates to emojis. We'll also touch on the Unicode Consortium, the history of encoding, and the importance of considering global usage when building software products. Let's start by understanding how characters are interpreted by computers, and then delve into the first encodings, specifically ASCII.

Hi, thanks for that great introduction. I'm Naomi Meyer, and I work as a software development engineer at Adobe, where I do localization and internationalization engineering for creative products like Adobe Fonts and Adobe Portfolio. So here's where you can find me online, and if there's anything that you feel passionate about or have strong opinions on, please let me know. I'd love to continue this conversation online.

So I'm sure most of us have seen this many times in the head tag of our HTML markup, and we all know to add this meta charset UTF-8 line. But I've been fascinated lately by the underlying details of what this UTF-8 thing really means and does. So I'm excited to share some more details about what it is and why I think it's so cool. Also, this connects with this JavaScript funniness that we can see here: these emojis look like single characters, but as strings in JavaScript they have a length a lot longer than expected. And feel free to try this yourself in your dev tools. I know I wanted to test it out when I first saw something like this. And part of my goal today is to talk about why this is, why familyEmoji.length is equal to 11, and to provide some more details about the underlying encoding happening here and what we can do to handle it correctly.
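If you want to try this in your dev tools, something like the following sketch reproduces the lengths mentioned. The name `familyEmoji` comes from the talk; `singerEmoji` and `countGraphemes` are made up here, and the grapheme count relies on Intl.Segmenter, which is assumed to be available only in newer engines.

```javascript
// Woman + dark skin tone + ZWJ + microphone (the singer emoji).
const singerEmoji = '\u{1F469}\u{1F3FF}\u200D\u{1F3A4}';
// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy (the family emoji).
const familyEmoji = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

// .length counts UTF-16 code units.
console.log(singerEmoji.length); // 7
console.log(familyEmoji.length); // 11

// Spreading a string iterates by code point instead.
console.log([...singerEmoji].length); // 4
console.log([...familyEmoji].length); // 7

// Hypothetical helper: count user-perceived characters, or undefined if unsupported.
function countGraphemes(text, locale = 'en') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return undefined;
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

console.log(countGraphemes(familyEmoji)); // 1 where Intl.Segmenter is supported
```

The three counts answer different questions: storage size in UTF-16, number of Unicode code points, and what a person would call one character.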

So speaking of emojis, I personally find them both delightful and intriguing from an engineering perspective, a linguistic perspective, a creative design perspective, a cultural sociological international perspective, and so much more. Our agenda for today is to start with a bit of encoding history to understand more of where we came from. Then we'll get into the Unicode Consortium and the UTF-8 algorithm. Then we'll talk about how this allows us to encode emojis and different languages across platforms, devices and operating systems. Overall, I think it's so important for us to keep these big ideas in mind when we build software products that are being used globally. This is kind of a timeline, a broad timeline, of what we'll touch on today. We've got a lot to cover in these 20 minutes. Let's jump right in with encoding to start.

Of course, when we're on our computers and we type an emoji, a letter, a character in any language, these are ultimately interpreted by the machine as zeros and ones. Let's get into a bit about how that works. In order to understand how it's working today, let's go back in the past to the 1960s of the first encodings. Back in the 60s, there were these big computers that filled a whole room. This is a picture of one from NASA. Engineers back then came up with a system. That system is called ASCII, the American Standard Code for Information Interchange. This image is from the first version that was published in 1963. ASCII was developed from telegraph code. It was originally built for more convenient sorting of lists, alphabetically by ascending, descending characters.

2. ASCII and the Birth of Unicode

Short description:

In 1963, ASCII encoded 128 English-only characters as 7-bit integers. Over time, bugs arose as characters with diacritics and accents were added for non-English languages. The internet exacerbated the problem of conflicting encodings, leading to errors and question marks. In 1991, Unicode version 1 was introduced as a universal encoding standard to address this issue. Today, Unicode's mission is to enable everyone to use their own language on their devices. Unicode version 1.0, published in October 1991, had 7,161 characters. Understanding Unicode requires a shift in thinking about abstract characters and code points.

In 1963, ASCII encoded 128 English-only characters as 7-bit integers. So, I think ASCII is pretty cool because the way it was built makes sense.

So first, we take a character, like the letter A, and we assign an ASCII decimal number, 65, which in binary is equal to 1, five zeros, one, in the original 7-bit system. And then after 65 goes 66, which is B, and we continue alphabetically, all the way to Z, which is ASCII decimal 90. Then to go from uppercase to lowercase characters, we only change one bit: lowercase a comes 32 later, which is ASCII decimal number 97. And that continues alphabetically. So I think it's a cool system. And shout out to Tom Scott for the video he has that explains it really clearly.
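The numbers above are easy to check in JavaScript, since the first 128 code units match ASCII. This is just a sketch; `toggleAsciiCase` is a made-up helper that only makes sense for the letters A to Z and a to z.

```javascript
// 'A' is ASCII decimal 65, which is 1000001 in 7-bit binary.
const codeOfUpperA = 'A'.charCodeAt(0);
console.log(codeOfUpperA); // 65
console.log(codeOfUpperA.toString(2).padStart(7, '0')); // 1000001

// 'Z' is 90, and lowercase 'a' sits 32 later than 'A' at 97.
console.log('Z'.charCodeAt(0)); // 90
console.log('a'.charCodeAt(0)); // 97

// The difference between cases is a single bit with value 32.
const CASE_BIT = 0b0100000;

// Hypothetical helper: flip the case bit of an ASCII letter.
function toggleAsciiCase(letter) {
  return String.fromCharCode(letter.charCodeAt(0) ^ CASE_BIT);
}

console.log(toggleAsciiCase('A')); // a
console.log(toggleAsciiCase('z')); // Z
```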

So ASCII made sense, and it kind of worked in America in the 60s. But over time, there were lots of bugs. And those bugs came because non-English characters, like those pictured here, include diacritics and additional accents. So ASCII originally worked with seven bits. But then computers moved to eight bits, and we went from 128 characters to 256 characters. And different countries and different language systems added more characters with those extra 128. And different languages like Japanese, for example, did their entire own thing. They had a separate multi-byte encoding system, you know, and Japanese, Russian, all these languages had a different encoding system. And that was fine when they worked independently. But then came the internet. And with the World Wide Web, the internet kind of broke computers because there was no universal encoding system. When two different non-compatible encoding systems encountered one another, we got these types of errors like we see here with a lot of question marks and a lot of bugs.
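A rough sketch of how such a mismatch produces garbage, assuming a word with one accented letter and two encodings, UTF-8 and Latin-1. The variable names are illustrative.

```javascript
// The word "café", with é written as an escape.
const originalText = 'caf\u00E9';

// Case 1: text saved as UTF-8, then read as if every byte were one Latin-1 character.
const savedAsUtf8 = new TextEncoder().encode(originalText); // bytes 63 61 66 c3 a9
const misreadAsLatin1 = String.fromCharCode(...savedAsUtf8);
console.log(misreadAsLatin1); // cafÃ©

// Case 2: text saved as Latin-1 (one byte per character), then read as UTF-8.
const savedAsLatin1 = Uint8Array.from(originalText, (ch) => ch.charCodeAt(0)); // bytes 63 61 66 e9
const misreadAsUtf8 = new TextDecoder('utf-8').decode(savedAsLatin1);
console.log(misreadAsUtf8); // caf� (the last character is U+FFFD, the replacement character)
```

In the second case the decoder cannot make sense of the lone byte, so it substitutes the replacement character, which is where the question mark glyph comes from.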

So, in 1991, with the World Wide Web, we got Unicode version 1. And Unicode was designed to be a universal encoding standard to solve this problem of conflicting character encodings. So, Unicode today is a non-profit whose mission is that everyone in the world should be able to use their own language on phones and computers. Unicode version 1.0 was published in October 1991 and had 7,161 characters. To understand and think about Unicode, you kind of have to make a mental shift in your assumptions about language and characters. So, there are three kind of big ideas to think about. Props to Dmitri Pavlutin, who has this great article I recommend called What Every JavaScript Developer Should Know About Unicode. So, the first idea to keep in mind is abstract characters. So, instead of thinking about letters in an alphabet, it's good to think about abstract characters, and Unicode deals with characters in these abstract terms. Second, we have code points.
