
Unicode Character

Version 14.0 of the Unicode Standard contains 144,697 characters.
A character is the minimum unit of text with semantic value.
There are multiple ways to represent (encode) a character as a sequence of bytes, one of which is UTF-8.
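
As a small illustration of one character having several byte representations, the following Python sketch encodes the character ắ (U+1EAF, chosen here only as an example) in UTF-8, UTF-16 and UTF-32. The variable names are illustrative, and the big-endian variants are used so that no byte order mark is emitted.

```python
# One character, several byte representations.
ch = "\u1eaf"  # ắ, LATIN SMALL LETTER A WITH BREVE AND ACUTE

# Encode the same character with three different Unicode encodings.
utf8 = ch.encode("utf-8")       # variable length, 1 to 4 bytes per character
utf16 = ch.encode("utf-16-be")  # 2 or 4 bytes per character, big-endian
utf32 = ch.encode("utf-32-be")  # always 4 bytes per character, big-endian

print(f"code point: U+{ord(ch):04X}")
print("UTF-8 :", utf8.hex())
print("UTF-16:", utf16.hex())
print("UTF-32:", utf32.hex())

# Decoding with the matching encoding gives back the same character.
assert utf8.decode("utf-8") == ch
```

The code point stays the same in all three cases; only the bytes used to store it differ.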

Normalization

Some Unicode characters can be represented, equivalently, by more than one sequence of code points.
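For example, the character ắ can be represented by any of the following sequences of code points:
U+1EAF (LATIN SMALL LETTER A WITH BREVE AND ACUTE)
U+0103 U+0301 (LATIN SMALL LETTER A WITH BREVE, COMBINING ACUTE ACCENT)
U+0061 U+0306 U+0301 (LATIN SMALL LETTER A, COMBINING BREVE, COMBINING ACUTE ACCENT)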
Needless to say, this phenomenon complicates text-related operations such as searching, sorting and matching.
To make things somewhat easier, the Unicode Standard defines processes by which the different representations of a character can be normalized into a single representation.
Once Unicode strings have been normalized, they can be compared or operated on more easily.
These processes, or algorithms, are referred to as normalization forms: C, D, KC and KD (also written NFC, NFD, NFKC and NFKD).
See also the .NET enum System.Text.NormalizationForm.
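
The effect of normalization can be sketched with Python's standard unicodedata module (used here in place of the .NET API mentioned above). The example builds three equivalent spellings of ắ, shows that they compare unequal as raw strings, and then compares them after normalization. The helper code_points and the ligature example are illustrative additions, not part of any standard API.

```python
import unicodedata

# Three canonically equivalent spellings of the character ắ.
precomposed = "\u1eaf"        # single code point U+1EAF
partly = "\u0103\u0301"       # ă (U+0103) + COMBINING ACUTE ACCENT
decomposed = "a\u0306\u0301"  # a + COMBINING BREVE + COMBINING ACUTE ACCENT
spellings = [precomposed, partly, decomposed]


def code_points(s):
    """Format a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Without normalization, the three strings are all different.
raw_distinct = len(set(spellings))

# Form C composes as far as possible, form D decomposes as far as possible.
nfc = {unicodedata.normalize("NFC", s) for s in spellings}
nfd = {unicodedata.normalize("NFD", s) for s in spellings}

print("distinct raw strings:", raw_distinct)
print("NFC:", [code_points(s) for s in nfc])
print("NFD:", [code_points(s) for s in nfd])

# The K forms additionally replace compatibility characters,
# for example the ligature ﬁ (U+FB01) becomes the two letters "fi".
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print("NFC :", code_points(nfc_ligature))
print("NFKC:", code_points(nfkc_ligature))
```

Normalizing both sides of a comparison to the same form (C or D) is enough to make canonically equivalent strings compare equal. The K forms are lossy, so they are better suited to searching and matching than to storing text.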

See also

character set
The .NET class System.Globalization.CharUnicodeInfo provides information about a Unicode character.
TODO: should this page be merged with Code points?
