UTF-8

UTF-8 is a variable-width character encoding used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format – 8-bit.

UTF-8 is by far the most common encoding for the World Wide Web, accounting for over 96% of all web pages, and up to 100% for some languages, as of 2021. UTF-8 is the recommendation from the WHATWG for HTML and DOM specifications, and the Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8.

Encoding Rules

UTF-8 uses one to four bytes to encode each character and can represent any standard Unicode character.

The encoding rules are as follows. A single-byte sequence has its most significant bit set to 0. In a multi-byte sequence, the number of consecutive 1 bits at the start of the first byte, counting from the most significant bit, equals the total number of bytes in the sequence, and those 1 bits are followed by a 0 bit. Every remaining byte begins with the bits 10.
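As an illustration of the first-byte rule, the following Python sketch reads the leading bits of a byte and reports how long its sequence is. The function name `utf8_sequence_length` is hypothetical, chosen for this example only. The sketch checks the bit pattern alone, so it does not reject lead bytes such as 0xC0 or 0xF5, which match a pattern but never occur in valid UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the length in bytes of the UTF-8 sequence that starts with first_byte.

    Hypothetical helper for illustration; it inspects only the leading bit pattern.
    """
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: single-byte sequence
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx: two leading 1 bits
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx: three leading 1 bits
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx: four leading 1 bits
    # 10xxxxxx is a continuation byte; 11111xxx matches no UTF-8 pattern.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(ch, encoded.hex(), utf8_sequence_length(encoded[0]))
```

Passing a continuation byte such as 0x80 raises `ValueError`, which is how a decoder can tell that it has landed in the middle of a sequence.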

The layout of UTF-8 byte sequences:

| Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|
| 1 | U+0000 | U+007F | 0xxxxxxx | - | - | - |
| 2 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | - | - |
| 3 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | - |
| 4 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
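The byte layout can be turned into a small encoder by hand. The Python sketch below is a minimal illustration, not a replacement for the built-in `str.encode`. The name `encode_code_point` is hypothetical. The check that rejects surrogate code points (U+D800 to U+DFFF) is not shown in the layout itself; it is included here because the Unicode Standard does not allow surrogates in UTF-8.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the byte layout by hand.

    Hypothetical helper for illustration only.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption beyond the layout: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")

    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b1100_0000 | (cp >> 6),
                      0b1000_0000 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b1110_0000 | (cp >> 12),
                      0b1000_0000 | ((cp >> 6) & 0x3F),
                      0b1000_0000 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b1111_0000 | (cp >> 18),
                  0b1000_0000 | ((cp >> 12) & 0x3F),
                  0b1000_0000 | ((cp >> 6) & 0x3F),
                  0b1000_0000 | (cp & 0x3F)])


if __name__ == "__main__":
    # The hand-written encoder should agree with Python's built-in codec.
    for ch in ("A", "é", "€", "😀"):
        assert encode_code_point(ord(ch)) == ch.encode("utf-8")
    print(encode_code_point(0x20AC).hex())  # prints e282ac
```

Each branch fills the x positions of one layout row: the lead byte takes the highest bits of the code point, and every continuation byte takes the next six bits.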
