  Character Encoding notes: ASCII, Unicode and UTF-8
     
  Add Date : 2018-11-21      
         
         
         
I suddenly wanted to work out the relationship between Unicode and UTF-8, so I started looking up material online.

The question turned out to be more complicated than I expected. I read from after lunch until 9 p.m. before I had a preliminary understanding of it.

Here are my notes, written mainly to organize my own thoughts. I have tried to keep them as easy to follow as possible, in the hope that they are useful to others. After all, character encoding is a cornerstone of computing; if you want to use computers proficiently, you need to understand it.

1. ASCII code

We know that inside a computer all information is ultimately represented as binary. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. It is called ASCII and is still in use today.

ASCII defines codes for 128 characters in total; for example, the space character SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the last 7 bits of a byte; the first bit is always 0.
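The two codes quoted above can be checked with a minimal Python sketch. The variable names are only illustrative; `ord` and `format` are Python built-ins.

```python
# Look up the ASCII code of a character and show it as 8 binary digits.
for ch in [" ", "A"]:
    code = ord(ch)
    print(repr(ch), code, format(code, "08b"))

# Every ASCII code fits in 7 bits, so the leading bit of the byte is always 0.
ascii_bytes = bytes(range(128))
all_high_bits_zero = all(b & 0x80 == 0 for b in ascii_bytes)
print(all_high_bits_zero)
```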

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough for other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). The encoding systems used in these European countries can therefore represent up to 256 symbols.

However, this creates a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols for 0-127 are the same; only the range 128-255 differs.
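As a rough illustration, the same byte 130 can be decoded with three legacy code pages in Python. The text does not name the specific encodings, so the choice of `cp437`, `cp862` and `cp866` (the DOS code pages for Western European, Hebrew and Cyrillic text) is an assumption made for this sketch.

```python
raw = bytes([130])  # the single byte 10000010

# Assumed stand-ins for the "French", "Hebrew" and "Russian" encodings.
western = raw.decode("cp437")
hebrew = raw.decode("cp862")
russian = raw.decode("cp866")
print(western, hebrew, russian)

# The lower half (0-127) decodes identically in all three code pages.
lower_half = bytes(range(128))
same_lower_half = (
    lower_half.decode("cp437")
    == lower_half.decode("cp862")
    == lower_half.decode("cp866")
)
print(same_lower_half)
```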

The scripts of Asian countries use even more symbols; there are about 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is far from enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character and so can in theory represent up to 256 x 256 = 65536 symbols.

Chinese encodings need a separate discussion and are not covered in this note. The only point to make here is that although the GB family of encodings also uses multiple bytes per symbol, it is unrelated to the Unicode and UTF-8 discussed below.

3. Unicode

As the previous section showed, the world has many encodings, and the same binary number can be interpreted as different symbols. To open a text file you must therefore know its encoding; reading it with the wrong encoding produces garbled text. Why is email so often garbled? Because the sender and the recipient are using different encodings.

Imagine an encoding that includes every symbol in the world, with each symbol given a unique code. The garbled-text problem would then disappear. That is Unicode: as its name suggests, it is an encoding for all symbols.
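A small Python sketch of how such garbling comes about: text is written in one encoding and read back in another. The pairing of UTF-8 on the sending side with Latin-1 on the receiving side is an assumption chosen for illustration.

```python
text = "\u4e25"                      # the Chinese character for "strict"
data = text.encode("utf-8")          # bytes as written by the sender

garbled = data.decode("latin-1")     # receiver assumes the wrong encoding
correct = data.decode("utf-8")       # receiver assumes the right encoding

print(repr(garbled))  # three unrelated Latin-1 characters
print(repr(correct))  # the original character
```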

Unicode is of course a very large set; it currently has room for more than one million symbols. Each symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full table of symbols, consult unicode.org or a dedicated character lookup table.
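The three code points mentioned above can be looked up with Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is made up for the example.

```python
import unicodedata

def describe(ch):
    """Return the code point in U+XXXX notation plus its official Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in ["\u0639", "A", "\u4e25"]:
    print(describe(ch))
```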

4. The problems with Unicode

Note that Unicode is only a symbol set: it specifies each symbol's binary code, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25, which is 15 bits long in binary (100111000100101), so representing this symbol takes at least two bytes. Symbols with larger codes may need three or four bytes, or even more.
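The bit count in the paragraph above can be verified with a short Python sketch; the variable names are illustrative only.

```python
code_point = 0x4E25

binary = format(code_point, "b")     # binary digits without leading zeros
bits = code_point.bit_length()       # number of significant bits
min_bytes = (bits + 7) // 8          # round up to whole bytes

print(binary, bits, min_bytes)
```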

Two serious problems arise here. First, how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to use three or four bytes, then two or three bytes of every English letter would have to be 0. That is a huge waste of storage: text files would become two to three times larger, which is unacceptable.

The results were: 1) multiple ways of storing Unicode appeared, that is, many different binary formats that can represent Unicode; 2) for a long time Unicode could not be widely adopted, until the Internet arrived.
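To make the storage cost concrete, the sketch below compares one byte per letter with a fixed four bytes per letter. It uses Python's `utf-32-be` codec as a stand-in for "every symbol takes four bytes"; that choice is an assumption made for illustration.

```python
text = "Hello, world"

one_byte_each = text.encode("ascii")
four_bytes_each = text.encode("utf-32-be")   # fixed width, no byte order mark

print(len(one_byte_each), len(four_bytes_each))

# For an English letter, three of the four bytes are zero.
print("A".encode("utf-32-be"))
```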

5. UTF-8

With the spread of the Internet came a strong demand for a single unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat, the relationship is this: UTF-8 is one of the implementations of Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the number of bytes depending on the symbol.
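The byte counts can be compared with a short Python sketch. The four sample characters are chosen for this example, and the helper name `encoded_lengths` is hypothetical.

```python
def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no byte order mark)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```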

The UTF-8 encoding rules are very simple; there are only two:

1) For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits are the symbol's Unicode code. So for English letters, UTF-8 and ASCII are identical.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a single-byte character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.

Below, the Chinese character 严 ("strict") is again used as an example to show how UTF-8 encoding is carried out.

The Unicode code of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so its UTF-8 encoding takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of the code, fill in the x positions of the format from back to front, padding the leftover positions with 0. The resulting UTF-8 encoding is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
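The reading rule above can be sketched in Python as follows. The function names are made up for this example, and the code only inspects leading bytes; it does not validate continuation bytes.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a character occupies, judged by its first byte."""
    if first_byte & 0b10000000 == 0:
        return 1                                   # 0xxxxxxx
    if first_byte & 0b11100000 == 0b11000000:
        return 2                                   # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3                                   # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4                                   # 11110xxx
    raise ValueError("not a valid UTF-8 leading byte")

def split_characters(data):
    """Split a UTF-8 byte string into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

print(split_characters("A\u4e25".encode("utf-8")))
```

A byte that starts with 10 is a continuation byte, so passing one as the first byte raises `ValueError` in this sketch.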
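The fill-in procedure just described can be written as a small Python function. This is a minimal sketch that follows the four rows of the table; the name `utf8_encode` is hypothetical, and it does not handle special cases such as surrogate code points, which Python's built-in encoder rejects.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes, following the table row by row."""
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point out of range")

encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex().upper())
```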

6. Converting between Unicode and UTF-8

The example above shows that the Unicode code of 严 ("strict") is 4E25 while its UTF-8 encoding is E4B8A5; the two are not the same. Conversion between them can be done by a program.

On Windows, a simple way to convert is the built-in Notepad program, Notepad.exe. After opening a file, choose "Save As" from the "File" menu; the dialog box that pops up has an "Encoding" drop-down list at the bottom.
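One possible form of such a program, sketched in Python with the built-in codecs. Both function names are invented for this example, and each handles a single character only.

```python
def code_point_to_utf8_hex(code_point_hex):
    """'4E25' -> 'E4B8A5': Unicode code (hex) to UTF-8 bytes (hex)."""
    return chr(int(code_point_hex, 16)).encode("utf-8").hex().upper()

def utf8_hex_to_code_point(utf8_hex):
    """'E4B8A5' -> '4E25': UTF-8 bytes (hex) back to the Unicode code (hex)."""
    ch = bytes.fromhex(utf8_hex).decode("utf-8")
    return format(ord(ch), "04X")

print(code_point_to_utf8_hex("4E25"))
print(utf8_hex_to_code_point("E4B8A5"))
```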

It offers four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode here means the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses little endian format.

3) Unicode big endian corresponds to the previous option, but in big endian format. The meaning of little endian and big endian is explained in the next section.

4) UTF-8 is the encoding discussed above.

After choosing the encoding, click "Save" and the file is converted immediately.
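The same kind of conversion can be scripted. The sketch below is not what Notepad does internally; it is a hypothetical helper, `convert_file`, that reads a file in one encoding and writes it in another, using Python codec names (`gb2312`, `utf-8-sig`) as assumed equivalents of the menu options. The file names are placeholders inside a temporary directory.

```python
import os
import tempfile

def convert_file(src_path, dst_path, src_encoding, dst_encoding):
    """Read text in src_encoding and write it back out in dst_encoding."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "gb.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write(b"\xd1\xcf")             # one character saved as GB2312
    convert_file(src, dst, "gb2312", "utf-8-sig")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.hex(" ").upper())
```

The `utf-8-sig` codec writes the three-byte signature EF BB BF before the text; plain `utf-8` would omit it.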

7. Little endian and Big endian

As mentioned in the previous section, a Unicode code can be stored directly in UCS-2 format. Take the Chinese character 严 ("strict"): its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is big endian; storing 25 first and 4E second is little endian.

These two odd names come from "Gulliver's Travels" by the British writer Jonathan Swift. In the book, civil war breaks out in Lilliput over whether eggs should be cracked at the big end (Big-Endian) or at the little end (Little-Endian). Six wars were fought over the matter; one emperor lost his life and another lost his throne.

Thus, storing the first byte first is the "big end" way (big endian), and storing the second byte first is the "little end" way (little endian).

A question naturally follows: how does the computer know which byte order a given file uses?

The Unicode specification defines a character that is placed at the very start of a file to indicate the byte order. It is called "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and is represented by FEFF. This is exactly two bytes, and FF is greater than FE by 1.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
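The two byte orders can be reproduced with `int.to_bytes` in Python. This is only a sketch; it also checks the result against Python's `utf-16-be` and `utf-16-le` codecs, which give the same two bytes for this character.

```python
code = 0x4E25

big = code.to_bytes(2, "big")          # 4E first, then 25
little = code.to_bytes(2, "little")    # 25 first, then 4E

print(big.hex(" ").upper())
print(little.hex(" ").upper())

# Reading the bytes back requires knowing which order was used.
print(hex(int.from_bytes(little, "little")), hex(int.from_bytes(little, "big")))
```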
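A minimal Python sketch of this check follows. The function name `detect_byte_order` is hypothetical, and the sketch only covers the two-byte marks described here; it ignores other encodings whose files happen to start with the same bytes.

```python
import codecs

def detect_byte_order(data):
    """Return 'big', 'little', or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

print(detect_byte_order(b"\xfe\xff\x4e\x25"))
print(detect_byte_order(b"\xff\xfe\x25\x4e"))
print(detect_byte_order(b"plain text"))
```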

8. Examples

Here is an example.

Open Notepad (Notepad.exe), create a new text file containing the single character 严 ("strict"), and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.

Then use the "Hex" function of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is the two bytes "D1 CF", which is the GB2312 encoding of 严; this also implies that GB2312 is stored big end first.

2) Unicode: the file is the four bytes "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is the four bytes "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8; the last three, "E4 B8 A5", are the actual encoding of 严, stored in the same order as the encoding itself.
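The four byte sequences can be reproduced without Notepad or UltraEdit. The sketch below uses Python codecs as assumed equivalents of the four Notepad options (`gb2312` for ANSI on Simplified Chinese Windows, UTF-16 with a byte order mark for the two Unicode options, `utf-8-sig` for UTF-8); `hexdump` is a helper written for this example.

```python
import codecs

def hexdump(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

ch = "\u4e25"
saved = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),
}

for option, data in saved.items():
    print(f"{option}: {hexdump(data)}")
```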
     
         
         
         
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.