UTF-8 for Communication
One of the many issues faced in certain kinds of data transfer is that many transport layers cannot handle embedded NUL (0) bytes. When working with data in many languages, Unicode is a nice internal representation for characters, but it is untenable for transmitting data to programs that expect 8-bit characters without embedded NULs. You may need to store the characters in files, transmit them over a network, or use them in some other way that must be compatible with programs expecting 8-bit byte sequences.
One of the problems you have to address is what the different encodings mean in practice. If you use UTF-16, your file or network transmission will contain embedded 0 bytes. Some applications, and some protocols, do not like this. Some programs that read files treat the characters as 8-bit characters at the lower layers of I/O, and those lower-level layers treat a 0 byte as an end-of-string marker. The result is an unfortunate truncation of the data; by the time you see it at the higher levels of your application, where you want to treat it as Unicode, it has been hopelessly corrupted. In other cases, the representation is "optimized" by lower-level layers of the input subroutine, and those "useless" 0 bytes are simply discarded. Your data is then not truncated, but it is still hopelessly corrupted, and useless.
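To see the truncation failure concretely, here is a minimal sketch (plain C-style code, not from the project sources) that hands the UTF-16LE bytes of "ABC" to an 8-bit string routine; strlen stops at the first 0 byte and reports a one-byte string:

```cpp
#include <stdio.h>
#include <string.h>

int main()
    {
     // UTF-16LE bytes of "ABC"; every character carries a 0 high byte
     const char data[] = { 'A', 0, 'B', 0, 'C', 0 };

     // An 8-bit string routine treats the first 0 byte as end-of-string,
     // so it sees one byte of data instead of six
     printf("strlen sees %u byte(s) of %u\n",
            (unsigned)strlen(data), (unsigned)sizeof(data));
     return 0;
    }
```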
UTF-8 encoding, discussed in more detail below, solves these problems because it contains no embedded 0 bytes. However, it requires that the applications that subsequently process the data be prepared to accept UTF-8. Otherwise, the representation is not useful. When a single program, or a suite of programs written by a single programmer or organization, is involved, it is easy to declare that UTF-8 is the external representation of choice. It is harder when your application lives in an ecosystem with many other programs, written by end users, Microsoft, and other vendors.
Some programs, like Microsoft's Notepad, are willing to accept 8-bit ANSI files, 8-bit UTF-8 files, and 16-bit Unicode files, so they present no particular problem. But you have to make the final assessment of the viability of these representation choices for your own application. In some unfortunate cases, you must use UTF-8 because program X will not handle Unicode properly, but the data must also be handled by program Y, which cannot handle UTF-8! The technical term for this situation is "unresolvable". You may have to write two different representations in two different files, always a dangerous situation, or write a replacement for one of the applications, a potentially very expensive exercise.
A problem with UTF-16 is that it represents characters as 16-bit values: in the abstract, the letter 'A' is represented by the bit sequence 0000 0000 0100 0001 (0x0041). But different architectures represent multibyte values in different ways. Machines like the x86 family are known as "little-endian" machines: when you address a multibyte value with an instruction, you address the low-order byte, and successively increasing memory addresses proceed toward the high-order byte. Other machines, such as IBM mainframes, the Motorola 68x00 family, and the SPARC, are known as "big-endian" machines: when you address a multibyte value, you address the high-order byte, and successively increasing memory addresses proceed toward the low-order byte. Since a UTF-16 code unit is a multibyte value, each character can have two representations:
| Representation | High-order byte | Low-order byte |
|----------------|-----------------|----------------|
| Abstract numerical value 0x0041 | 00 | 41 |

| Representation | First byte | Second byte |
|----------------|------------|-------------|
| Big-endian | 00 | 41 |
| Little-endian | 41 | 00 |
Thus, if you write the sequence "ABC" out to a file, an x86 will write the sequence 41 00 42 00 43 00, while a big-endian machine will write the sequence 00 41 00 42 00 43. Obviously, interpreting one sequence as the other will result in serious confusion.
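If you want to see which kind of machine you are on, a quick probe (a sketch, assuming a 16-bit wchar_t as on Windows) is to examine the first stored byte of a 16-bit value:

```cpp
#include <stdio.h>

int main()
    {
     // The first byte in memory of a 16-bit value reveals the byte order
     wchar_t A = 0x0041;    // the letter 'A'
     const unsigned char * p = (const unsigned char *)&A;

     if(p[0] == 0x41)
        printf("little-endian: 'A' is stored as 41 00\n");
     else
        printf("big-endian: 'A' is stored as 00 41\n");
     return 0;
    }
```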
To help allay this confusion, a "byte order mark" (BOM) is, by tradition, written as the first character. The byte order mark is the character 0xFEFF. When written by a little-endian machine, the sequence would be FF FE; thus "ABC" written from a little-endian machine would be represented as FF FE 41 00 42 00 43 00. When the first two bytes are read and discovered to be FF FE, the file is known to be a little-endian UTF-16 encoding. A big-endian machine, writing the same sequence, would have written FE FF 00 41 00 42 00 43. By reading the first two bytes and discovering they are FE FF, you would know that the file was written in Unicode on a big-endian machine. You would then have to swap the bytes of each character pair before attempting to use them for anything practical.
| Representation | High-order byte | Low-order byte |
|----------------|-----------------|----------------|
| Abstract numerical value 0xFEFF | FE | FF |

| Representation | First byte | Second byte |
|----------------|------------|-------------|
| Big-endian | FE | FF |
| Little-endian | FF | FE |
You would also have to discard the byte order mark, since it is not actually data but metadata (data about data), and does not contribute to the content of the file. By tradition, the BOM is the first character of a file or transmission; it does not appear in any other part of the file, nor is it placed at the front of each data packet, unless a higher-level protocol identifies "logical packet" boundaries.
There is even a "UTF-8 BOM", which is a bit of a misnomer because UTF-8 has no byte order issues; but including the byte sequence EF BB BF at the front of a file traditionally indicates that the file uses UTF-8 encoding. (See, for example, The Unicode Standard Version 3.0, section 13.6 (p. 324), for a more detailed discussion of byte order marks and their use and interpretation.) As with the UTF-16 BOM, this sequence is considered metadata and would, if found at the front of a file, be discarded.
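Putting the three signatures together, a reader can sniff the front of a buffer along these lines (a sketch; the BomKind enumeration and DetectBom are names invented here for illustration, not a standard API):

```cpp
#include <stddef.h>

// Encodings that can be signaled by a leading byte order mark; offset
// receives the number of BOM bytes the caller should discard
enum BomKind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

BomKind DetectBom(const unsigned char * p, size_t len, size_t & offset)
    {
     offset = 0;
     if(len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        { offset = 3; return BOM_UTF8; }     // UTF-8 "BOM"
     if(len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        { offset = 2; return BOM_UTF16_LE; } // little-endian UTF-16
     if(len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        { offset = 2; return BOM_UTF16_BE; } // big-endian UTF-16
     return BOM_NONE;  // no BOM; the caller must guess or assume
    }
```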
You have to decide, based on your specific needs, which representation is the best choice for text file or transfer representation. I tend to favor UTF-8 for these purposes. Databases, on the other hand, should typically use UTF-16 encoding (most popular database systems, such as SQL Server, only use UTF-16 text internally).
UTF-8 (originally, Unicode Transformation Format; now it is just an acronym for an encoding) is an 8-bit variable-length character encoding for Unicode characters.
| Scalar value (binary) | First byte | Second byte | Third byte | Fourth byte |
|-----------------------|------------|-------------|------------|-------------|
| 0000 0000 0xxx xxxx | 0xxx xxxx | | | |
| 0000 0yyy yyxx xxxx | 110y yyyy | 10xx xxxx | | |
| zzzz yyyy yyxx xxxx | 1110 zzzz | 10yy yyyy | 10xx xxxx | |
| 000u uuuu zzzz yyyy yyxx xxxx | 1111 0uuu | 10uu zzzz | 10yy yyyy | 10xx xxxx |
In UTF-8, every byte of the encoding is non-zero. The high-order bits of the first byte give the length of the encoding. If the high-order bit is 0, it is a one-byte encoding. If the high-order three bits are 110, it is a two-byte encoding. If the high-order four bits are 1110, it is a three-byte encoding, and if the high-order four bits are 1111 (more precisely, the high-order five bits are 11110), the encoding is four bytes. Any byte whose high-order two bits are 10 is an interior byte of an encoding. Therefore, you can look at any byte of a string and quickly locate the start byte of the sequence you are in, or the start of the next sequence.
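The classification is mechanical enough to capture in a few lines; this sketch (the helper names are mine, not from any library) shows both the length computation and the resynchronization trick:

```cpp
#include <stddef.h>

// Returns the sequence length implied by a UTF-8 lead byte, or 0 if the
// byte is an interior byte (high-order bits 10) or not a legal lead byte
int Utf8SequenceLength(unsigned char b)
    {
     if((b & 0x80) == 0x00) return 1;  // 0xxx xxxx: one byte
     if((b & 0xE0) == 0xC0) return 2;  // 110x xxxx: two bytes
     if((b & 0xF0) == 0xE0) return 3;  // 1110 xxxx: three bytes
     if((b & 0xF8) == 0xF0) return 4;  // 1111 0xxx: four bytes
     return 0;                         // 10xx xxxx interior, or invalid
    }

// Backs up from an arbitrary position to the start of the sequence that
// contains it, by skipping interior bytes
size_t Utf8SequenceStart(const unsigned char * s, size_t pos)
    {
     while(pos > 0 && (s[pos] & 0xC0) == 0x80)
        --pos;
     return pos;
    }
```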
I show here two subroutines which can be used in both Unicode (well, UTF-16) builds and ANSI builds. The technique shown uses a CByteArray as the representation of the encoded data, although you could rewrite the code to use a CString. The source code can be copied from this page, or downloaded as part of another project on Asynchronous Socket Communication.
In VS.NET and later, you can use the CStringA type to pass the encoded information in, and you can produce the resulting encoding in a CStringA. This code was designed to work in VS6 as well, which does not have the CStringA and CStringW data types, so I was limited in what I could code.
The key to understanding parts of this code is that there is no direct ANSI-to-UTF-8 or UTF-8-to-ANSI conversion path. So if I have ANSI input or output but want UTF-8, I must first convert through Unicode. Note also that a conversion can fail because the string contains UTF-8-encoded characters that cannot be converted to ANSI without losing data.
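If you need to know whether such loss occurred, WideCharToMultiByte can report it through its lpUsedDefaultChar parameter; a sketch (ConvertsLosslesslyToAnsi is a name made up for illustration) might look like this:

```cpp
#include <windows.h>

// Sketch: returns TRUE if a Unicode string converts to the ANSI code
// page without data loss; usedDefault becomes TRUE if any character had
// to be replaced by the code page's default character
BOOL ConvertsLosslesslyToAnsi(LPCWSTR s)
    {
     int n = ::WideCharToMultiByte(CP_ACP, 0, s, -1, NULL, 0, NULL, NULL);
     if(n == 0)
        return FALSE;          // conversion could not even be sized
     char * buf = new char[n];
     BOOL usedDefault = FALSE;
     n = ::WideCharToMultiByte(CP_ACP, 0, s, -1, buf, n, NULL, &usedDefault);
     delete [] buf;
     return n != 0 && !usedDefault;
    }
```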
This code was constructed from a schema generated by my Locale Explorer.
```cpp
/****************************************************************************
*                       ConvertReceivedDataToString
* Inputs:
*       CByteArray & data: Raw data in UTF-8 format
* Result: CString
*       A string representing the data
* Effect:
*       Converts the data from UTF-8 to ANSI or Unicode string
* Notes:
*       To convert to ANSI, it is first turned into Unicode using CP_UTF8
*       as the source page, then converted from Unicode to ANSI by using
*       CP_ACP as the target page
****************************************************************************/

CString ConvertReceivedDataToString(CByteArray & data)
    { // data is UTF-8 encoded
     CArray<WCHAR, WCHAR> wc;

     // First, compute the amount of space required. Because an explicit
     // input length is passed (not -1), n counts only the converted
     // characters; it does not include a terminal NUL character
     INT_PTR n = ::MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)data.GetData(),
                                       (int)data.GetSize(), NULL, 0);
     if(n == 0)
        { /* failed */
         DWORD err = ::GetLastError();
         TRACE(_T("MultiByteToWideChar (1) returned error %d\n"), err);
         return CString(_T(""));
        } /* failed */
     else
        { /* success */
         wc.SetSize(n);
         n = ::MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)data.GetData(),
                                   (int)data.GetSize(),
                                   (LPWSTR)wc.GetData(), (int)n);
         if(n == 0)
            { /* failed */
             DWORD err = ::GetLastError();
             TRACE(_T("MultiByteToWideChar (2) returned error %d\n"), err);
             return CString(_T(""));
            } /* failed */
        } /* success */

     // Data is now in Unicode
     // If we are a Unicode app we are done
     // If we are an ANSI app, convert it back to ANSI
#ifdef _UNICODE
     // If this is a Unicode app we are done
     return CString(wc.GetData(), (int)wc.GetSize());
#else // ANSI
     // Convert back to ANSI
     CString s;
     n = ::WideCharToMultiByte(CP_ACP, 0, (LPCWSTR)wc.GetData(),
                               (int)wc.GetSize(), NULL, 0, NULL, NULL);
     if(n == 0)
        { /* failed */
         DWORD err = ::GetLastError();
         TRACE("WideCharToMultiByte (1) returned error %d\n", err);
         return CString("");
        } /* failed */
     else
        { /* success */
         LPSTR p = s.GetBuffer((int)n);
         n = ::WideCharToMultiByte(CP_ACP, 0, wc.GetData(),
                                   (int)wc.GetSize(), p, (int)n,
                                   NULL, NULL);
         if(n == 0)
            { /* conversion failed */
             DWORD err = ::GetLastError();
             TRACE("WideCharToMultiByte (2) returned error %d\n", err);
             s.ReleaseBuffer();
             return CString("");
            } /* conversion failed */
         s.ReleaseBuffer();
         return s;
        } /* success */
#endif
    } // ConvertReceivedDataToString

/****************************************************************************
*                        ConvertStringToSendData
* Inputs:
*       const CString & s: String to send
*       CByteArray & msg: Place to format message
* Result: BOOL
*       TRUE if successful
*       FALSE if error
* Effect:
*       Converts the data to a byte stream for transmission
****************************************************************************/

BOOL ConvertStringToSendData(const CString & s, CByteArray & msg)
    {
#ifdef _UNICODE
     // The string is already Unicode; convert it directly to UTF-8.
     // Passing -1 converts the terminal NUL as well, so n includes it
     int n = ::WideCharToMultiByte(CP_UTF8, 0, s, -1, NULL, 0, NULL, NULL);
     if(n == 0)
        { /* failed */
         DWORD err = ::GetLastError();
         msg.SetSize(0);
         return FALSE;
        } /* failed */
     else
        { /* success */
         msg.SetSize(n);
         n = ::WideCharToMultiByte(CP_UTF8, 0, s, -1,
                                   (LPSTR)msg.GetData(), n, NULL, NULL);
         if(n == 0)
            { /* conversion failed */
             DWORD err = ::GetLastError();
             msg.SetSize(0);
             return FALSE;
            } /* conversion failed */
         else
            { /* use multibyte string */
             msg.SetSize(n - 1); // drop the terminal NUL
             return TRUE;
            } /* use multibyte string */
        } /* success */
#else // ANSI
     // There is no direct ANSI-to-UTF-8 path: convert to Unicode first
     CArray<WCHAR, WCHAR> wc;
     int n = ::MultiByteToWideChar(CP_ACP, 0, s, -1, NULL, 0);
     if(n == 0)
        { /* failed */
         DWORD err = ::GetLastError();
         msg.SetSize(0);
         return FALSE;
        } /* failed */
     else
        { /* success */
         wc.SetSize(n);
         n = ::MultiByteToWideChar(CP_ACP, 0, s, -1, wc.GetData(), n);
        } /* success */

     // Now convert the Unicode to UTF-8
     n = ::WideCharToMultiByte(CP_UTF8, 0, wc.GetData(), -1,
                               NULL, 0, NULL, NULL);
     if(n == 0)
        { /* failed */
         DWORD err = ::GetLastError();
         msg.SetSize(0);
         return FALSE;
        } /* failed */
     else
        { /* success */
         msg.SetSize(n);
         n = ::WideCharToMultiByte(CP_UTF8, 0, wc.GetData(), -1,
                                   (LPSTR)msg.GetData(), n, NULL, NULL);
         if(n == 0)
            { /* conversion failed */
             DWORD err = ::GetLastError();
             msg.SetSize(0);
             return FALSE;
            } /* conversion failed */
         else
            { /* use multibyte string */
             msg.SetSize(n - 1); // drop the terminal NUL
             return TRUE;
            } /* use multibyte string */
        } /* success */
#endif
    } // ConvertStringToSendData
```
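A minimal round trip using these two subroutines might look like the following sketch; the string literal is arbitrary:

```cpp
void RoundTripExample()
    {
     CString original = _T("Hello, world");
     CByteArray msg;
     if(ConvertStringToSendData(original, msg))
        { /* encoded */
         // ... transmit msg.GetData(), msg.GetSize() bytes ...
         CString received = ConvertReceivedDataToString(msg);
         ASSERT(received == original);   // lossless for this input
        } /* encoded */
    }
```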
The names "big-endian" and "little-endian" come from an era in which there was an actual belief that these ideas mattered in the slightest. Defenders of each architecture waxed wroth at the heretics who believed in the other arrangement of data. This was likened to the state of war described in Jonathan Swift's Gulliver's Travels between the inhabitants of Lilliput and the neighboring kingdom of Blefuscu: one group believed in cracking their soft-boiled breakfast eggs at the little end, and the other believed in cracking their eggs at the big end. Danny Cohen appealed to this satire (of the break between England and the Roman Catholic Church) in his article "On Holy Wars and a Plea for Peace", working note IEN 137 (http://www.ietf.org/rfc/ien/ien137.txt). Ultimately, practice proved that (a) the byte order of a computer matters not in the slightest and (b) swapping it is trivial, nearly all the time.
| Date | Description |
|------|-------------|
| 4-Apr-07 | First release, as an accompanying article on CAsyncSocket programming with multiple threads |
| 16-Dec-07 | Revised and added more discussion of the issues of UTF-8 vs. UTF-16; now more of a standalone article |
The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.