Opened 18 years ago
Closed 18 years ago
#838 closed bug (invalid)
UTF8ToCharCode and UTF8ToLength for 5 and 6 byte characters
Reported by: | marcusoverhagen | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | Kits/Interface Kit | Version: | |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Platform: | All | | |
Description
I think UTF8ToCharCode and UTF8ToLength should be changed to support 5 and 6 byte character sequences.
RFC 3629 states:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.
It does not limit UTF-8 to that range.
It also says:
[...] the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.
There's a nice overview of this that I found on this page.
Quoting from The Unicode Standard, version 4.0, section C.3:
"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters."
So I think we can safely handle those characters as invalid.
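To make the distinction concrete, here is a minimal sketch of how a lead byte maps to a sequence length under the older ISO/IEC 10646 scheme (up to 6 bytes), and how the Unicode / RFC 3629 restriction rejects the 5 and 6 byte forms. The names `Utf8LeadByteLength` and `Utf8LeadByteIsValidUnicode` are hypothetical and only for illustration; this is not the actual UTF8ToLength or UTF8ToCharCode code.

```cpp
#include <cstdint>

// Hypothetical helper, not the Interface Kit implementation.
// Returns the sequence length implied by a lead byte under the original
// ISO/IEC 10646 definition (1 to 6 bytes), or 0 if the byte cannot
// start a sequence at all.
inline int
Utf8LeadByteLength(uint8_t lead)
{
	if (lead < 0x80)
		return 1;	// 0xxxxxxx
	if (lead < 0xC0)
		return 0;	// 10xxxxxx is a continuation byte
	if (lead < 0xE0)
		return 2;	// 110xxxxx
	if (lead < 0xF0)
		return 3;	// 1110xxxx
	if (lead < 0xF8)
		return 4;	// 11110xxx
	if (lead < 0xFC)
		return 5;	// 111110xx (outside Unicode)
	if (lead < 0xFE)
		return 6;	// 1111110x (outside Unicode)
	return 0;		// 0xFE and 0xFF are never lead bytes
}

// Hypothetical helper: true if the lead byte may start a sequence that is
// legal as an encoding of a Unicode character. RFC 3629 excludes the
// octets 0xC0, 0xC1 and 0xF5..0xFF, which covers all 5 and 6 byte forms.
inline bool
Utf8LeadByteIsValidUnicode(uint8_t lead)
{
	if (lead == 0xC0 || lead == 0xC1 || lead > 0xF4)
		return false;
	return Utf8LeadByteLength(lead) != 0;
}
```

With this split, a lead byte such as 0xF8 still reports a length of 5 under the ISO scheme, but is rejected as soon as the Unicode restriction is applied, which matches treating those characters as invalid.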