Opened 18 years ago

Closed 18 years ago

#838 closed bug (invalid)

UTF8ToCharCode and UTF8ToLength for 5 and 6 byte characters

Reported by: marcusoverhagen
Owned by: axeld
Priority: normal
Milestone: R1
Component: Kits/Interface Kit
Version:
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

I think UTF8ToCharCode and UTF8ToLength should be changed to support 5 and 6 byte character sequences.

RFC 3629 states:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 
accessible range) are encoded using sequences of 1 to 4 octets.

It does not limit UTF-8 to that range.

also:

[...] the ISO/IEC 10646 description of UTF-8 allows encoding character 
numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.

There's a nice overview of this that I found on this page.

Change History (1)

comment:1 by axeld, 18 years ago

Resolution: invalid
Status: new → closed

Quoting from The Unicode Standard, version 4.0, section C.3:

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters."

So I think we can safely handle those characters as invalid.
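For illustration, rejecting those sequences at the lead byte could look roughly like the sketch below. `Utf8SequenceLength` is a hypothetical helper written for this example, not the actual UTF8ToLength code, and the return value 0 for "invalid" is an assumed convention. The accepted lead-byte ranges follow RFC 3629.

```cpp
#include <cstdint>

// Hypothetical helper (not the Haiku implementation): returns the sequence
// length implied by a UTF-8 lead byte, or 0 if the byte cannot start a
// sequence that is valid for Unicode (U+0000..U+10FFFF).
// Note: this inspects the lead byte only. A full decoder must still check
// the continuation bytes (overlong forms, surrogates, values > U+10FFFF).
constexpr int Utf8SequenceLength(std::uint8_t lead)
{
    return lead < 0x80 ? 1                        // 0xxxxxxx: one byte (ASCII)
        : (lead >= 0xC2 && lead <= 0xDF) ? 2      // 110xxxxx; 0xC0/0xC1 are overlong
        : (lead >= 0xE0 && lead <= 0xEF) ? 3      // 1110xxxx
        : (lead >= 0xF0 && lead <= 0xF4) ? 4      // 11110xxx, capped at U+10FFFF
        : 0;  // continuation bytes (0x80..0xBF), 0xF5..0xF7, and the
              // five- and six-byte lead bytes 0xF8..0xFD, plus 0xFE/0xFF
}
```

With this convention, a lead byte such as 0xF8 (five-byte form) or 0xFC (six-byte form) is reported as invalid rather than as a length of 5 or 6.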
