Ticket #838 (closed bug: invalid)

Opened 4 years ago

Last modified 4 years ago

UTF8ToCharCode and UTF8ToLength for 5 and 6 byte characters

Reported by: marcusoverhagen
Owned by: axeld
Priority: normal
Milestone: R1
Component: Kits/Interface Kit
Version:
Keywords:
Cc:
Blocked By:
Platform: All
Blocking:

Description

I think UTF8ToCharCode and UTF8ToLength should be changed to support 5 and 6 byte character sequences.

RFC 3629 states:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 
accessible range) are encoded using sequences of 1 to 4 octets.

It does not limit UTF-8 to that range.

It also says:

[...] the ISO/IEC 10646 description of UTF-8 allows encoding character 
numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.

There's a nice overview of this that I found on this page.
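
To make the byte counts concrete, here is a minimal sketch of how many octets a code point needs under the original ISO/IEC 10646 scheme, where each extra byte extends the payload (7, 11, 16, 21, 26, 31 bits). The function name `encoded_length` is hypothetical and is not part of the Interface Kit; it only illustrates why values above U+1FFFFF would need 5 or 6 bytes.

```python
def encoded_length(code_point: int) -> int:
    """Number of bytes needed for code_point under the original
    ISO/IEC 10646 UTF-8 scheme (1 to 6 bytes, up to U+7FFFFFFF).

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit range")

    # Largest value representable with 1, 2, ... 6 bytes.
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise AssertionError("unreachable: range was checked above")
```

Everything up to U+10FFFF fits in 4 bytes here, so the 5 and 6 byte forms only matter for values beyond the range RFC 3629 describes.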

Change History

Changed 4 years ago by axeld

  • status changed from new to closed
  • resolution set to invalid

Quoting from The Unicode Standard, version 4.0, section C.3:

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters."

So I think we can safely handle those characters as invalid.
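
As a rough sketch of what "handle as invalid" could look like, the following maps a lead byte to its sequence length and returns 0 for anything RFC 3629 does not allow, including the 5 and 6 byte lead bytes 0xF8..0xFD. The name `utf8_sequence_length` and the convention of returning 0 for invalid bytes are assumptions for illustration; this is not the actual UTF8ToLength code.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence introduced by a lead byte,
    or 0 if the byte cannot start a valid RFC 3629 sequence.

    Hypothetical helper; the 0-means-invalid convention is an assumption.
    """
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")

    if lead <= 0x7F:
        return 1                    # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                    # 0xC0 and 0xC1 would be overlong
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4                    # 0xF5..0xF7 would exceed U+10FFFF
    # Continuation bytes (0x80..0xBF), overlong leads, and the
    # 5 and 6 byte leads (0xF8..0xFD) are all treated as invalid.
    return 0
```

For example, `utf8_sequence_length(0xF8)` and `utf8_sequence_length(0xFC)` both return 0, so a caller would reject the sequence instead of consuming 5 or 6 bytes.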
