
#838 closed bug (invalid)

UTF8ToCharCode and UTF8ToLength for 5 and 6 byte characters

Reported by:	marcusoverhagen	Owned by:	axeld
Priority:	normal	Milestone:	R1
Component:	Kits/Interface Kit	Version:
Keywords:		Cc:
Blocked By:		Blocking:
Platform:	All

Description

I think UTF8ToCharCode and UTF8ToLength should be changed to support 5 and 6 byte character sequences.

RFC 3629 states:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 
accessible range) are encoded using sequences of 1 to 4 octets.

It does not limit UTF-8 to that range.

also:

[...] the ISO/IEC 10646 description of UTF-8 allows encoding character 
numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes.
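To make the ranges in the two quotes concrete, here is a minimal sketch of how many bytes the original ISO/IEC 10646 scheme needs per code point. The function name EncodedLengthISO10646 is hypothetical and not part of the Haiku sources; it only illustrates the bit ranges (7, 11, 16, 21, 26 and 31 bits of payload).

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the Haiku tree: number of bytes the
// original ISO/IEC 10646 definition of UTF-8 needs for a code point.
// Returns 0 for values above 0x7FFFFFFF, which cannot be encoded at all.
static int
EncodedLengthISO10646(uint32_t codePoint)
{
	if (codePoint <= 0x7F)
		return 1;	// 7 bits of payload
	if (codePoint <= 0x7FF)
		return 2;	// 11 bits
	if (codePoint <= 0xFFFF)
		return 3;	// 16 bits
	if (codePoint <= 0x1FFFFF)
		return 4;	// 21 bits, covers U+10FFFF
	if (codePoint <= 0x3FFFFFF)
		return 5;	// 26 bits, outside the Unicode range
	if (codePoint <= 0x7FFFFFFF)
		return 6;	// 31 bits, outside the Unicode range
	return 0;
}
```

Everything up to U+10FFFF fits in 4 bytes; the 5 and 6 byte forms only come into play for values beyond that.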

There's a nice overview of this that I found on this page.

Change History (1)

comment:1 by axeld, 18 years ago

Resolution:	→ invalid
Status:	new → closed

Quoting from The Unicode Standard, version 4.0, section C.3:

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as an encoding form of Unicode characters."

So I think we can safely handle those characters as invalid.
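As a rough illustration of what "handle as invalid" could look like, here is a sketch that classifies a lead byte and returns 0 for anything that would start a 5 or 6 byte sequence. This is not the actual UTF8ToLength code; SequenceLengthFromLeadByte is a hypothetical name, and the sketch is somewhat stricter than the ticket requires, since it also rejects the lead bytes that RFC 3629 rules out (0xC0, 0xC1 and 0xF5 to 0xF7).

```cpp
#include <cstdint>

// Hypothetical helper, not the Haiku implementation: length of the UTF-8
// sequence announced by a lead byte, or 0 if the byte cannot start a valid
// sequence when UTF-8 is used as an encoding form of Unicode.
static int
SequenceLengthFromLeadByte(uint8_t lead)
{
	if (lead < 0x80)
		return 1;	// 0xxxxxxx: plain ASCII
	if (lead < 0xC2)
		return 0;	// continuation bytes, plus overlong leads 0xC0/0xC1
	if (lead < 0xE0)
		return 2;	// 110xxxxx
	if (lead < 0xF0)
		return 3;	// 1110xxxx
	if (lead < 0xF5)
		return 4;	// 0xF0..0xF4: up to U+10FFFF
	// 0xF5..0xF7 would exceed U+10FFFF, 0xF8..0xFB would start a 5 byte
	// sequence, 0xFC/0xFD a 6 byte sequence, and 0xFE/0xFF are never valid.
	return 0;
}
```

A caller would treat a return value of 0 as an invalid character and skip or replace the byte. The lead byte check alone does not validate the continuation bytes that follow.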
