Opened 8 years ago

Closed 8 years ago

#13184 closed bug (fixed)

Infinite loop with bash/readline/ICU on non-BMP Unicode characters

Reported by: jessicah Owned by: pulkomandy
Priority: normal Milestone: Unscheduled
Component: Kits/Locale Kit Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

When trying to cd into a directory with characters outside of the Unicode Basic Multilingual Plane, it triggers an infinite loop parsing the characters.

I've tested the same on Ubuntu, which handles the characters fine. Note that they're not printable without a suitable font, but Haiku should still handle them correctly irregardless.

	thread 835: bash (main)
		state: Debugged

		Frame		IP			Function Name
		-----------------------------------------------
		0x7f5cf19f9fe0	0x12d46e62700	ucnv_toUnicode_57 + 0x60 
			Disassembly:
				ucnv_toUnicode_57:
				0x0000012d46e626a0:               55  push %rbp
				0x0000012d46e626a1:           4889e5  mov %rsp, %rbp
				0x0000012d46e626a4:             4157  push %r15
				0x0000012d46e626a6:             4156  push %r14
				0x0000012d46e626a8:             4155  push %r13
				0x0000012d46e626aa:             4154  push %r12
				0x0000012d46e626ac:               53  push %rbx
				0x0000012d46e626ad:         4883ec68  sub $0x68, %rsp
				0x0000012d46e626b1:         488b5d18  mov 0x18(%rbp), %rbx
				0x0000012d46e626b5:         4c894d88  mov %r9, -0x78(%rbp)
				0x0000012d46e626b9:         448b4d10  mov 0x10(%rbp), %r9d
				0x0000012d46e626bd:           4885db  test %rbx, %rbx
				0x0000012d46e626c0:             7408  jz 0x46e626ca
				0x0000012d46e626c2:           448b13  mov (%rbx), %r10d
				0x0000012d46e626c5:           4585d2  test %r10d, %r10d
				0x0000012d46e626c8:             7e16  jle 0x46e626e0
				0x0000012d46e626ca:         4883c468  add $0x68, %rsp
				0x0000012d46e626ce:               5b  pop %rbx
				0x0000012d46e626cf:             415c  pop %r12
				0x0000012d46e626d1:             415d  pop %r13
				0x0000012d46e626d3:             415e  pop %r14
				0x0000012d46e626d5:             415f  pop %r15
				0x0000012d46e626d7:               5d  pop %rbp
				0x0000012d46e626d8:               c3  ret 
				0x0000012d46e626d9:   0f1f8000000000  nop (%rax)
				0x0000012d46e626e0:           4885f6  test %rsi, %rsi
				0x0000012d46e626e3:           4989d4  mov %rdx, %r12
				0x0000012d46e626e6:           4989f5  mov %rsi, %r13
				0x0000012d46e626e9:           0f94c2  setz %dl
				0x0000012d46e626ec:           4885c9  test %rcx, %rcx
				0x0000012d46e626ef:           4989ce  mov %rcx, %r14
				0x0000012d46e626f2:           0f94c0  setz %al
				0x0000012d46e626f5:             08c2  or %al, %dl
				0x0000012d46e626f7:     0f8533010000  jnz 0x12d46e62830
				0x0000012d46e626fd:           4885ff  test %rdi, %rdi
				0x0000012d46e62700:           4989ff  mov %rdi, %r15 <--

			Frame memory:
				[0x7f5cf19f9f40]  ........c..'....   00 00 00 00 00 00 00 00 63 04 a8 27 80 01 00 00
				[0x7f5cf19f9f50]  ................   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
				[0x7f5cf19f9f60]  8...\.......\...   38 00 00 f1 5c 7f 00 00 a0 a1 9f f1 5c 7f 00 00
				[0x7f5cf19f9f70]  ..6}......6}....   a1 dc 36 7d 00 02 00 00 c2 dc 36 7d 00 02 00 00
				[0x7f5cf19f9f80]  ....\...E..'....   e0 9f 9f f1 5c 7f 00 00 45 13 a8 27 80 01 00 00
				[0x7f5cf19f9f90]  ........(...\...   00 00 00 00 00 00 00 00 28 a0 9f f1 5c 7f 00 00
				[0x7f5cf19f9fa0]  ........P...\...   08 04 00 00 00 00 00 00 50 a1 9f f1 5c 7f 00 00
				[0x7f5cf19f9fb0]  ..6}....8.9}....   a1 dc 36 7d 00 02 00 00 38 c9 39 7d 00 02 00 00
				[0x7f5cf19f9fc0]  ....\...........   10 a0 9f f1 5c 7f 00 00 00 00 00 00 00 00 00 00
				[0x7f5cf19f9fd0]  `...\......'....   60 a0 9f f1 5c 7f 00 00 8c 14 a8 27 80 01 00 00
		0x7f5cf19fa070	0x18027a81487	BPrivate::Libroot::ICUCtypeData::MultibyteToWchar(wchar_t*, char const*, unsigned long, mbstate_t*, unsigned long&) + 0x79 
		0x7f5cf19fa0c0	0x18027a821b9	BPrivate::Libroot::ICULocaleBackend::MultibyteToWchar(wchar_t*, char const*, unsigned long, mbstate_t*, unsigned long&) + 0x45 
		0x7f5cf19fa0e0	0x8eb226e146	mbrtowc + 0xd9 
		0x7f5cf19fa0f0	0x8eb226e066	__mbrlen + 0x20 
		0x7f5cf19fa130	0x1a3cd01ac07	_rl_adjust_point(char*, int, mbstate_t*) + 0x61 (../mbutil.c:-1)

Attachments (4)

bash-835-debug-06-01-2017-14-59-39.report (14.2 KB ) - added by jessicah 8 years ago.
bash-1053-debug-06-01-2017-14-50-45.report (11.3 KB ) - added by jessicah 8 years ago.
directory-with-non-BMP-characters.zip (325 bytes ) - added by jessicah 8 years ago.
0001-MultibyteToWchar-use-ucnv_getNextUChar.patch (2.2 KB ) - added by jessicah 8 years ago.

Download all attachments as: .zip

Change History (13)

comment:1 by jessicah, 8 years ago

Oh, and the name of the directory: 你好 。 𤭢 𤭢 𤭢 𤭢 --- hopefully it comes through correctly...

by jessicah, 8 years ago

comment:2 by pulkomandy, 8 years ago

Component: Applications/TerminalKits/Locale Kit
Owner: changed from jackburton to pulkomandy

Terminal is not involved here, changing component.

There are other known issues with mbrtowc, see #12515. We need to review and fix our widechar support.

in reply to:  2 comment:3 by jessicah, 8 years ago

Replying to pulkomandy:

Terminal is not involved here, changing component.

There are other known issues with mbrtowc, see #12515. We need to review and fix our widechar support.

Ah, thanks! Yeah, I had a mind blank trying to figure out the right component =/ Also, just linking to #6276 here for my reference, so can dig a bit deeper into the whole stack :)

comment:4 by pulkomandy, 8 years ago

From what I remember (we can meet on IRC later if you need me to dig more into things):

  • Our locale implementation relies on ICU to perform most of the work
  • There are some tricks so that libroot can do this without linking directly to libicu (http://cgit.haiku-os.org/haiku/tree/src/system/libroot/add-ons/icu)
  • There are also some hacks to get gcc2 to have some wchar support, and some rewrite of our glibc involved at the time to get this going. This is our own code and there can be bugs in it (unfortunately widechar C API does not seem to get much use)

Things to look for:

  • How contexts for ICU are allocated/cleared when using the C APIs
  • Making sure that ICU does not try to use C library functionality to do its things, resulting in infinite recursion or other weirdness (this happened at least once)
  • Making sure we didn't break anything when backporting ICU to gcc2 (there is a quite large patch at haikuporter for that)
  • Making sure ICU itself works when used directly. It needs the relevant data files, locale settings, etc.
Last edited 8 years ago by pulkomandy (previous) (diff)

comment:5 by jessicah, 8 years ago

patch: 01

comment:6 by jessicah, 8 years ago

I've found that replacing ucnv_toUnicode() with ucnv_getNextUChar() fixes the infinite loop with bash, so there is definitely a bug in our current ICUCtypeData::MultibyteToWchar() function.

The buffer string that bash/readline iterates through (annotated):

"x2f  \x20  \xe4\xbd\xa0  \xe5\xa5\xbd  \x20   \xe3\x80\x82  \x20   \xf0\xa4\xad\xa2  \x20 ..."
 '\'  ' '   '你'          '好'          ' '    '。'           ' '    '𤭢'              ' '  ..."

As it steps through successive calls to MultibyteToWchar, I noticed a slight oddity in the values of sourceLengthUsed & targetLengthUsed.

"\xe4\xbd\xa0",     sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 20320
"\xe5\xa5\xbd",     sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 22909
"\x20",             sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32
"\xe3\x80\x82",     sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 12290
"\x20",             sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32
"\xf0\xa4\xad\xa2", sourceLengthUsed = 4, targetLengthUsed = 1, unicodeChar = 55378
"\x20",             sourceLengthUsed = 0, targetLengthUsed = 1, unicodeChar = 57186 <---
"\x20",             sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32

It seems like when we're using ucnv_toUnicode, we aren't doing a proper conversion to UTF32 for characters outside of the BMP...

comment:7 by pulkomandy, 8 years ago

This sequence looks right, but the output from ICU isn't UTF32, UChar is a 16bit type. As a result, \xf0\xa4\xad\xa2 is consumed, and two UChars are output: 55378, then 57186 (which does not need to use any extra input bytes). Then the conversion continues with the next character.

If we want UTF32, then this is not the right function to use.

ucnv_getNextUChar, on the other hand, does return an UChar32 (talk about a confusing naming for that funnction...), so indeed that fixes the problem. However, it does not work for streaming conversion.

So the generic solution would be to use ucnv_convertEx, and ask for an output in UTF32.

Last edited 8 years ago by pulkomandy (previous) (diff)

comment:8 by jessicah, 8 years ago

Mm, I added some more tracing to my modified version; indeed, using ucnv_getNextUChar() returns the single value 150370 instead, and source length is still 4.

I'm not sure if using ucnv_getNextUChar() is the right fix here though. Not sure this would work correctly on invalid sequences, as required by mbrtowc.

Also, I've noticed in WcharToMultibyte() that we a) convert wchar_t from UTF-32 to UTF-16, and then operate on UTF-16 for doing the actual conversion.

So, if I'm understanding WcharToMultibyte() correctly, then apparently our wchar_t is indeed UTF-32, not UTF-16, which means our MultibyteToWchar should handle UTF-16 surrogate pairs correctly. Seeing how WcharToMultibyte() is implemented, I think I may be able to provide a proper patch :-)

comment:9 by jessicah, 8 years ago

Resolution: fixed
Status: newclosed

Fixed in hrev50859.

Note: See TracTickets for help on using tickets.