Opened 8 years ago
Closed 8 years ago
#13184 closed bug (fixed)
Infinite loop with bash/readline/ICU on non-BMP Unicode characters
Reported by: | jessicah | Owned by: | pulkomandy |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | Kits/Locale Kit | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
When trying to cd into a directory with characters outside of the Unicode Basic Multilingual Plane, it triggers an infinite loop parsing the characters.
I've tested the same on Ubuntu, which handles the characters fine. Note that they're not printable without a suitable font, but Haiku should still handle them correctly irregardless.
thread 835: bash (main) state: Debugged Frame IP Function Name ----------------------------------------------- 0x7f5cf19f9fe0 0x12d46e62700 ucnv_toUnicode_57 + 0x60 Disassembly: ucnv_toUnicode_57: 0x0000012d46e626a0: 55 push %rbp 0x0000012d46e626a1: 4889e5 mov %rsp, %rbp 0x0000012d46e626a4: 4157 push %r15 0x0000012d46e626a6: 4156 push %r14 0x0000012d46e626a8: 4155 push %r13 0x0000012d46e626aa: 4154 push %r12 0x0000012d46e626ac: 53 push %rbx 0x0000012d46e626ad: 4883ec68 sub $0x68, %rsp 0x0000012d46e626b1: 488b5d18 mov 0x18(%rbp), %rbx 0x0000012d46e626b5: 4c894d88 mov %r9, -0x78(%rbp) 0x0000012d46e626b9: 448b4d10 mov 0x10(%rbp), %r9d 0x0000012d46e626bd: 4885db test %rbx, %rbx 0x0000012d46e626c0: 7408 jz 0x46e626ca 0x0000012d46e626c2: 448b13 mov (%rbx), %r10d 0x0000012d46e626c5: 4585d2 test %r10d, %r10d 0x0000012d46e626c8: 7e16 jle 0x46e626e0 0x0000012d46e626ca: 4883c468 add $0x68, %rsp 0x0000012d46e626ce: 5b pop %rbx 0x0000012d46e626cf: 415c pop %r12 0x0000012d46e626d1: 415d pop %r13 0x0000012d46e626d3: 415e pop %r14 0x0000012d46e626d5: 415f pop %r15 0x0000012d46e626d7: 5d pop %rbp 0x0000012d46e626d8: c3 ret 0x0000012d46e626d9: 0f1f8000000000 nop (%rax) 0x0000012d46e626e0: 4885f6 test %rsi, %rsi 0x0000012d46e626e3: 4989d4 mov %rdx, %r12 0x0000012d46e626e6: 4989f5 mov %rsi, %r13 0x0000012d46e626e9: 0f94c2 setz %dl 0x0000012d46e626ec: 4885c9 test %rcx, %rcx 0x0000012d46e626ef: 4989ce mov %rcx, %r14 0x0000012d46e626f2: 0f94c0 setz %al 0x0000012d46e626f5: 08c2 or %al, %dl 0x0000012d46e626f7: 0f8533010000 jnz 0x12d46e62830 0x0000012d46e626fd: 4885ff test %rdi, %rdi 0x0000012d46e62700: 4989ff mov %rdi, %r15 <-- Frame memory: [0x7f5cf19f9f40] ........c..'.... 00 00 00 00 00 00 00 00 63 04 a8 27 80 01 00 00 [0x7f5cf19f9f50] ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [0x7f5cf19f9f60] 8...\.......\... 38 00 00 f1 5c 7f 00 00 a0 a1 9f f1 5c 7f 00 00 [0x7f5cf19f9f70] ..6}......6}.... a1 dc 36 7d 00 02 00 00 c2 dc 36 7d 00 02 00 00 [0x7f5cf19f9f80] ....\...E..'.... e0 9f 9f f1 5c 7f 00 00 45 13 a8 27 80 01 00 00 [0x7f5cf19f9f90] ........(...\... 00 00 00 00 00 00 00 00 28 a0 9f f1 5c 7f 00 00 [0x7f5cf19f9fa0] ........P...\... 08 04 00 00 00 00 00 00 50 a1 9f f1 5c 7f 00 00 [0x7f5cf19f9fb0] ..6}....8.9}.... a1 dc 36 7d 00 02 00 00 38 c9 39 7d 00 02 00 00 [0x7f5cf19f9fc0] ....\........... 10 a0 9f f1 5c 7f 00 00 00 00 00 00 00 00 00 00 [0x7f5cf19f9fd0] `...\......'.... 60 a0 9f f1 5c 7f 00 00 8c 14 a8 27 80 01 00 00 0x7f5cf19fa070 0x18027a81487 BPrivate::Libroot::ICUCtypeData::MultibyteToWchar(wchar_t*, char const*, unsigned long, mbstate_t*, unsigned long&) + 0x79 0x7f5cf19fa0c0 0x18027a821b9 BPrivate::Libroot::ICULocaleBackend::MultibyteToWchar(wchar_t*, char const*, unsigned long, mbstate_t*, unsigned long&) + 0x45 0x7f5cf19fa0e0 0x8eb226e146 mbrtowc + 0xd9 0x7f5cf19fa0f0 0x8eb226e066 __mbrlen + 0x20 0x7f5cf19fa130 0x1a3cd01ac07 _rl_adjust_point(char*, int, mbstate_t*) + 0x61 (../mbutil.c:-1)
Attachments (4)
Change History (13)
by , 8 years ago
Attachment: | bash-835-debug-06-01-2017-14-59-39.report added |
---|
by , 8 years ago
Attachment: | bash-1053-debug-06-01-2017-14-50-45.report added |
---|
comment:1 by , 8 years ago
by , 8 years ago
Attachment: | directory-with-non-BMP-characters.zip added |
---|
follow-up: 3 comment:2 by , 8 years ago
Component: | Applications/Terminal → Kits/Locale Kit |
---|---|
Owner: | changed from | to
Terminal is not involved here, changing component.
There are other known issues with mbrtowc, see #12515. We need to review and fix our widechar support.
comment:3 by , 8 years ago
Replying to pulkomandy:
Terminal is not involved here, changing component.
There are other known issues with mbrtowc, see #12515. We need to review and fix our widechar support.
Ah, thanks! Yeah, I had a mind blank trying to figure out the right component =/ Also, just linking to #6276 here for my reference, so can dig a bit deeper into the whole stack :)
comment:4 by , 8 years ago
From what I remember (we can meet on IRC later if you need me to dig more into things):
- Our locale implementation relies on ICU to perform most of the work
- There are some tricks so that libroot can do this without linking directly to libicu (http://cgit.haiku-os.org/haiku/tree/src/system/libroot/add-ons/icu)
- There are also some hacks to get gcc2 to have some wchar support, and some rewrite of our glibc involved at the time to get this going. This is our own code and there can be bugs in it (unfortunately widechar C API does not seem to get much use)
Things to look for:
- How contexts for ICU are allocated/cleared when using the C APIs
- Making sure that ICU does not try to use C library functionality to do its things, resulting in infinite recursion or other weirdness (this happened at least once)
- Making sure we didn't break anything when backporting ICU to gcc2 (there is a quite large patch at haikuporter for that)
- Making sure ICU itself works when used directly. It needs the relevant data files, locale settings, etc.
by , 8 years ago
Attachment: | 0001-MultibyteToWchar-use-ucnv_getNextUChar.patch added |
---|
comment:5 by , 8 years ago
patch: | 0 → 1 |
---|
comment:6 by , 8 years ago
I've found that replacing ucnv_toUnicode()
with ucnv_getNextUChar()
fixes the infinite loop with bash, so there is definitely a bug in our current ICUCtypeData::MultibyteToWchar() function.
The buffer string that bash/readline iterates through (annotated):
"x2f \x20 \xe4\xbd\xa0 \xe5\xa5\xbd \x20 \xe3\x80\x82 \x20 \xf0\xa4\xad\xa2 \x20 ..." '\' ' ' '你' '好' ' ' '。' ' ' '𤭢' ' ' ..."
As it steps through successive calls to MultibyteToWchar
, I noticed a slight oddity in the values of sourceLengthUsed & targetLengthUsed.
"\xe4\xbd\xa0", sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 20320 "\xe5\xa5\xbd", sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 22909 "\x20", sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32 "\xe3\x80\x82", sourceLengthUsed = 3, targetLengthUsed = 1, unicodeChar = 12290 "\x20", sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32 "\xf0\xa4\xad\xa2", sourceLengthUsed = 4, targetLengthUsed = 1, unicodeChar = 55378 "\x20", sourceLengthUsed = 0, targetLengthUsed = 1, unicodeChar = 57186 <--- "\x20", sourceLengthUsed = 1, targetLengthUsed = 1, unicodeChar = 32
It seems like when we're using ucnv_toUnicode, we aren't doing a proper conversion to UTF32 for characters outside of the BMP...
comment:7 by , 8 years ago
This sequence looks right, but the output from ICU isn't UTF32, UChar is a 16bit type. As a result, \xf0\xa4\xad\xa2 is consumed, and two UChars are output: 55378, then 57186 (which does not need to use any extra input bytes). Then the conversion continues with the next character.
If we want UTF32, then this is not the right function to use.
ucnv_getNextUChar, on the other hand, does return an UChar32 (talk about a confusing naming for that funnction...), so indeed that fixes the problem. However, it does not work for streaming conversion.
So the generic solution would be to use ucnv_convertEx, and ask for an output in UTF32.
comment:8 by , 8 years ago
Mm, I added some more tracing to my modified version; indeed, using ucnv_getNextUChar()
returns the single value 150370 instead, and source length is still 4.
I'm not sure if using ucnv_getNextUChar()
is the right fix here though. Not sure this would work correctly on invalid sequences, as required by mbrtowc
.
Also, I've noticed in WcharToMultibyte()
that we a) convert wchar_t from UTF-32 to UTF-16, and then operate on UTF-16 for doing the actual conversion.
So, if I'm understanding WcharToMultibyte()
correctly, then apparently our wchar_t is indeed UTF-32, not UTF-16, which means our MultibyteToWchar
should handle UTF-16 surrogate pairs correctly. Seeing how WcharToMultibyte()
is implemented, I think I may be able to provide a proper patch :-)
Oh, and the name of the directory:
你好 。 𤭢 𤭢 𤭢 𤭢
--- hopefully it comes through correctly...