Opened 14 years ago
Closed 5 years ago
#6447 closed bug (fixed)
StyledEdit doesn't check for valid utf-8
Reported by: | zooey | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | R1/beta2 |
Component: | Applications/StyledEdit | Version: | R1/alpha2 |
Keywords: | Cc: | ||
Blocked By: | #9395 | Blocking: | #7954 |
Platform: | All |
Description
If you use StyledEdit to open a file that doesn't contain utf-8 but text encoded in a different charset, BTextView will get confused about the character widths and strange things will happen when editing near the bogus characters.
Easily reproducible with /boot/system/data/Canna/default/default.canna.
So I guess StyledEdit (and the file-based BTextView::SetText()) should check the file content for being valid utf-8 and complain if it isn't.
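A minimal sketch of the kind of check suggested above, assuming the file content is already in memory. `IsValidUtf8` is a hypothetical helper name, not an existing Haiku API; it follows the well-formedness rules of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of BTextView or StyledEdit): returns true
// if the buffer is well-formed UTF-8 according to RFC 3629.
bool
IsValidUtf8(const uint8_t* data, size_t length)
{
	size_t i = 0;
	while (i < length) {
		uint8_t lead = data[i];
		if (lead < 0x80) {
			// Plain ASCII byte.
			i++;
			continue;
		}

		size_t extra;		// continuation bytes expected after the lead byte
		uint32_t minimum;	// smallest code point allowed for this length
		uint32_t codePoint;
		if ((lead & 0xE0) == 0xC0) {
			extra = 1; minimum = 0x80; codePoint = lead & 0x1F;
		} else if ((lead & 0xF0) == 0xE0) {
			extra = 2; minimum = 0x800; codePoint = lead & 0x0F;
		} else if ((lead & 0xF8) == 0xF0) {
			extra = 3; minimum = 0x10000; codePoint = lead & 0x07;
		} else {
			// Stray continuation byte or a byte that can never start a sequence.
			return false;
		}

		// The sequence must not run past the end of the buffer.
		if (length - i <= extra)
			return false;

		for (size_t k = 1; k <= extra; k++) {
			uint8_t next = data[i + k];
			if ((next & 0xC0) != 0x80)
				return false;
			codePoint = (codePoint << 6) | (next & 0x3F);
		}

		// Reject overlong encodings, UTF-16 surrogates and out-of-range values.
		if (codePoint < minimum || codePoint > 0x10FFFF
			|| (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
			return false;
		}

		i += extra + 1;
	}
	return true;
}
```

A caller could run this over the file content before handing it to the text view and warn the user when it returns false. Note that a '\0' byte passes this check, since it is valid UTF-8.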
Attachments (3)
Change History (21)
comment:1 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:2 by , 12 years ago
Hm, it looks like this issue (as well as #3065 and #7954) is a consequence of the design of our StyledText data translator.
One possible solution is to add an STXT translator option, "Replace bogus UTF-8 data with ...", and to fix the translator so that it checks for invalid UTF-8 sequences, non-printable characters and '\0' characters before the real end of the data stream. After that we can implement the following:
1) Set the "replace bogus data" option to "off" and try to read the text file.
2) If step 1 fails, ask the user: "File contains invalid characters. Proceed with replacing bogus data?"
3) If the user agrees, switch the option to "on" and try to load the text file again.
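The three steps above could be sketched roughly as follows. This is only an illustration under several assumptions: `SequenceLength`, `ReadText` and `LoadText` are hypothetical names rather than STXT translator code, the user prompt is modelled as a callback, only invalid sequences and '\0' bytes are treated as bogus, and U+FFFD is used as the replacement because the proposal leaves the replacement unspecified.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: length in bytes of the well-formed UTF-8 sequence
// at the start of `s`, or 0 if there is none.
static size_t
SequenceLength(const unsigned char* s, size_t available)
{
	if (available == 0)
		return 0;
	unsigned char lead = s[0];
	if (lead < 0x80)
		return 1;

	size_t length;
	unsigned long codePoint, minimum;
	if ((lead & 0xE0) == 0xC0) {
		length = 2; codePoint = lead & 0x1F; minimum = 0x80;
	} else if ((lead & 0xF0) == 0xE0) {
		length = 3; codePoint = lead & 0x0F; minimum = 0x800;
	} else if ((lead & 0xF8) == 0xF0) {
		length = 4; codePoint = lead & 0x07; minimum = 0x10000;
	} else
		return 0;

	if (available < length)
		return 0;
	for (size_t k = 1; k < length; k++) {
		if ((s[k] & 0xC0) != 0x80)
			return 0;
		codePoint = (codePoint << 6) | (s[k] & 0x3F);
	}
	if (codePoint < minimum || codePoint > 0x10FFFF
		|| (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
		return 0;
	}
	return length;
}

// Steps 1 and 3: read the data with the "replace bogus data" option off or
// on. With the option off, the first bogus byte makes the read fail.
bool
ReadText(const std::string& raw, bool replaceBogus, std::string& text)
{
	text.clear();
	const unsigned char* bytes
		= reinterpret_cast<const unsigned char*>(raw.data());
	size_t i = 0;
	while (i < raw.size()) {
		size_t length = SequenceLength(bytes + i, raw.size() - i);
		bool bogus = length == 0 || bytes[i] == 0;
		if (!bogus) {
			text.append(raw, i, length);
			i += length;
			continue;
		}
		if (!replaceBogus) {
			text.clear();
			return false;
		}
		text += "\xEF\xBF\xBD";	// U+FFFD, an assumed replacement
		i++;
	}
	return true;
}

// Steps 1 to 3 combined; `askUser` stands in for the proposed dialog.
bool
LoadText(const std::string& raw, const std::function<bool()>& askUser,
	std::string& text)
{
	if (ReadText(raw, false, text))
		return true;
	if (!askUser())
		return false;
	return ReadText(raw, true, text);
}
```

With this shape the strict read is always tried first, so well-formed files never trigger the prompt, and the user's answer alone decides whether the lossy second pass runs.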
Are there any opinions about the proposed solution?
PS: Please share your ideas before the GCI competition starts on Monday, 26 Nov. :-) Thank you.
comment:5 by , 12 years ago
Blocking: | 6252 added |
---|
comment:6 by , 12 years ago
Blocked By: | 9395 added |
---|
follow-up: 8 comment:7 by , 12 years ago
I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, with the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?
by , 12 years ago
Attachment: | MediaPlayer_subtitles.png added |
---|
by , 12 years ago
Attachment: | StyleEdit.png added |
---|
follow-up: 9 comment:8 by , 12 years ago
Replying to Premislaus:
I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, with the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?
I have no idea about the MediaPlayer issue, but the StyledEdit one may be related. Please attach this document too (at least its beginning).
follow-up: 10 comment:9 by , 12 years ago
Replying to siarzhuk:
Replying to Premislaus:
I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, with the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?
I have no idea about the MediaPlayer issue, but the StyledEdit one may be related. Please attach this document too (at least its beginning).
OK.
StyleEdit.png is a screenshot of an RTF file.
by , 12 years ago
Attachment: | text_example added |
---|
follow-up: 11 comment:10 by , 12 years ago
Replying to Premislaus:
StyleEdit.png is a screenshot of an RTF file.
Does that mean you opened the RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The rendering of text_example in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe a problem in the RTF translator?
comment:11 by , 12 years ago
Replying to siarzhuk:
Replying to Premislaus:
StyleEdit.png is a screenshot of an RTF file.
Does that mean you opened the RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The rendering of text_example in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe a problem in the RTF translator?
Yeah! This RTF file works fine under Windows, but displays the wrong characters in Haiku. The problem occurs with all the text files that I have. I think it's a general problem, not only with RTF. I have problems with txt and srt files, etc.
follow-ups: 13 14 comment:12 by , 12 years ago
Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.
For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.
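To illustrate why knowing that data is not UTF-8 "doesn't get you that far", here is a small sketch. The two-entry mapping table is deliberately incomplete and covers only the bytes used in the example; `Charset`, `ToCodePoint` and `Decode` are hypothetical names. The byte 0xB1 is "ą" in ISO-8859-2 but "±" in Windows-1250, while 0xB9 is "ą" in Windows-1250 but "š" in ISO-8859-2. Neither byte on its own is valid UTF-8, and both decodings are equally legal, so a validity check alone cannot say which legacy encoding was meant.

```cpp
#include <cstdint>
#include <string>

enum Charset { kIso8859_2, kWindows1250 };

// Demo-only mapping: handles ASCII plus the two bytes discussed above.
// Any other high byte maps to U+FFFD because this table is incomplete.
char32_t
ToCodePoint(uint8_t byte, Charset charset)
{
	if (byte < 0x80)
		return byte;
	switch (byte) {
		case 0xB1:
			// U+0105 (a with ogonek) in ISO-8859-2, U+00B1 (plus-minus)
			// in Windows-1250
			return charset == kIso8859_2 ? 0x0105 : 0x00B1;
		case 0xB9:
			// U+0161 (s with caron) in ISO-8859-2, U+0105 in Windows-1250
			return charset == kIso8859_2 ? 0x0161 : 0x0105;
		default:
			return 0xFFFD;
	}
}

// Decodes a single-byte encoded string into code points.
std::u32string
Decode(const std::string& bytes, Charset charset)
{
	std::u32string result;
	for (unsigned char byte : bytes)
		result += ToCodePoint(byte, charset);
	return result;
}
```

For example, the bytes `6D B1 6B 61` decode to the Polish word "mąka" as ISO-8859-2 but to "m±ka" as Windows-1250. Telling the two apart needs knowledge about the language, which is what statistical detectors try to provide.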
comment:13 by , 12 years ago
Replying to axeld:
Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.
For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.
This is Microsoft's fault, with their stupid encodings. I created a new ticket: #9654
comment:14 by , 12 years ago
Replying to axeld:
Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.
By the way, ICU has character set detectors, and we can benefit from them by wrapping those features in the Locale Kit. This is what #9395 is about. In any case, STXTTranslator already makes some rudimentary homebrew assumptions about the encoding. So, IMO, we will have to get our hands dirty sooner or later.
comment:15 by , 10 years ago
Owner: | changed from | to
---|
Those were taken some years ago as potential GCI tasks. Unfortunately, I have no room for them in my schedule these days.
comment:16 by , 8 years ago
Blocking: | 6252 removed |
---|
comment:18 by , 5 years ago
Milestone: | R1 → R1/beta2 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
This ticket was published as a GCI 2012 task: http://google-melange.appspot.com/gci/task/view/google/gci2012/7986202 So I am taking ownership of it.