Opened 14 years ago

Closed 4 years ago

#6447 closed bug (fixed)

StyledEdit doesn't check for valid utf-8

Reported by: zooey
Owned by: nobody
Priority: normal
Milestone: R1/beta2
Component: Applications/StyledEdit
Version: R1/alpha2
Keywords:
Cc:
Blocked By: #9395
Blocking: #7954
Platform: All

Description

If you use StyledEdit to open a file that contains text encoded in a charset other than UTF-8, BTextView gets confused about the character widths, and strange things happen when editing near the bogus characters.

Easily reproducible with /boot/system/data/Canna/default/default.canna.

So I guess StyledEdit (and the file-based BTextView::SetText()) should check that the file content is valid UTF-8 and complain if it isn't.
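
A minimal sketch of such a check, written in Python rather than Haiku's C++ and not tied to BTextView or StyledEdit. It assumes that a strict UTF-8 decode is an acceptable definition of "valid"; the function names are hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first malformed UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as error:
        # error.start is the index of the first byte that could not be decoded.
        return error.start
    return None


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole buffer decodes as strict UTF-8."""
    return first_invalid_utf8_offset(data) is None
```

Text saved in a single-byte charset such as cp1250 typically fails this check at its first non-ASCII byte, which is the point where an editor could complain instead of displaying the data.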

Attachments (3)

MediaPlayer_subtitles.png (482.2 KB ) - added by Premislaus 12 years ago.
StyleEdit.png (169.6 KB ) - added by Premislaus 12 years ago.
text_example (13.5 KB ) - added by Premislaus 12 years ago.


Change History (21)

comment:1 by siarzhuk, 12 years ago

Owner: changed from korli to siarzhuk
Status: new → assigned

This ticket was published as a GCI 2012 task: http://google-melange.appspot.com/gci/task/view/google/gci2012/7986202 So I am taking ownership of it.

comment:2 by siarzhuk, 12 years ago

Hm, it looks like this issue (along with #3065 and #7954) is by design of our StyledText data translator.

One possible solution is to add an STXT translator option, "Replace bogus UTF-8 data with ...", and fix the translator to check for invalid UTF-8 sequences, non-printable characters, and '\0' characters before the real end of the data stream. After this we can implement the following:

1) Set the "replace bogus data" option to "off" and try to read the text file.
2) If step 1 fails, nag the user with the question "File contains invalid characters. Proceed with replacing bogus data?".
3) If the user agrees to replace, switch the option to "on" and try to load the text file again.
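
The three steps can be sketched roughly as follows. This is a Python illustration, not the STXT translator's API: `load_text` and `confirm_replace` are hypothetical names, the replacement character is assumed to be U+FFFD, and the check for non-printable characters is left out for brevity.

```python
from typing import Callable, Optional


def load_text(data: bytes, confirm_replace: Callable[[], bool]) -> Optional[str]:
    """Load text strictly first; fall back to replacement only if the user agrees."""
    # Step 1: "replace bogus data" is off, so decode strictly and
    # treat an embedded NUL as bogus data as well.
    try:
        text = data.decode("utf-8", errors="strict")
        if "\0" not in text:
            return text
    except UnicodeDecodeError:
        pass

    # Step 2: ask the user whether to proceed with replacing bogus data.
    if not confirm_replace():
        return None

    # Step 3: "replace bogus data" is on, so every malformed sequence
    # and every NUL becomes U+FFFD.
    return data.decode("utf-8", errors="replace").replace("\0", "\ufffd")
```

Passing the question as a callback keeps the decoding logic free of any user interface, so the same function could serve an editor dialog or a command-line tool.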

Are there any opinions about the proposed solution?

PS: Please share your ideas before the GCI competition starts on Monday 26 Nov. :-) Thank you.

comment:3 by siarzhuk, 12 years ago

Blocking: 3065 added

(In #3065) Potential duplicate of #6447

comment:4 by siarzhuk, 12 years ago

Blocking: 7954 added

(In #7954) Potential duplicate of #6447

comment:5 by siarzhuk, 12 years ago

Blocking: 6252 added

(In #6252) Related to the problem discussed in #6447. If the STXT translator is powered by the ICU Charset Detector as mentioned there, this one should be fixed too.

comment:6 by siarzhuk, 12 years ago

Blocked By: 9395 added

comment:7 by Premislaus, 12 years ago

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

by Premislaus, 12 years ago

Attachment: MediaPlayer_subtitles.png added

by Premislaus, 12 years ago

Attachment: StyleEdit.png added

in reply to:  7 ; comment:8 by siarzhuk, 12 years ago

Replying to Premislaus:

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

I have no idea about the MediaPlayer one, but the StyledEdit one may be related. Please attach this document (or at least its beginning) too.

in reply to:  8 ; comment:9 by Premislaus, 12 years ago

Replying to siarzhuk:

Replying to Premislaus:

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed poorly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

I have no idea about the MediaPlayer one, but the StyledEdit one may be related. Please attach this document (or at least its beginning) too.

OK.

StyleEdit.png is a screenshot of an RTF file.

by Premislaus, 12 years ago

Attachment: text_example added

in reply to:  9 ; comment:10 by siarzhuk, 12 years ago

Replying to Premislaus:

StyleEdit.png is a screenshot of an RTF file.

Does it mean that you opened the RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The text_example rendering in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe an RTF translator problem?

in reply to:  10 comment:11 by Premislaus, 12 years ago

Replying to siarzhuk:

Replying to Premislaus:

StyleEdit.png is a screenshot of an RTF file.

Does it mean that you opened the RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The text_example rendering in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe an RTF translator problem?

Yeah! This RTF file works fine under Windows, but displays the wrong characters in Haiku. The problem is with all text files that I have. I think it's a general problem, not only with RTF. I have problems with txt and srt files, etc.

comment:12 by axeld, 12 years ago

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.
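
To illustrate why guessing is the hard part, here is a small Python sketch (the helper name `decode_candidates` and the sample word are illustrative assumptions). The same bytes are rejected by UTF-8 but accepted without error by several single-byte charsets, each producing different text, so a successful decode alone does not identify the encoding.

```python
from typing import Dict, Iterable


def decode_candidates(data: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Map each encoding that decodes the data without error to its reading."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # This encoding is ruled out for the given bytes.
    return results


# A Polish word stored in Windows-1250, a common legacy charset for Polish.
sample = "gęślą".encode("cp1250")
readings = decode_candidates(sample, ["utf-8", "cp1250", "iso8859_2", "latin_1"])
```

Here UTF-8 drops out, while the three single-byte candidates all succeed and disagree with each other. Picking the right one needs extra evidence, such as language statistics or metadata stored with the file.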

For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.

in reply to:  12 comment:13 by Premislaus, 12 years ago

Replying to axeld:

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.

For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.

This is Microsoft's fault, with their stupid encodings. I created a new ticket... https://dev.haiku-os.org/ticket/9654


in reply to:  12 comment:14 by siarzhuk, 12 years ago

Replying to axeld:

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.

By the way, ICU has charset detectors, and we can profit from this by wrapping those features in the Locale Kit. This is what #9395 is about. Anyway, STXTTranslator makes some homebrew, rudimentary assumptions about the encoding. So, IMO, we have to get our hands dirty sooner or later.
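
For illustration only, a rudimentary guess of this kind could look like the Python sketch below. It is neither STXTTranslator's actual logic nor ICU's detector: the name `sniff_encoding` and the `cp1250` fallback are assumptions, and UTF-32 byte order marks are deliberately ignored.

```python
import codecs


def sniff_encoding(data: bytes, fallback: str = "cp1250") -> str:
    """Guess an encoding from a byte order mark, then from UTF-8 validity."""
    # A byte order mark, when present, is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Without a mark, accept UTF-8 only if the whole buffer is well formed.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Everything else falls back to a configured legacy charset.
        return fallback
```

The fallback branch is where a statistical detector would add value, because a fixed legacy charset is only right for users whose files happen to use it.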

comment:15 by siarzhuk, 10 years ago

Owner: changed from siarzhuk to nobody

Those were taken some years ago as potential GCI tasks. Unfortunately, there is no room for them in my schedule these days.

comment:16 by pulkomandy, 8 years ago

Blocking: 6252 removed

(In #6252) Fixed in hrev50552 for StyledEdit. For Pe, this is now to be handled by Pe developers.

comment:17 by pulkomandy, 8 years ago

Blocking: 3065 removed

(In #3065) Fixed in hrev50552.

comment:18 by pulkomandy, 4 years ago

Milestone: R1 → R1/beta2
Resolution: fixed
Status: assigned → closed

Fixed in hrev50552 and hrev54166.
