Opened 9 years ago

Last modified 3 years ago

#6447 assigned bug

StyledEdit doesn't check for valid utf-8

Reported by: zooey
Owned by: nobody
Priority: normal
Milestone: R1
Component: Applications/StyledEdit
Version: R1/alpha2
Keywords:
Cc:
Blocked By: #9395
Blocking: #7954
Has a Patch: no
Platform: All

Description

If you use StyledEdit to open a file that doesn't contain utf-8 but text encoded in a different charset, BTextView will get confused about the character widths and strange things will happen when editing near the bogus characters.

Easily reproducible with /boot/system/data/Canna/default/default.canna.

So I guess StyledEdit (and the file-based BTextView::SetText()) should check the file content for being valid utf-8 and complain if it isn't.
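A minimal sketch of such a check, written in Python purely for illustration (this is not Haiku or BTextView code, and the function name `find_invalid_utf8` is hypothetical). It relies on the strict UTF-8 decoder to locate the first offending byte:

```python
from typing import Optional


def find_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration only; it is not part of any
    Haiku API.
    """
    try:
        # Strict error handling is the default, so any malformed
        # sequence raises UnicodeDecodeError.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None
```

An editor could call something like this on the raw file content before handing it to the text view, and complain to the user when the result is not `None`.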

Attachments (3)

MediaPlayer_subtitles.png (482.2 KB) - added by Premislaus 6 years ago.
StyleEdit.png (169.6 KB) - added by Premislaus 6 years ago.
text_example (13.5 KB) - added by Premislaus 6 years ago.


Change History (20)

comment:1 Changed 6 years ago by siarzhuk

Owner: changed from korli to siarzhuk
Status: changed from new to assigned

This ticket was published as a GCI 2012 task: http://google-melange.appspot.com/gci/task/view/google/gci2012/7986202 So I am taking ownership of it.

comment:2 Changed 6 years ago by siarzhuk

Hm, it looks like this issue (as well as #3065 and #7954) is a consequence of the design of our StyledText data translator.

One possible solution would be to add an STXT translator option, "Replace bogus UTF-8 data with ...", and to fix the translator so that it checks for invalid UTF-8 sequences, non-printable characters and '\0' characters before the real end of the data stream. After that we can implement the following:

1) Set the "replace bogus data" option to "off" and try to read the text file;
2) If step 1 fails, ask the user: "File contains invalid characters. Proceed with replacing bogus data?";
3) If the user agrees to the replacement, switch the option "on" and try to load the text file again.
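The three steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the STXT translator's actual interface: the names `decode_strict`, `decode_replacing` and `load_text` are hypothetical, the `ask_user` callback stands in for the alert shown to the user, and U+FFFD is assumed as the replacement character. The sketch checks only for invalid UTF-8 sequences and '\0'; the non-printable character check is left out.

```python
from typing import Callable, Optional

QUESTION = ("File contains invalid characters. "
            "Proceed with replacing bogus data?")


def decode_strict(data: bytes) -> str:
    """Step 1: decode with the "replace bogus data" option off."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError if invalid
    if "\0" in text:
        raise ValueError("NUL character inside the text data")
    return text


def decode_replacing(data: bytes) -> str:
    """Step 3: decode with the option on, substituting U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return text.replace("\0", "\ufffd")


def load_text(data: bytes,
              ask_user: Callable[[str], bool]) -> Optional[str]:
    """Return the decoded text, or None if the user declines."""
    try:
        return decode_strict(data)
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        pass
    # Step 2: the strict attempt failed, so ask before altering anything.
    if not ask_user(QUESTION):
        return None
    return decode_replacing(data)
```

Keeping the strict attempt first means that valid files are never modified silently; replacement happens only after explicit consent.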

Are there any opinions about the proposed solution?

PS: Please share your ideas before the GCI competition starts on Monday, 26 Nov. :-) Thank you.

comment:3 Changed 6 years ago by siarzhuk

Blocking: 3065 added

(In #3065) Potential duplicate of #6447

comment:4 Changed 6 years ago by siarzhuk

Blocking: 7954 added

(In #7954) Potential duplicate of #6447

comment:5 Changed 6 years ago by siarzhuk

Blocking: 6252 added

(In #6252) Related to the problem discussed in #6447. If the STXT translator is powered by the ICU Charset Detector as mentioned there, this one should be fixed too.

comment:6 Changed 6 years ago by siarzhuk

Blocked By: 9395 added

comment:7 Changed 6 years ago by Premislaus

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed incorrectly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

Changed 6 years ago by Premislaus

Attachment: MediaPlayer_subtitles.png added

Changed 6 years ago by Premislaus

Attachment: StyleEdit.png added

comment:8 in reply to:  7 Changed 6 years ago by siarzhuk

Replying to Premislaus:

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed incorrectly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

I have no idea about the MediaPlayer one, but the StyledEdit one may be related. Please attach this document (or at least its beginning) too.

comment:9 in reply to:  8 Changed 6 years ago by Premislaus

Replying to siarzhuk:

Replying to Premislaus:

I have a problem with displaying Polish characters in StyledEdit and MediaPlayer (subtitles). They are displayed incorrectly, in the wrong character set. Changing the text encoding does not help! Is this related to this ticket, or is it another problem?

I have no idea about the MediaPlayer one, but the StyledEdit one may be related. Please attach this document (or at least its beginning) too.

OK.

StyleEdit.png is a screenshot of an RTF file.

Changed 6 years ago by Premislaus

Attachment: text_example added

comment:10 in reply to:  9 Changed 6 years ago by siarzhuk

Replying to Premislaus:

StyleEdit.png is a screenshot of an RTF file.

Does that mean you opened an RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The text_example rendering in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe it is an RTF translator problem?

comment:11 in reply to:  10 Changed 6 years ago by Premislaus

Replying to siarzhuk:

Replying to Premislaus:

StyleEdit.png is a screenshot of an RTF file.

Does that mean you opened an RTF file with StyledEdit and saved it as text_example in UTF-8 encoding? The text_example rendering in the Trac viewer and locally on my Windows machine is identical to your StyleEdit.png screenshot. It looks like the problem is somewhere on the way from RTF to UTF-8 text. Maybe it is an RTF translator problem?

Yeah! This RTF file works fine under Windows, but displays the wrong characters in Haiku. The problem occurs with all the text files that I have. I think it's a general problem, not only with RTF. I have problems with txt and srt files, etc.

comment:12 Changed 6 years ago by axeld

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.

For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.
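To illustrate why detecting non-UTF-8 input does not by itself reveal the real encoding, here is a small Python sketch (illustration only; the candidate list and the name `plausible_decodings` are assumptions made for this example). Single-byte code pages assign a character to almost every byte value, so the same data usually decodes without error under several of them, each yielding different text:

```python
# Hypothetical candidate list; a real detector would consider many more.
CANDIDATES = ("iso-8859-1", "iso-8859-2", "cp1250", "cp1252")


def plausible_decodings(data: bytes) -> dict:
    """Map each candidate encoding that decodes without error to its text."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # Some code pages leave a few byte values undefined.
            pass
    return results


if __name__ == "__main__":
    # Polish text stored as ISO-8859-2; these bytes are not valid UTF-8.
    sample = "Zażółć gęślą jaźń".encode("iso-8859-2")
    for name, text in plausible_decodings(sample).items():
        print(name, text)
```

Every candidate accepts the sample, so picking the right one requires statistical or linguistic heuristics rather than a simple validity test.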

comment:13 in reply to:  12 Changed 6 years ago by Premislaus

Replying to axeld:

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.

For RTF, the case is different, though, as it contains the character encoding. So if an RTF is shown with the wrong encoding, it's a bug in the translator.

This is the fault of Microsoft and their stupid encodings. I created a new ticket: #9654

Last edited 6 years ago by Premislaus

comment:14 in reply to:  12 Changed 6 years ago by siarzhuk

Replying to axeld:

Haiku always uses UTF-8 by default. Guessing the character encoding is pretty hard to do for most encodings, and usually requires a lot of work. Granted, you can easily find out if something is not UTF-8, but that doesn't get you that far.

By the way, ICU has charset detectors, and we can profit from them by wrapping those features in the Locale Kit. This is what #9395 is about. Anyway, STXTTranslator already makes some rudimentary homebrew assumptions about the encoding. So, IMO, we will have to get our hands dirty sooner or later.

comment:15 Changed 4 years ago by siarzhuk

Owner: changed from siarzhuk to nobody

Those were taken some years ago as potential GCI tasks. Unfortunately, I have no room for them in my schedule these days.

comment:16 Changed 3 years ago by pulkomandy

Blocking: 6252 removed

(In #6252) Fixed in hrev50552 for StyledEdit. For Pe, this is now to be handled by Pe developers.

comment:17 Changed 3 years ago by pulkomandy

Blocking: 3065 removed

(In #3065) Fixed in hrev50552.
