Opened 13 months ago

Last modified 8 months ago

#14674 new bug

StyledEdit Misreads UTF-8 Files as Something Else

Reported by: AGMS
Owned by: nobody
Priority: normal
Milestone: Unscheduled
Component: Applications/StyledEdit
Version: R1/Development
Keywords: Encoding
Cc: agmsmith@…
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

When reading the BeShare BeShareDocs.txt file, which is UTF-8 and reads fine in BeOS and Linux, StyledEdit mangles accented characters. Is it picking the wrong encoding, or is something else going on?

By the way, the Encoding menu in the open file panel doesn't let you pick UTF-8; you only get the default option, which maybe autodetects. Perhaps autodetection should be a separate "auto" menu item, so that you can force it to use UTF-8.
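For illustration, here is what the "wrong encoding" hypothesis would look like. This is a minimal Python sketch, not StyledEdit code, and it assumes the UTF-8 bytes are being decoded as ISO-8859-1; whether that matches the screenshot exactly has not been checked here.

```python
# Illustration only: decode UTF-8 bytes with the wrong (single-byte) codec.
original = "François"
utf8_bytes = original.encode("utf-8")  # ç is stored as the two bytes C3 A7

# Wrong: treat every byte as one ISO-8859-1 character.
mangled = utf8_bytes.decode("iso-8859-1")

# Right: decode with the encoding the file was actually written in.
restored = utf8_bytes.decode("utf-8")

print(mangled)   # FranÃ§ois
print(restored)  # François
```

If the mangled text in StyledEdit shows two odd characters for each accented letter, that would point towards this kind of misdetection rather than a rendering problem.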

Attachments (5)

BeShareDocs.txt (66.0 KB) - added by AGMS 13 months ago.
The sample file with accented characters in the credits section. Look at the notes for version 2.27 for example, where François's name gets mangled upon loading.
StyledEditEncodingBug.png (140.0 KB) - added by AGMS 13 months ago.
Screen shot comparing output in Terminal of the file (it's fine) with StyledEdit after loading the same file, which gets mangled (highlighted in inverse video).
UTF8.txt (25 bytes) - added by Pete 13 months ago.
Short text with some UTF-8 extended characters
ascii.txt (17 bytes) - added by Pete 13 months ago.
Short plain ascii text
Windows.txt (25 bytes) - added by Pete 13 months ago.
Similar characters to the UTF-8 file, but saved as Windows-1252 (equiv to ISO-8859-1)

Change History (14)

by AGMS, 13 months ago

Attachment: BeShareDocs.txt added

by AGMS, 13 months ago

Attachment: StyledEditEncodingBug.png added

comment:1 by AGMS, 13 months ago

Further investigation by Pete shows something funny is also happening with the be:encoding attribute. It is sometimes an int32 set to 65535, at other times it is a text string naming the ISO encoding.

comment:2 by Pete, 13 months ago

Some more observations:

If you create a new text file, StyledEdit doesn't give you any encoding choice (all greyed out). When you save it, it gets tagged with the '65535' encoding -- which apparently equates to UTF-8.

If you load that or any other file back into StyledEdit, the encoding menu list is enabled, but trying to select anything but the already-selected choice doesn't work!

If a file already has a valid be:encoding attribute (as text rather than integer!) that will become StyledEdit's choice. You can use addattr to set the attribute, and StyledEdit will accept it.

This all seems rather useless... (:-/) If you're creating a text file it might well be for some purpose that requires, say, ISO-8859 encoding. So you should be given that initial choice. Conversely, if a file happens to have acquired the wrong encoding (as BeShareDocs.txt had, somehow), you need to be able to fix it!
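To make the two observed forms of be:encoding concrete, here is a rough Python sketch of a reader that accepts either one. It is not taken from the StyledEdit source. The `kind` tag, the function name, and the little-endian int32 layout are assumptions for illustration, and 65535 meaning UTF-8 is only what the observations above suggest.

```python
import struct

# Value observed in this ticket; assumed (not confirmed) to mean UTF-8.
UTF8_MARKER = 65535


def describe_encoding_attr(kind: str, raw: bytes) -> str:
    """Interpret a be:encoding value stored either as int32 or as text.

    `kind` is a hypothetical tag ("int32" or "string") standing in for
    the attribute's real type code.
    """
    if kind == "int32":
        # Assumes a little-endian 32-bit integer.
        (value,) = struct.unpack("<i", raw[:4])
        if value == UTF8_MARKER:
            return "UTF-8"
        return f"conversion id {value}"
    if kind == "string":
        # Stop at a NUL if one is present; otherwise use all the bytes.
        name = raw.split(b"\x00", 1)[0]
        return name.decode("ascii", errors="replace")
    raise ValueError(f"unexpected attribute kind: {kind}")


print(describe_encoding_attr("int32", struct.pack("<i", 65535)))  # UTF-8
print(describe_encoding_attr("string", b"ISO-8859-1"))            # ISO-8859-1
```

Accepting both forms on read while writing only one of them would be one way to stop the attribute flipping between types.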

comment:3 by pulkomandy, 13 months ago

The encoding to use when saving a file can be selected in the save filepanel. Likewise when opening a file. Which makes me wonder what's the point of having an encoding menu in the main window in the first place.

To keep things manageable, the internal encoding should always be UTF-8 (or some other Unicode representation). Encoding is a file I/O operation, unless we want to restrict which characters can be entered to make sure a file can be faithfully represented in a given encoding. Also, loading then saving a file should of course preserve the existing encoding. I think these are the reasons why StyledEdit keeps an internal notion of encoding even when not doing file I/O.
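A small Python sketch of that model, for discussion only: text is Unicode in memory, the codec is applied only at load and save, and the document remembers which encoding it was loaded with. The `TextDocument` class and its method names are made up for this example and do not correspond to StyledEdit's classes.

```python
class TextDocument:
    """Hypothetical model: Unicode in memory, encoding only at file I/O."""

    def __init__(self):
        self.text = ""
        self.encoding = "utf-8"  # assumed default for new documents

    def load(self, path, encoding):
        with open(path, "rb") as f:
            raw = f.read()
        self.text = raw.decode(encoding)
        # Remember the encoding so a plain "save" preserves it.
        self.encoding = encoding

    def save(self, path, encoding=None):
        # An explicit choice (e.g. from a save panel) overrides the stored one.
        if encoding is not None:
            self.encoding = encoding
        # Raises UnicodeEncodeError if the text contains characters the
        # target encoding cannot represent.
        data = self.text.encode(self.encoding)
        with open(path, "wb") as f:
            f.write(data)
```

The `UnicodeEncodeError` case is where the "restrict which characters can be entered" question shows up: either the editor prevents such input, or it has to report the problem at save time.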

I had a look through the source code and didn't find any place where we make an attempt to guess the encoding. Either I missed it (this involves BTextView, BTranslationUtils, and a bunch of other things), or it's missing altogether and should be added. I'm not sure how we end up deciding on ISO_8859; if in doubt, I'd say we should rather assume UTF-8 these days.
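As a sketch of what "assume UTF-8 if in doubt" could mean in practice: try a strict UTF-8 decode first and fall back to a legacy encoding only when the bytes are not valid UTF-8. This is plain Python for illustration, not how StyledEdit, BTextView, or BTranslationUtils work, and the ISO-8859-1 fallback is an arbitrary choice for the example.

```python
def guess_encoding(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Return "utf-8" if the bytes decode strictly as UTF-8, else `fallback`.

    Pure ASCII is valid UTF-8, so it is reported as UTF-8 as well.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


print(guess_encoding(b"plain ascii text"))               # utf-8
print(guess_encoding("François".encode("utf-8")))        # utf-8
print(guess_encoding("François".encode("iso-8859-1")))   # iso-8859-1
```

This works reasonably well because typical text in a single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences. It cannot tell the various ISO-8859 variants apart, so the fallback still needs either a user choice or a real detector.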

by Pete, 13 months ago

Attachment: UTF8.txt added

by Pete, 13 months ago

Attachment: ascii.txt added

by Pete, 13 months ago

Attachment: Windows.txt added

comment:4 by Pete, 13 months ago

It turns out I was confused by using two different revs of StyledEdit. My work partition is about 4 years old, but I initially tested in my latest hrev51670 from last December. The differences weren't obvious, but I think I have them sorted now.

I created some short test files: UTF8.txt, ascii.txt, and Windows.txt (attached). I initially stripped them of all attributes, then loaded them into (both versions of) StyledEdit.

My older version behaves fairly sanely. If I load either UTF8.txt, or ascii.txt, they are displayed correctly. If I quit without re-saving, no encoding attribute is added. If I save, the numeric attribute 65535 gets added.

If I load Windows.txt, with its non-Unicode characters, the attribute gets immediately set to 'iso-8859-1' (no saving required).

The new version is just weird. The attribute is set on loading in all cases (I never re-saved), but totally arbitrarily! Here's what I got with "catattr be:encoding *.txt", immediately after loading each into SE, with no saving:

UTF8.txt : string : UTF-8
Windows.txt : string : ISO-8859-1
ascii.txt : string : ISO-8859-2
BeShareDocs.txt : string : ISO-8859-1

Notice that a) UTF8.txt and BeShareDocs.txt, which have pretty much the same extended characters, get different encodings, and b) it thinks ascii.txt -- which has no extended characters -- is East European!! (BTW I did truncate all those strings as displayed by catattr, because they are not null-terminated and get trailing garbage. The attribute length is correct.)

This all seems to show that it is at least trying somehow to decipher the encoding, but it did rather better a few years ago!
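On the trailing garbage: since the stored length is correct but there is no terminator, a reader has to slice by length rather than scan for a NUL. A tiny Python illustration with a made-up buffer (the function name and the leftover bytes are invented for the example):

```python
def attr_text(buffer: bytes, length: int) -> str:
    """Return the first `length` bytes of an attribute buffer as text."""
    return buffer[:length].decode("ascii", errors="replace")


# Made-up buffer: 10 bytes of attribute data followed by leftover bytes,
# with no NUL terminator in between.
buffer = b"ISO-8859-2" + b"\x01xyz"

print(attr_text(buffer, 10))  # ISO-8859-2
```

Alternatively the writer could store the terminating NUL as part of the attribute, so that tools expecting a C string display it cleanly.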

I agree that the main window menu seems superfluous -- especially as it's not a selectable menu! (I'd never really noticed the Save Panel one! The one in the Load Panel, BTW, is ignored if be:encoding is set. Not sure if that's correct behaviour.)

How about replacing the main menu with a field in the menubar that displays the encoding used?

comment:5 by pulkomandy, 13 months ago

There is already a field that displays the encoding at the bottom next to the scrollbar.

It would help to know the hrev for your "about 4 years old" partition. It's possible the encoding detection was reworked to use ICU, but I don't remember when I did that.

in reply to:  5 comment:6 by Pete, 13 months ago

Replying to pulkomandy:

> There is already a field that displays the encoding at the bottom next to the scrollbar.

Oh, yeah -- I see. It only appears for non-UTF8 text, so it never showed up for me. I'd say the menu should definitely go, then.

> It would help to know the hrev for your "about 4 years old" partition. It's possible the encoding detection was reworked to use ICU, but I don't remember when I did that.

It's hrev50180.

comment:7 by waddlesplash, 13 months ago

There were a couple of bugs which might have affected this which were fixed in hrev52522.

comment:8 by cocobean, 8 months ago

comment:9 by AGMS, 8 months ago

Still not working as of hrev53065 (April 2019).
