Opened 12 years ago

Last modified 4 years ago

#976 assigned bug

convert_from_utf8() broken

Reported by: axeld Owned by: pulkomandy
Priority: normal Milestone: Unscheduled
Component: Kits/libtextencoding.so Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description (last modified by pulkomandy)

convert_from_utf8() uses iconv() in the background. However, it opens/closes the iconv context with every invocation, which is just wrong.

It becomes obvious with the UTF-16 encoding, where a marker is added at the beginning of the file, or, in our case, for every call to convert_from_utf8().

It is not possible to open/close the context properly without changing the API. Since this conflicts with the goal of Haiku R1, this bug is delayed to post-R1 schedule. Likely, a new API will need to be designed as part of the locale kit. It may be done before R2, and coexist with the legacy one.

Change History (15)

comment:1 Changed 12 years ago by wkornewald

Component: - GeneralKits/libtextencoding.so

So, you complain about components, but then don't even use them? :)

comment:2 Changed 12 years ago by axeld

Well, guess what, it didn't exist when I wrote that bug report :-) And I needed to search it quite a while until I really knew that, I must have forgotten updating the bug afterwards ;-)

comment:3 Changed 12 years ago by jackburton

Would it work if we used a static variable (TLS ?) to keep the initialized state of iconv ? Thus we'd just call iconv_open() the first time convert_encoding() is called...

comment:4 in reply to:  3 Changed 12 years ago by jackburton

Replying to jackburton:

Would it work if we used a static variable (TLS ?) to keep the initialized state of iconv ? Thus we'd just call iconv_open() the first time convert_encoding() is called...

Hmmm problem would be, thought, that we wouldn't know when to call iconv_close(). Ok, let go what I just said.

comment:5 Changed 12 years ago by korli

An API change seems to be unavoidable. convert_encoding_open() convert_encoding_close() More or less the way iconv work :/

comment:6 Changed 12 years ago by axeld

We can't change the API in such a way if we want to stay binary compatible, though.

I think we could work out something with TLS and the "state" parameter, even though it wouldn't be that clean: we could only lazily close the iconv context in case a) a new translation is started (with "state" = 0), or if b) the thread exits.

Creating a new iconv context for every call to convert_{from|to}_utf8() is also very slow.

comment:7 Changed 12 years ago by korli

Sounds good. But IMO if it is too hacky for us, we could (should?) add some functions or a class which clearly open and close the conversion stream.

comment:8 Changed 12 years ago by mmu_man

Hmm and what about apps that do more than 1 convertion at the same time in the same thread, with different state vars ? like, it's converting to UTF-8, and from it to something else... interleaving calls. I know it's lame but possible. What about a (thread-safe = locked) list (hash table ?) of state pointers ? Each time a new pointer is seen we allocate an iconv context. We wouldn't really know when to close the context though, but it would be more correct.

comment:9 Changed 9 years ago by axeld

Owner: changed from axeld to nobody
Version: R1/pre-alpha1R1/Development

comment:10 Changed 9 years ago by pulkomandy

Owner: changed from nobody to pulkomandy
Status: newassigned

comment:11 Changed 8 years ago by pulkomandy

Description: modified (diff)
Milestone: R1Unscheduled

comment:12 Changed 8 years ago by mmu_man

Indeed it's still a problem it seems... We can't put a pointer to a conv_struct unless we allow leaking them.

One option is to keep opening/closing the context on each call, but copy back the istate/ostate member of the struct conv_struct that is returned by iconv_open(), depending on if encoding to or from a multibyte format. It seems to be the only member that actually maintains state, the rest being constant for each invocation and only depending on the to and from encoding setting. Unless istate and ostate are used at the same time... but probably not with those two calls.

comment:13 Changed 8 years ago by mmu_man

It seems since we convert to and from UTF-8 and iconv internally uses UCS-4 we will probably be using both istate and ostate in this usage pattern though :-(

comment:14 Changed 4 years ago by jackburton

Couldn't we use ICU for this ?

comment:15 Changed 4 years ago by pulkomandy

ICU has the same "problem" as iconv: it needs explicit opening and closing of the context. As mentioned in the previous comments, stateless conversion is not possible, and a new API will be needed as the current one doesn't allow deleting the state when done. The use of an int32 as a state cookie is also not helpful, as we can't store a pointer to the state there, at least on 64-bit platforms.

So, we should introduce a separate API to handle this in a more correct way.

Note: See TracTickets for help on using tickets.