Opened 13 months ago

Last modified 11 months ago

#14738 assigned bug

[pc_serial] Sometimes gets stuck in Write() (even in async mode)

Reported by: ttcoder
Owned by: mmu_man
Priority: normal
Milestone: Unscheduled
Component: Drivers/TTY
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

I have a report from KEZM (they have an RS-232 switcher driven by CC, i.e. CommandCenter, on Haiku): CC tends to get stuck in the serial write a few times per week, and the app freezes.

My code configures the BSerialPort like this:

serialInput.SetBlocking( false );

Yet it still seems to end up in a 'blocked' state?

Change History (7)

comment:1 by ttcoder, 13 months ago

Running ps, he gets this:

/boot/home/config/apps/TuneTrackerSystems/CommandCenter   589        7    0    0 
  Thread                                   Id    State Prio    UTime    KTime
CommandCenter                           589     wait   10  1294940   340328 
event_server                            607     wait   10    85438    30826 SwitcherHandler_looper(3464)
logging_server                          608     wait   10    17937     8280 
render                                  610      zzz   15  1514434  1373240 
w>Command Center                        618     wait   15   953315   243675 
SwitcherHandler_looper                  624     wait   10   222743   178693 pc_serial:done_write(3480)

Looking for "pc_serial:done_write" yields this: http://xref.plausible.coop/source/xref/haiku/src/add-ons/kernel/drivers/ports/pc_serial/SerialDevice.cpp

From my limited understanding, it seems the semaphore is acquired in Write() and released in WriteCallbackFunction()? If so, the fact that my thread remains stuck means the callback was never called. Does that sound correct? How do I deal with that in my app: can I "unblock" my thread from another thread?

EDIT: oddly, looking for other references to the callback yields nothing: http://xref.plausible.coop/source/search?q=WriteCallbackFunction&defs=&refs=&path=&hist=&type=&project=haiku Is it not called by anyone?

Last edited 13 months ago by ttcoder

comment:2 by pulkomandy, 13 months ago

Component: changed from Kits/Device Kit to Drivers/TTY
Owner: changed from pulkomandy to mmu_man

It seems the pc_serial driver handles neither non-blocking writes nor timeouts. So I don't see a way to fix this from the application side.

I don't see where that write callback is called; however, I see that the semaphore is also released in the interrupt handler.

I suspect a race condition, where we manage to call write twice, quickly enough, and the interrupt triggers only once, resulting in the second write call blocking for no reason.

Reassigning to mmu_man, as he wrote this driver.

comment:3 by ttcoder, 13 months ago, in reply to: comment:2

Replying to pulkomandy:

> I suspect a race condition, where we manage to call write twice, quickly enough, and the interrupt triggers only once, resulting in the second write call blocking for no reason.

Very interesting... If that is indeed the cause, I guess it could be mitigated by calling snooze(9000) (or whatever the delay between interrupts is) after each write? Remember, a perfect fix later does not have to exclude an immediate work-around -- the station would be happy if I could provide a timely improvement :-)

comment:4 by mmu_man, 13 months ago

WriteCallbackFunction seems to be unused; it's a leftover from the usb_serial driver code, which was used as a model.

Looking back at the code it's amazing it actually works :)

comment:5 by ttcoder, 11 months ago

I sent the station a build of CC with a snooze( 100000 ) call (100 ms) before each write(). I don't think he's seeing lock-ups/freezes any more, but for some reason it seems to have made the other bug much worse: the one where the serial port gets "out of sync" and confuses the switcher box ID...

Just realized there is this in his syslogs...

KERN: pc_serial: write: failed to get write done sem 0x8000000a

... and that 0x8000000a resolves to 0x8000000a: Interrupted system call
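As a small illustration of why that value reads as "interrupted", here is a sketch that rebuilds the code from its parts. The `SKETCH_` names are hypothetical; the base value and the offset of 10 are my assumption about how Haiku's Errors.h lays out B_INTERRUPTED, so check them against the real header.

```c
#include <stdint.h>

/* Assumption: in Haiku's Errors.h the general error base is INT_MIN and
   B_INTERRUPTED sits at offset 10 from it. These names are hypothetical. */
#define SKETCH_GENERAL_ERROR_BASE INT32_MIN
#define SKETCH_B_INTERRUPTED (SKETCH_GENERAL_ERROR_BASE + 10)

/* Return the status as the unsigned 32-bit pattern that syslog prints in hex. */
static uint32_t status_as_hex(int32_t status)
{
	return (uint32_t)status;
}
```

With those assumptions, `status_as_hex(SKETCH_B_INTERRUPTED)` gives the same bit pattern as the syslog line, which is consistent with the "Interrupted system call" text above.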

Let's take a bit of a logical leap here, and assume that the above is related to either (or both) bugs he is seeing, and needs to be addressed...

I seem to recall that when a piece of code is vulnerable to that, it needs to wrap its call in a while() loop, something like "while( write(blah) == B_INTERRUPTED ) try_again()". Maybe that is true of the pc_serial driver as well, and it would need such a while loop?

comment:6 by korli, 11 months ago

A write() syscall is allowed to return EINTR (see http://pubs.opengroup.org/onlinepubs/9699919799//functions/write.html : [EINTR] The write operation was terminated due to the receipt of a signal, and no data was transferred.)

The loop could happen on the client side. Wouldn't doing the loop in pc_serial cause problems with, for instance, Ctrl+Break?
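A client-side loop of that kind could look roughly like the sketch below. It uses plain POSIX write() on a file descriptor, not BSerialPort, and the helper name `write_all_retrying` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, restarting whenever a signal interrupts the call.
   Returns 0 on success, -1 on any error other than EINTR (errno is left set). */
static int write_all_retrying(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before data moved: try again */
			return -1;		/* real error: let the caller decide */
		}
		p += n;				/* partial write: advance past what was sent */
		len -= (size_t)n;
	}
	return 0;
}
```

Note that this retries unconditionally. An application that wants to stop on a signal would check a flag set by its own handler before looping again, which is one reason to keep the loop in the application and not in the driver.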

comment:7 by pulkomandy, 11 months ago

The driver cannot do much about this.

The idea is that an application should respond immediately to signals (for example, when you press Ctrl+C). So if it is blocked in a system call (such as read() or write()), the call is interrupted and returns EINTR to signal this to the application.

The application can then handle the signal, and restart its system call.

Alternatively, you can set the SA_RESTART flag when installing the signal handler (with sigaction()) to let the syscall be retried automatically.
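A minimal sketch of the SA_RESTART variant, using standard POSIX sigaction(). The handler and the helper name `install_restarting_handler` are hypothetical, and whether a write blocked inside pc_serial is actually restarted this way is not confirmed anywhere in this ticket.

```c
#define _POSIX_C_SOURCE 200809L	/* expose sigaction() under strict compiler modes */
#include <signal.h>
#include <string.h>

/* Placeholder handler: does nothing. A real application would set a flag here. */
static void on_signal(int signo)
{
	(void)signo;
}

/* Install on_signal for signo with SA_RESTART, so that interrupted system
   calls are restarted where the system supports it.
   Returns 0 on success, -1 on failure (errno set by sigaction). */
static int install_restarting_handler(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	return sigaction(signo, &sa, NULL);
}
```

Calling `install_restarting_handler(SIGUSR1)` once at startup would cover that one signal; each signal the application handles needs its own call.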
