Opened 3 years ago

Closed 9 months ago

Last modified 4 months ago

#16846 closed enhancement (fixed)

The future of fd stateful monitoring / eventing in Haiku

Reported by: kallisti5 Owned by: nobody
Priority: normal Milestone: R1/beta5
Component: Kits/Kernel Kit Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description (last modified by kallisti5)

Today Haiku offers a "wait_for_objects" API (stateless), which is as close as we get to the epoll / kqueue APIs that exist on OS X / FreeBSD / Linux.

Pulkomandy pointed out a great overview here:

https://fosdem.org/2021/schedule/event/file_descriptor_monitoring/

... and some previous work around event polling here:

https://github.com/hamishm/haiku/tree/eventqueue

A good model would likely be:

  1. Implement a fairly standard/complete io_uring (fd and sockets)
  2. Add epoll/kqueue compatibility functions which silently leverage io_uring
  3. Drop wait_for_objects or make it official. (only consumer seems to be power daemon https://git.haiku-os.org/haiku/tree/src/servers/power/power_daemon.cpp#n118)

That video mentions performance improvements at large scale for software using epoll/kqueue/io_uring instead of the POSIX poll/select.

Change History (29)

comment:1 by kallisti5, 3 years ago

Description: modified (diff)
Summary: Better document wait_for_objects API calls, or reimplement? → The future of fd stateful monitoring / eventing in Haiku

comment:2 by kallisti5, 3 years ago

Description: modified (diff)
Keywords: io_uring added

comment:3 by kallisti5, 3 years ago

Side note: a lot of this was spawned by Rust's mio crate not working under Haiku. Its design assumes a stateful fd monitoring API, which we don't have.

https://github.com/tokio-rs/mio/issues/1472

comment:4 by korli, 3 years ago

This is partly misleading.

OpenJDK uses wait_for_objects

It's part of the official public API: https://git.haiku-os.org/haiku/tree/headers/os/kernel/OS.h#n642

comment:5 by kallisti5, 3 years ago

/* WARNING: Experimental API! */

:-)

Either way.. wait_for_objects is stateless (which has caused mio to pretty much list us as unsupported outright), and likely suffers the same performance issues at scale as poll/select mentioned in the video above.

in reply to:  5 comment:6 by korli, 3 years ago

Replying to kallisti5:

/* WARNING: Experimental API! */

:-)

Yeah it might be removed or changed in a future release, after R1. To be documented at https://dev.haiku-os.org/wiki/FutureHaiku

comment:7 by waddlesplash, 2 years ago

comment:8 by waddlesplash, 2 years ago

Keywords: epoll kqueue io_uring removed

comment:9 by diver, 2 years ago

Is it possible to compile mio after https://github.com/smol-rs/polling/pull/26?

comment:10 by X512, 2 years ago

poll has a significant speed impact in Wine: wine_server spends most of its time in the kernel.

https://discuss.haiku-os.org/t/my-progress-in-porting-wine/11741/63

comment:12 by X512, 14 months ago

My WIP work: https://github.com/X547/haiku/commits/wait-objects3.

The idea is to send selected object events as port messages. Two message formats are planned: RAW (the object_wait_info structure is sent in the message body) or KMessage (which can be handled by an Application Kit BHandler, so selected object event handling can be gracefully integrated with the rest of the message handling). In the KMessage case, the BHandler token ID also needs to be specified.

Because object events are level-based and continue signalling after a port message is generated, a special mechanism is introduced to suppress message generation if a message is already present in the port message queue. Message generation is also delayed if the port queue is full (the capacity limit specified at port creation is reached).
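The suppression and delay rules described above can be modelled in a few lines of plain C. This is only a sketch under stated assumptions, not the kernel code from the WIP branch: `event_port`, `notify`, `read_event`, and the capacity of 2 are hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>

#define QUEUE_CAPACITY 2   /* hypothetical port capacity */
#define MAX_OBJECTS    4   /* hypothetical number of watched objects */

/* Hypothetical model of a port that carries object events. */
typedef struct {
    int  queue[QUEUE_CAPACITY];  /* object ids in FIFO order */
    int  count;                  /* messages currently queued */
    bool queued[MAX_OBJECTS];    /* object already has a message queued */
    bool deferred[MAX_OBJECTS];  /* event seen while the queue was full */
} event_port;

/* Called whenever a watched object is (still) signalled.
   Returns true if a new message was enqueued. */
static bool
notify(event_port* port, int object)
{
    if (port->queued[object])
        return false;  /* suppress: a message for this object is pending */
    if (port->count == QUEUE_CAPACITY) {
        port->deferred[object] = true;  /* delay until a slot frees up */
        return false;
    }
    port->queue[port->count++] = object;
    port->queued[object] = true;
    port->deferred[object] = false;
    return true;
}

/* Reader side: pops the oldest message, then retries deferred events.
   Returns the object id, or -1 if the queue is empty. */
static int
read_event(event_port* port)
{
    if (port->count == 0)
        return -1;
    int object = port->queue[0];
    for (int i = 1; i < port->count; i++)
        port->queue[i - 1] = port->queue[i];
    port->count--;
    port->queued[object] = false;

    for (int i = 0; i < MAX_OBJECTS; i++) {
        if (port->deferred[i])
            notify(port, i);
    }
    return object;
}
```

In this model a level-triggered source can call `notify()` as often as it likes: at most one message per object sits in the queue, and events that arrive while the queue is full are delivered once the reader frees a slot.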

Public API is planned to be like this:

enum {
    B_SELECT_OBJECT_IS_MESSAGE = 1 << 0,
    B_SELECT_OBJECT_AUTO_STOP  = 1 << 1,
    B_SELECT_OBJECT_CLEAR      = 1 << 2,
};

/**
@param port port that will receive selected object events
@param token used to specify destination BHandler, meaningful only if B_SELECT_OBJECT_IS_MESSAGE flag is set
@param infos array of object identifiers, each with a bitset of events to watch. If some events are already being watched for the specified port, the watch state is updated (an object stops being watched if a zero event bitmask is specified). The B_SELECT_OBJECT_CLEAR flag ignores the previous watch state so that only the new watch state applies (and watching stops entirely if an empty infos array is provided). If the B_SELECT_OBJECT_AUTO_STOP flag is set, an event automatically stops being watched after its message is generated.
@param numInfos length of infos array
*/

status_t
watch_objects(port_id port, int32 token, object_wait_info* infos, int numInfos, uint32 flags);

comment:13 by waddlesplash, 14 months ago

What's the auto-stop/clear messages for?

comment:14 by X512, 14 months ago

B_SELECT_OBJECT_AUTO_STOP is intended for messages handled by a BLooper. BLooper copies messages into memory (BMessageQueue) before processing them, so it would cause a message storm if event generation were not disabled after a message is sent. The kernel port queue's message storm prevention mechanism does not help here. BHandler::MessageReceived can re-enable the event after processing. B_SELECT_OBJECT_CLEAR is needed to stop watching all events.

comment:15 by pulkomandy, 12 months ago

Hello,

So, a few notes from my experience writing userspace code and needing an API like this.


I have two use cases to document. Neither is performance-critical to the point of needing io_uring (even if it is probably a sane base to build higher level APIs on), rather, I am interested in how this can integrate with the existing BLooper infrastructure.

The first one is the "services kit". The idea here is to handle network traffic in a way that integrates easily with existing applications. The way this is done in the current iteration is a separate thread (userspace) that gets the data from the network, does some processing, and eventually forwards the processed data in the form of BMessage to a BHandler that can be integrated in a typical application.

This approach did not work well for several reasons:

  • Performance: instead of just getting the data directly from the socket and processing it, we generally get the following: data is read from the socket, processed, serialized into a BMessage, and sent back to the kernel to be forwarded to another thread. This had a significant impact on performance.
  • Synchronization: in this approach, there are two threads working on the same 'object' (a network connection), and this creates a lot of complexity. Some code has to run in the http thread, some code is better run on the BLooper side. It would be a lot simpler if the BLooper could simply manage the socket and all the processing were in a single thread.

I have similar issues in Renga with the gloox xmpp library. There I could mostly solve it by leaving the BApplication thread completely unused. All I have to work with is one network and one BWindow thread. But there is a main thread sitting there doing nothing and wasting some resources.

A third example is the ACE Amstrad CPC emulator. No network in this case, but I need a thread that both processes BLooper messages and also does its own "background" processing when there are no pending messages. The code I ended up with there looks like this:

void MyClass::MessageReceived(BMessage* message)
{
    switch(message->what)
    {
        // Do the usual processing here
    }

    while (!IsMessageWaiting())
    {
        // Do the "idle" processing here, until the looper has a message pending
    }
}

Starting from this pattern, I do not need much more: just give me a way to access the BLooper underlying port, and add it to a wait_for_objects-like thing that I can check in this loop instead of BLooper::IsMessageWaiting(). Then I can dispatch the events myself as I need.

This approach is a bit hacky when done this way, but requires no extra overhead in BLooper.

Outside of Haiku, I have also hit cases where it would have been useful to wait on not only a file descriptor, but also, for example, a pthread_mutex. On Linux, this is not possible with epoll: it only allows waiting on file descriptors, and only some things are exposed as file descriptors (timerfd, signalfd, eventfd, ...), not pthread things. I think the other APIs proposed (kqueue, wait_for_objects) don't have this problem. I am not sure about io_uring.


On "fairness".

x512 (in IRC discussions) seems worried that some event sources would be handled in priority, and if such an event source has a lot of events, it would be always processed first, and could prevent the other ones from being processed at all.

His solution is to serialize all events in a port. The kernel notifies each event using the port, whenever something happens in one of the watched objects. Since the port has a FIFO queue, this guarantees the events are processed in order.

I think it is not a problem to process the events in order with the other APIs as well.

I have not much experience with the low-level details of ports, so I can't say if there are problems or advantages to this approach, in terms of ease of use and performance.

I also don't know what the plan would be: can the port be used for a mix of notifying events on other objects and receiving BMessages for a BLooper? Or would we split that into different levels, with one port carrying just object change events, which would then notify that another port has a pending message?

The fact that a port can queue messages raises a few questions. What if one of the queued events is notifying about a file descriptor (or other object) that has since been removed from the watch set? Who should take care of ignoring it? Can we drop such things from the message queue? There is also the question of "edge" vs "level" triggers in epoll: does our API need to handle that? Or is it somehow irrelevant for us?
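For readers unfamiliar with the "edge" vs "level" distinction, the following Linux-only sketch shows the difference using the real epoll API on a pipe. The helper name `count_wakeups` is made up for this illustration, and nothing here reflects how a Haiku API would behave.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Writes one byte to a pipe, never reads it, and counts how many of
   `rounds` consecutive epoll_wait() calls report the read end as readable.
   Pass 0 for level-triggered or EPOLLET for edge-triggered behavior.
   Returns -1 on error. */
static int
count_wakeups(unsigned int trigger_flag, int rounds)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | trigger_flag };
    ev.data.fd = fds[0];

    int wakeups = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0
        && write(fds[1], "x", 1) == 1) {
        wakeups = 0;
        for (int i = 0; i < rounds; i++) {
            struct epoll_event out;
            /* Timeout 0: check readiness without blocking. */
            if (epoll_wait(epfd, &out, 1, 0) == 1)
                wakeups++;
        }
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return wakeups;
}
```

With the level trigger, every wait reports the unread byte again; with EPOLLET, only the first wait after the write reports it. A port-based design that suppresses duplicate messages for a still-signalled object sits somewhere between the two.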

comment:16 by waddlesplash, 12 months ago

But there is a main thread sitting there doing nothing and wasting some resources.

Doesn't the main BApplication thread do mostly "nothing" in most applications, as it is?

Who should take care of ignoring it?

Ah, I hadn't thought of this. That sounds like an important difference, indeed.

My arguments for why we should not use ports only for this events API mainly revolved around: it sounds simpler and more "native", but ultimately requires more complexity, what with the "resubscribe" setup X512 details above which is necessary to prevent floods due to port and BMessageQueue behavior, and the overhead from processing events as BMessages.

I think a dedicated call, like kqueue/epoll have, makes more sense. If necessary, we can integrate this natively into BLooper; or use a setup like the one you propose here.

in reply to:  15 comment:17 by X512, 12 months ago

Replying to pulkomandy:

What if one of the queued events is notifying about a file descriptor (or other object) that was since removed from the watch set?

For deleted objects, such as FDs closed while being watched, there is a dedicated event, B_EVENT_INVALID. If an event is unsubscribed but already enqueued, it will still be delivered. Messages sent to a port are never altered. But message sending can be cancelled on unsubscribe if the message is not yet enqueued (no free slots in the port).

in reply to:  16 comment:18 by X512, 12 months ago

Replying to waddlesplash:

But there is a main thread sitting there doing nothing and wasting some resources.

but ultimately requires more complexity, what with the "resubscribe" setup X512 details above which is necessary to prevent floods due to port and BMessageQueue behavior, and the overhead from processing events as BMessages.

In the simple case, resubscribing is just one extra line of code in SomeHandler::MessageReceived(). I do not see much complexity. It is definitely much simpler than a complete redesign of the BLooper message loop. If a separate syscall for each resubscribe is considered an efficiency problem (it should be benchmarked first), a subscribe queue can be introduced, similar to BMessageQueue, so that all subscriptions are collected and sent to the kernel with a single syscall after multiple BHandler::MessageReceived calls have been processed. This does not require redesigning the BLooper event loop, just adding one extra line to process pending subscribe events.
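The subscribe-queue idea can be sketched as follows. Everything in this block is hypothetical: `subscribe_queue`, `queue_resubscribe`, `flush_subscriptions`, and the `fake_watch_objects` stand-in for the proposed syscall are invented for illustration, and coalescing repeated requests for the same object is an added assumption rather than something stated in the proposal.

```c
#define MAX_PENDING 16  /* hypothetical queue size */

/* One pending re-subscription: which object, and which events to re-arm. */
typedef struct {
    int            object;
    unsigned short events;
} pending_sub;

typedef struct {
    pending_sub pending[MAX_PENDING];
    int         count;
    int         syscalls;  /* how many times the "kernel" was entered */
} subscribe_queue;

/* Hypothetical stand-in for the proposed watch_objects() syscall;
   it only counts invocations. */
static void
fake_watch_objects(subscribe_queue* q, const pending_sub* subs, int numSubs)
{
    (void)subs;
    (void)numSubs;
    q->syscalls++;
}

/* Called from a message handler: records the re-subscription without
   entering the kernel. Returns 0 on success, -1 if the queue is full. */
static int
queue_resubscribe(subscribe_queue* q, int object, unsigned short events)
{
    /* Assumption: repeated requests for one object are merged. */
    for (int i = 0; i < q->count; i++) {
        if (q->pending[i].object == object) {
            q->pending[i].events |= events;
            return 0;
        }
    }
    if (q->count == MAX_PENDING)
        return -1;
    q->pending[q->count].object = object;
    q->pending[q->count].events = events;
    q->count++;
    return 0;
}

/* Called once after a batch of messages has been handled.
   Returns the number of subscriptions sent in the single call. */
static int
flush_subscriptions(subscribe_queue* q)
{
    int flushed = q->count;
    if (flushed > 0) {
        fake_watch_objects(q, q->pending, flushed);
        q->count = 0;
    }
    return flushed;
}
```

In this model the looper would call `flush_subscriptions()` once per batch of dispatched messages, so several handlers re-arming their events cost one kernel entry instead of one each.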

comment:19 by pulkomandy, 12 months ago

Doesn't the main BApplication thread do mostly "nothing" in most applications, as it is?

Yes, I was thinking of the 1MB of RAM or so allocated to its stack, as well as at least one port (a scarce resource currently, we can only have up to 4096 across the whole system, but we should probably change that).

but ultimately requires more complexity, what with the "resubscribe" setup X512 details above which is necessary to prevent floods due to port and BMessageQueue behavior, and the overhead from processing events as BMessages.

Using ports does not necessarily imply using BMessage. We can send any data we want to a port, and in this case I think it should be a simpler structure, probably the same object_wait_info as used for wait_for_objects (there could be several of these in a single write to the port, if there are multiple pending events).

So, I don't know if I'm convinced by x512's proposal for integration with BLooper (encapsulating the data in BMessage and trying to play nice with BMessageQueue). But for the other part of it, that is, sending messages as "raw" type containing object_wait_info structures, it seems like this would work. He says that the events in the port queue can be collapsed kernel-side to avoid sending the same event multiple times while it's pending in the queue. If this can be done in a way that a single read from the port returns multiple object_wait_info structures, this would already be very efficient and flexible, and not very difficult to use.

For BLooper integration I think I would prefer to make more changes to BLooper to fit around this new API, rather than make the API fit around the way BLooper currently works. But I have not researched this deeply enough to be able to give reasons for that.

The only reason I can see for developing something entirely new and not using ports is to save on memory copies. In that case, we could consider doing something more like io_uring. But this is a lot more complicated, and in the cases I can think of, the port-based system will probably work well enough, and also be easier to set up and debug.
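A single read that returns several structures could be consumed roughly like this. It is a sketch only: `wait_record` is a local stand-in that mirrors the object/type/events fields of object_wait_info, and `dispatch_records` and `collect_events` are hypothetical helpers; the actual message layout used by the WIP branch may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring object_wait_info (assumed layout). */
typedef struct {
    int32_t  object;
    uint16_t type;
    uint16_t events;
} wait_record;

typedef void (*record_handler)(const wait_record* record, void* cookie);

/* Walks a raw message body holding zero or more packed wait_record
   structures and calls `handler` for each one. Returns the number of
   records dispatched, or -1 if `size` is not a whole number of records. */
static int
dispatch_records(const void* buffer, size_t size, record_handler handler,
    void* cookie)
{
    if (size % sizeof(wait_record) != 0)
        return -1;

    size_t count = size / sizeof(wait_record);
    for (size_t i = 0; i < count; i++) {
        wait_record record;
        /* memcpy avoids any alignment assumption about the buffer. */
        memcpy(&record, (const char*)buffer + i * sizeof(wait_record),
            sizeof(record));
        handler(&record, cookie);
    }
    return (int)count;
}

/* Example handler: ORs together the event bits of every record. */
static void
collect_events(const wait_record* record, void* cookie)
{
    *(uint16_t*)cookie |= record->events;
}
```

The buffer here would be whatever a port read filled in; the point is that one kernel round trip can carry a whole batch of pending events.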

I did not check the code and I don't really know if that needs a lot of changes kernel-side to the implementation of ports.

comment:20 by waddlesplash, 12 months ago

Yes, I was thinking of the 1MB of RAM or so allocated to its stack

Don't we overcommit stacks (or at least this stack) by default? I think that shouldn't be a problem, then.

as well as at least one port (a scarce resource currently, we can only have up to 4096 across the whole system, but we should probably change that).

The BApplication port is of course used for file messages, BBitmap handling, etc. I think 4096 is probably too few ports, yes.

comment:21 by pulkomandy, 12 months ago

Don't we overcommit stacks (or at least this stack) by default? I think that shouldn't be a problem, then.

Then it does not use memory, but it does use address space. Only a problem for 32-bit systems really, and the address space is not going to run into problems because of 1MB reservation.

But, do we overcommit it? This seems a bad idea, what if there is no physical page available when you want to call a function? You do a segfault? How do you then call the signal handler if there is no space on the main stack?

in reply to:  21 comment:22 by X512, 12 months ago

Replying to pulkomandy:

How do you then call the signal handler if there is no space on the main stack?

Set up a dedicated signal stack.

comment:24 by pulkomandy, 10 months ago

The discussion has popped up again on IRC and I'd rather have the discussion here.

So there are several ways to implement this. Personally I'm inclined towards implementing FreeBSD kqueue, here is a short summary of why.

Linux epoll

I use this API a lot in my paid work and as a result I think I understand its limitations pretty well.

The main problem with it is that it only works with file descriptors. As a result, over the years Linux has grown a variety of file descriptor style things to do anything. inotify for watching filesystem changes. timerfd for timers. signalfd for managing signals. eventfd for a very simple interprocess signalling system. And of course sockets, pipes, and actual files.

If you want to wait on something that is not a file descriptor, you can't. For example, there is no way to integrate this with pthread mutexes, or in Haiku's case, with our native semaphores and ports.

There are also problems with managing signals as file descriptors, in particular in the context of fork/exec or posix_spawn.

Implementing epoll without also bringing in all the other file descriptor wrapping that Linux does makes it considerably less useful.
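As an illustration of that file descriptor wrapping, here is a Linux-only sketch in which an in-process "something happened" signal has to be turned into an eventfd before epoll can wait on it. It uses the real eventfd and epoll calls; the helper name `signal_and_wait` is made up for this example.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Posts to an eventfd `posts` times, waits for it through epoll, and
   returns the counter value read back. Returns -1 on error or if nothing
   was signalled. */
static int64_t
signal_and_wait(int posts)
{
    int efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(efd);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = efd;
    int64_t result = -1;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0) {
        uint64_t one = 1;
        int ok = 1;
        /* Each write adds 1 to the kernel-side counter. */
        for (int i = 0; i < posts; i++) {
            if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
                ok = 0;
        }

        struct epoll_event out;
        uint64_t value = 0;
        /* Timeout 0: report readiness without blocking. */
        if (ok && epoll_wait(epfd, &out, 1, 0) == 1
            && read(efd, &value, sizeof(value)) == (ssize_t)sizeof(value))
            result = (int64_t)value;
    }

    close(epfd);
    close(efd);
    return result;
}
```

The eventfd exists only to give epoll a descriptor to watch, which is the kind of extra plumbing an epoll port would need to bring along to be useful.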

kqueue

This is the FreeBSD approach. Unlike the Linux one, it is not restricted to file descriptors, and as a result it is more easily adjusted to wait on various other things.

You can see this as a downside, as Ariadne Conill explains very well here: https://ariadne.space/2021/06/06/actually-bsd-kqueue-is-a-mountain-of-technical-debt/

But given the current situation (we do want to wait on non-filedescriptor things), let's accept the "technical debt" and have this API with a lot of special cases.

In kqueue, adding new types of things to watch will require custom APIs every time, but it is a new function for each type of thing. This is already significantly better than the current wait_for_objects, which would require adding new arguments to the existing wait_for_objects function.

io_uring

This API is designed for very high performance levels. It is based on a shared ring buffer between kernel and userspace, which completely removes the need for syscalls as the interface to add things to watch and to retrieve events.

Its main problem is that it is considerably more complicated to use than the other ones. I would be OK with something like this as the low-level primitive we implement, since it could be possible to build the other, simpler APIs on top of it. But is it worth the effort?

Doing something custom

This is what x512 has started experimenting with.

It seems a possible upside is the ability to receive the events through a port, which can be very easily integrated into the existing BLooper code. I can see how this is useful, since I had problems making the network sockets and BLooper interact together in the "services kit" code.

I think it would also be possible to rebuild BLooper on top of an epoll/kqueue-style system, instead of getting events from only a single port. There is however some overhead to that for receiving normal messages through the port (since you have to get notified by the epoll/kqueue, and then make another syscall to actually read the port, whereas receiving a message from a port is normally done in a single syscall (a blocking read), IIRC).

Of course the downside to this approach is that it is custom. And so, that may create difficulty in porting existing code from Linux and BSD. But then again:

  • Do we want to favor porting Linux/BSD code over native BeAPI things?
  • Since Linux and BSD are so different, software supporting both is probably going to standardize on libuv as the high-level API hiding the differences. If that's the case, we may port libuv to our own custom API, and then we can easily build all that software on top of it. This raises the question: if we care about ported software, how much of it uses libuv and how much of it uses kqueue and/or epoll directly?

The existing wait_for_objects

Just for completeness since "what's wrong with wait_for_objects?" is a common question: it is a stateless system, which means:

  • The list of things to watch must be re-sent to the kernel at every syscall, if you watch a lot of things this can be inefficient
  • That also makes it difficult to implement "edge trigger" notifications (similar to EPOLLET in epoll).
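The stateless/stateful difference can be seen by comparing POSIX poll with Linux epoll. This is a sketch using those two real APIs as stand-ins (wait_for_objects itself is not used here), and the helper names `wait_stateless`, `register_stateful`, and `wait_stateful` are invented for the example.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Stateless: the full interest list crosses into the kernel on every call.
   Returns the number of ready descriptors. */
static int
wait_stateless(int fd)
{
    struct pollfd set[1] = { { .fd = fd, .events = POLLIN } };
    return poll(set, 1, 0);
}

/* Stateful, step 1: register interest once. The kernel keeps the set.
   Returns the epoll descriptor, or -1 on error. */
static int
register_stateful(int fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Stateful, step 2: each wait passes only the handle to the kernel-side
   set. Returns the number of ready descriptors. */
static int
wait_stateful(int epfd)
{
    struct epoll_event out;
    return epoll_wait(epfd, &out, 1, 0);
}
```

With one descriptor the cost is the same either way; the difference shows when the `pollfd` array holds thousands of entries that must be copied and scanned on every call, while the epoll set is built once and only ready entries come back.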

The TL;DR

From my point of view:

  • epoll is not good because it only works with file descriptors
  • io_uring is not good because it is too complicated as the high level API to expose (if we are going to pick just one API). It may still be good as the underlying way to implement things, but even then, it seems more work to get it running
  • wait_for_objects is not good because it is stateless

I think this part above is uncontroversial and agreed by everyone I discussed this with? But if I'm wrong I'd like to hear the arguments for these options and reconsider.

This leaves us with a choice of either kqueue, or x512's custom solution. kqueue is better for reusing what already works elsewhere, while x512's solution is possibly better for integration with BLooper (I put "possibly" because it is hard to say without having ever used either API myself, run benchmarks, etc).

Of course x512's solution has the advantage of being already implemented (at least partially? I don't know how complete it is).

comment:25 by tqh, 10 months ago

I like the idea of the port-based one. epoll and kqueue seem like very unfriendly APIs to me, while the BLooper API probably isn't?

We picked HVIF over SVG with gzip, did our own package manager because the others seemed worse, and did a lot of other things our own way. So, as someone who has only read the sales brochures, would the port-based API be easier to work with? I also feel that we will do a good implementation no matter what we choose, so it's all about what compromises we want. To me a good API comes high on that list.

comment:26 by trungnt2910, 10 months ago

I personally prefer the more popular API (so epoll or kqueue), though a custom API would be fine as long as it could be cleanly used in C.

I don't want something that is tightly coupled with the versatile yet complicated, private and undocumented KMessage, and that requires using C++, linking to libbe, and using the expensive BLooper just to watch simple events such as a network change.

Also, an FD would be preferred over a port, as cross-platform code might want to assume having FDs since both kqueue and epoll return them, unless Haiku could provide such a thing as a portfd.

comment:27 by pulkomandy, 10 months ago

It seems we can't satisfy everyone with a single API, so maybe we should consider if it's possible to have both: kqueue and something port-based.

I don't know the internals of the select/poll/wait_for_objects implementation to decide if that's reasonably doable, however.

comment:29 by waddlesplash, 9 months ago

Milestone: Unscheduled → R1/beta5
Resolution: → fixed
Status: new → closed

Merged in hrev57174.
