Log in

No account? Create an account

Previous Entry | Next Entry

select() and a thread pool

Here's the challenge: a select()-based event dispatcher that can scale from single-threaded to multi-threaded efficiently, preferably as simple as possible and able to be effectively edge-triggered (even if select() itself is level-triggered).

Note that the point here is to run the event handlers concurrently. I've used threads in the past to work around deficiencies in select(), putting "quiet" file descriptors on one thread that would sleep most of the time, and "busy" file descriptors on another (since select() scales based on the number of file descriptors it watches, this made the active set more efficient), but it would still only use one thread for event handlers. It was just about sleeping more efficiently.

My first idea was that all threads would be symmetrical and would thus all sleep in select(). But that doesn't work, obviously, because as soon as one file descriptor is ready, all the threads wakes up, which is rather wasteful. Splitting the set of file descriptors between the threads isn't so good, because a scheduler would then be needed to balance the load between threads, and I prefer to leave scheduling to the kernel, as much as possible. On the other hand, this could allow better cache locality, handling the same file descriptor on the same thread (possibly bound to the processor) as often as possible, but on the third hand, let's not get ahead of ourselves.

It seems that the only way is to keep all the file descriptors together, with just one thread calling select(). When it returns, it should be examined and all the events put on a list. Then it goes into a loop of taking an event from the head of the list, posting the semaphore of the next thread (which could be itself in the single-threaded case), calling the event's handler, then waiting on the semaphore. If there was no event left, it goes back to select() instead.

There's a few good bits and a few bad bits about this design. A good bit is that the semaphores that keeps the other threads sleeping also protect the list at the same time (that's why it's posted after taking the next event). A bad bit is that at the end, we would be going into the select() before all the handlers are done, and they might want to add some file descriptors. This could be fixed by having a counter of threads busy running handlers, and the last thread that would be out of events would be the one going back into select(), but this would also make the load more "bursty" in a busy server, where there's really work all the time, but at every round of select(), all the threads but one would go to sleep, only to be reawakened.

I think, in the end, that the usual Unix solution comes to the rescue: add a file descriptor! Having the reading end of a pipe in the select() invocation, and adding a file descriptor from a handler would write to the writing end, waking up the select() as needed. A bit of a shame, since it would be unnecessary in the single-threaded case, but oh well, that's how it is sometimes...

Anyone has suggestions for improvements? They are most welcome!


( 25 comments — Leave a comment )
Mar. 7th, 2007 07:00 pm (UTC)

But then again, Burma.
Mar. 7th, 2007 07:09 pm (UTC)
It's a bit confusing sometimes, having me on your friends list, isn't it? :-)

The thing is, I have this thing where I believe where I have this concept that things I spew out should all be in a single place. Now, I use the tags to export feeds to other sites, so the technically-oriented people get just the tech stuff (and really, most of the time the slightest hints of my personal life makes them uncomfortable), but those poor schmucks on LJ, just friending me, they're doomed!

I find it very disappointing that LJ confuses all sorts of things together, like "the people I don't mind read my friends-locked posts" and "the people I want to read", for example. It'd be neat if you could do some stuff like "yeah, pphaneuf is cool, but his 'power management' and 'scalability' tags are way too hardcore for me, no thanks".
Mar. 8th, 2007 03:09 am (UTC)
Not confusing so much as "OK I wish I could help but I can't."

How about this? Create a coding/advotago/scalability/tech sub-group to your friends group.

Or there's the ever popular LJ cuts.

But really, I like reading your stuff, especially the picture stuff, 'cause I'm insanely jealous of the pretty pictures you create. I just don't like not being able to reply to cries for help when I am unable to/lack the ability to help.

Big hugs to azrhey.
Mar. 8th, 2007 09:08 am (UTC)
The friends group solution doesn't work, unfortunately, because then people would have to be friends to see those post, but they are in fact my most public and widely-disseminated posts! I get some really interesting feedback on some of those...

I use some super awesome magic software to snort in massive amounts of information daily (including what comes out of my friends' LiveJournals), and I hardly ever visit my LJ friends page (it's so quaint, having to scroll down to the next entry and all!). For those people living in the future like me, LJ cuts don't work (but thankfully, they don't help, since the awesomeness makes it easy to skim over what we're not interested in really fast). So I often forget about them entirely.

Ah, if only the LiveJournal friends page was more like Google Reader, you too could live in the future! It's nice here, and we have drinks with little umbrellas in them! ;-)

But I'll try to use them more LJ cuts nonetheless, even if I can't see (or miss) them. :-)
Apr. 6th, 2007 08:38 pm (UTC)
I've been struggling with exactly the same problem and have made exactly as much progress solving it as you have. If you ever find a panacea, I'd appreciate a shout :)
Apr. 6th, 2007 08:52 pm (UTC)
About what? Reading LJ? As I mentioned, I use NetNewsWire, and it's like crack. :-)

As for providing a "public face" from my LJ, I think I've got something worked out. Feeds can now be filtered on a tag, so there's some stuff possible there, I think. For example, see my Advogato page. I'm planning on using Feedburner (see here for example) to cobble up something useful.
Apr. 6th, 2007 08:58 pm (UTC)
But we're still faced with the problem that our LJ "friends" have to see everyhting we write on their "friends page" because you can't filter that by tag.
Apr. 6th, 2007 09:01 pm (UTC)
I guess being diligent about using LJ cuts is pretty much all that we have at the moment...
Mar. 7th, 2007 08:01 pm (UTC)
pierre, hi: more complete answer on advogato - you need message passing.

i.e. you need a different operating system, because you're not going to persuade the linux kernel developers to redesign linux from a monolithic to a microkernel. well - you can try - but good luck with it.
Mar. 10th, 2007 07:44 pm (UTC)
Re: messages
By the way...
Mar. 7th, 2007 08:57 pm (UTC)
I like reading this kinda stuff, always a good reminder how many different things you can code and still be in the dark about some.
Mar. 7th, 2007 10:11 pm (UTC)
Do you need to at all?
I brought this up on the libevent mailing list a while ago. My pseudocode looked something like this:
         while there are events:
             dequeue one
             handle it
         if someThreadPolling:
             condition wait
             someThreadPolling = true
             poll for events
             someThreadPolling = false
             fire condition

That neglects the adding/removing entries; I agree a self-pipe would be the way to go there.

But William Ahern suggested that this isn't worth the complexity: it's better to just partition the file descriptors among the threads. It seems like there'd be some problems with unbalanced ones there. I promised benchmarks: if some simple balancing scheme scales nearly linearly, there's no point in working out this more complex multithreaded loop. I haven't had the time to carry through on that, even though it's been months...
Mar. 8th, 2007 04:28 pm (UTC)
select doesn't scale with number of file descriptors being watched
By the way, I don't believe this statement:
I've used threads in the past to work around deficiencies in select(), putting "quiet" file descriptors on one thread that would sleep most of the time, and "busy" file descriptors on another (since select() scales based on the number of file descriptors it watches, this made the active set more efficient)

select is O(n) with its n argument, which is the highest file descriptor plus one. Keep in mind that an FD_SET is just a bitvector. Both userspace and the kernel have to go through every entry in the vector to at least see if it's set. And in practice, this appears to be enough work that it makes some difference.

This is in contrast to poll(), which has the behavior you describe. You can see benchmarks of the difference here. Look at the "Linux 2.1.52-pre-patch-2 (includes poll patch) + select & fastpoll.v1 patches" numbers, which are most representative of systems made in the past however many years since the benchmark was done. You can see the select() times don't go down nearly as much when checking just the high descriptors as the poll() ones do.
Mar. 8th, 2007 04:56 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Yeah, I am aware of the difference between the n argument and the number of file descriptors, but I figured I'd keep it out at this level, since they're both O(n), and quite correlated most of the time.

poll() is also less efficient than select() with densely filled set of file descriptors, which also appears when you throw enough at it. Which is why, in actuality, I used select() for the "quiet thread" (usually more quiet connections at any given time) and poll() in the "active thread". But if I had too many active file descriptors, the size of the poll() array would be the limiting factor.

I could have made something that switched between one or the other, but that was too complicated, introducing branches where there wasn't before, maintaining two sets of data structures where only one was used at any given time. I switched to epoll instead. ;-)
Mar. 8th, 2007 05:02 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Sounds like a good solution. If a server doesn't have epoll(), kqueue(), or /dev/poll, I wouldn't want to run a server on it. And unless each one is already implemented for me (i.e., libevent does what I want), I'm probably just going to pick the one I'm intending to deploy with.
Mar. 8th, 2007 09:14 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
As I was saying recently, what I really want is an edge-triggered event delivery mechanism. Really, something like epoll is in edge-triggered mode is what I'm after, and I would have considered libevent, but it's level-triggered. So I'll probably be hacking my little abstraction layer myself.

The thing is, I want something crossplatform (so it should be able to get by with just select() or poll()) and that can scale up and down, both in load and in machine specs. For the higher loads, it will have to be something good like epoll or kqueue(), but for many other situations on weird little platforms, select() will be the only way, and should be sufficient.

Is libevent getting fat, these days? A web server? Isn't that my job? :-)
Mar. 8th, 2007 09:23 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
I think libevent's level-triggered nature could be changed. I don't think much in the code really cares either way, and I've saw at one point a patch that introduced a way to check the capabilities of the currently-chosen event mechanism, and a flag for adding edge-triggered events. I don't think it ever got sent to libevent-devel, though. Niels Provos seems pretty reasonable about applying patches that make sense.

And I don't think libevent's getting too heavy. I'm not planning on using any of the fancy web stuff, but on a standard Linux system I don't care too much about the code size. If I did, I'd push for putting it into a separate .so, then everyone would be happy. (-levhttp -levent or something).

One major reason you might not want to use libevent is if it can't do what you want multithreaded-wise. I don't think that crowd will be too supportive of the sort of complicated behavior you want, and I'm not sure if you could/would want to implement it as a wrapper around libevent rather than changes to the core.
Mar. 8th, 2007 09:52 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
I think libevent is making the right decisions, considering where it is, but you're right, I the symmetrical multithreading stuff. I already use asymmetrical partitions between application domains, this symmetrical multithreading is for going wide on a single specific application, so that's pretty much how it has to be.

Note that in your example of TCP proxying, there's not much locking involved. Each direction is it's own state machine (watch for read, read into a buffer, watch for write, write the buffer out, repeat), and since each is watching only one file descriptor at a time, it can all be done lockless, even if both sides are being read from at the same time. Amazinging, isn't it?
Mar. 8th, 2007 09:53 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Amazinging, a new word that comes up when you have multiple threads writing to the same file descriptor.
Mar. 8th, 2007 10:01 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
Are you thinking of an echo server? No, in a proxy the two descriptors are very interrelated. The southbound/client-side/whatchamacallit input buffer is the same as the northbound output buffer. Likewise, the southbound output buffer is the same as the northbound input buffer. A shared lock on that buffer is definitely necessary, and furthermore operations on one descriptor can trigger operations on the other descriptor: A write from a full buffer means the other descriptor is again eligible for read. A read to an empty buffer means the other side is again eligible for write. Similar transitions exist in the opening/closing states.
Mar. 9th, 2007 06:33 am (UTC)
Re: select doesn't scale with number of file descriptors being watched
While I'm writing to the one side, I let the kernel do the input buffering on the other side, not reading at all. This allows for reading bigger chunks, in less system calls. It might result in slightly less throughput for a single connection, but the overall throughput of the system will be higher (less waking up, and when waking up, more often with a full read, on top of less locking necessary).
Mar. 9th, 2007 08:08 am (UTC)
Re: select doesn't scale with number of file descriptors being watched
Oh, now I see what you're saying. Interesting.

There's at least one corner case I don't quite understand. Say you have empty buffers and your stateful firewall sends you a pair of RST packets. You had to have been watching both descriptors, and they both become available at once. Two threads simultaneously start working on opposite ends, locking their half and attempting to read into their buffer. The read fails. Now each wants to destroy the pair. What happens?

The obvious answer would be for a failed read to lock both sides, then do the destroy. But it can't lock the other side safely without relinquishing its own lock, or the order would be inconsistent and it might deadlock. Neither can it relinquish its own lock, or the other might have destroyed it in the meantime. Seems like you have to follow a tricky procedure to destroy pairs...and (perhaps due to the hour), it's eluding me.
Mar. 10th, 2007 11:47 am (UTC)
Re: select doesn't scale with number of file descriptors being watched
You say "locking their half", which I said wasn't needed, so not the case. :-)

When they read an EOF, they "write" a shutdown(SHUT_WR) on the other end, that much is for sure. After that, we won't be getting anything to read, for sure, so we won't run this handler again.

The question is, after that, did the other side already do that? If it did, we can just close both file descriptors and destroy the corresponding state. If not, we can just return, never to be called again (quite possibly removing the interest in the read events, but we could leave that to the destructor too, it's not critical).

Enter the "half_closed" bool, protected by a mutex. It's kind of silly, because it's only ever accessed at the end of the process, but at least there won't be much contention on that mutex!

Note that for more complex interactions between multiple file descriptors, I have a containing object that can make sure all the callbacks are single-threaded between each others (something a bit like COM apartments, but not quite the same).
Mar. 10th, 2007 11:20 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
I'm talking about RST (read() returns -1, errno=ECONNRESET), not FIN (read() returns 0). In this case, I alter the linger settings and close() the opposite socket, forcing it to relay the RST. Likewise timeout settings - it doesn't make sense to send a FIN on timeout; it'd be quite deceptive. I don't know of any way to force an RST without taking down the entire socket.
Mar. 11th, 2007 03:23 pm (UTC)
Re: select doesn't scale with number of file descriptors being watched
True. It is somewhat tricky, yes... Maybe set a bool, then shut down the other way, so that when it tries to read, it gets an EOF (I'm not sure about that one, though), sees the bool and closes the sockets.

Otherwise, it's back to locks, which sucks when it's just for a specific exceptional case...
( 25 comments — Leave a comment )