
The End Of The World (As We Know It)!

Ok, here we go:

Event-driven non-blocking I/O isn't the way anymore for high-performance network servers: blocking I/O on a bunch of threads is better now.

Wow, I can't believe I just wrote that! Here's a post that describes some of the reasons (it's talking more about Java, but the underlying reasons apply to C++ as well; it's not just JVMs getting wackier at optimizing locking). It depends on your platform, of course (things don't change from being true to being false just out of the blue!); more specifically, I have NPTL-based Linux 2.6 in mind, at the very least (NPTL for the faster futex-based synchronization, and 2.6 for the O(1) scheduler and low per-thread overhead). You also want to specify the smallest stacks you can get away with, and you want a 64-bit machine (its bigger address space means things will only explode much later).

The most important thing you need is to think and not be an idiot, but that's not really new.

And when I say "bunch of threads", I really mean it! My current "ideal design" for a web server now involves not just a thread per connection, but a thread per request (there can be multiple requests per connection)! Basically, you want one thread reading requests from the socket, and once a request is read, fork it off to a thread of its own to do its work, with the writing of the reply to the socket done on that request thread. This allows for as much pipelining as possible.

Still, event-driven I/O is not completely useless: it is still handy for protocols with long-lived connections that stay quiet for a long time. Examples are IRC and LDAP servers, although with connection keep-alive one might want to do the same in an HTTP server as well, using event notification to see that a request has arrived, then handing it to a thread to actually process it.

I also now realize that I was thinking too hard in my previous thoughts on using multiple cores. One could simply have a "waiting strategy" (be it select() or epoll), and something else to process the events (an "executor", I think some people call that?). You could then have a simple single-threaded executor that just runs the callbacks right there and then, no more fuss (think of WvStreams' post_select()), or you could have a fancy-pants thread pool, whatever you fancied. I was so proud of my little design, now it's all useless. Oh well, live and learn...


( 16 comments — Leave a comment )
May. 17th, 2008 12:35 am (UTC)
Today, multithreading is the way to go. CPU manufacturers are keeping up with Moore's Law by scaling horizontally: more cores, same power per unit. But then you have to deal with all the problems that come with concurrent programming.

At that point, why bother with asynchronous I/O? Most of the time you need the result to continue, so why not isolate the sequential task in its own thread and leave it to its business?

Maybe I should just have said:
Me too
May. 17th, 2008 04:43 pm (UTC)
Well, there used to be some problems. For example, the synchronization primitives provided by LinuxThreads (pre-2.6) had rather high overhead, and without the O(1) scheduler, more than a few hundred processes/threads would bog down the system. Memory allocators weren't as good either, which made cache lines play ping-pong between the processors. Some systems would allocate something ridiculous for the stack, like one megabyte or more, making fewer than 2000 threads possible in a single process before running out of virtual memory space.

Even now, some things are complicated. I'm told that some printf() implementations use more than 16 KB of stack, so if you run with a very small stack, some functions just become unusable.

So it's not all a given. If you had a dual quad-core Xeon and Linux 2.4, you'd still have wanted an event-based design, with multiple threads to use all the cores, yes, but not one per connection or anything like that, more like one or two per available core.
May. 17th, 2008 05:40 pm (UTC)
For the memory allocator, jemalloc seems to address most of the issues. The Gnash guys have been using it in lieu of malloc() and saw a huge performance boost. Firefox uses it :-) (for them it's mostly for small allocations, but it was developed with multi-threading in mind).

BTW, nobody uses 2.4 anymore.
May. 17th, 2008 09:07 pm (UTC)
Something like jemalloc would do the trick, yeah. There's also Hoard and tcmalloc that do well for that kind of situation, from what I'm told. I've only seen numbers for glibc's malloc vs tcmalloc, and it's a huge boost, and jemalloc uses a similar kind of design, from what I can see.

I've had to support 2.4 until not so long ago, it's been annoyingly clingy, for some reason. I think that reason was RHEL, which kept using a 2.4 kernel for a ridiculously long time, and while the new 2.6-based version was out, the kind of people who use RHEL are also really slow at upgrading...

But I was just pointing that out in the sense that back with 2.4, threads weren't sensible, and now they are.
May. 17th, 2008 10:17 pm (UTC)
There is also this http://www.hpl.hp.com/research/linux/atomic_ops/
for atomic ops (mutexes ;-))

I haven't had a look at it yet.
May. 17th, 2008 10:28 pm (UTC)
Atomics are very cool and useful, yes! And one of their most interesting characteristics is not being a mutex! ;-)

shared_ptr uses an atomic type for its reference count, if you combine that with pointing at something const (immutable) and overwriting the shared_ptr, you can make something not unlike RCU, with all the goodness that this entails!
May. 17th, 2008 02:19 am (UTC)
You know the rules!
Each time you post something I don't understand, I hijack the post with pictures of women!

May. 17th, 2008 10:18 pm (UTC)
Re: You know the rules!
That's a hacker background. In some cube farms, there would be people freaking out over that.
May. 17th, 2008 10:28 pm (UTC)
Re: You know the rules!
Thankfully, we steer clear of such dreary places!
May. 17th, 2008 10:54 pm (UTC)
Re: You know the rules!
I have been in one of these. That was a big mistake.
May. 17th, 2008 02:35 am (UTC)


I'll have to think about that. I still wonder about latency issues. How much does it cost to move data from one core's cache to another's? Wouldn't it make more sense for threads to be data-centric in order to decrease latency and make it more likely that something will be in cache?

It's clear that some level of threading is a good idea nowadays.

One interesting thing that Java has done that I think is an overall useful concept is making some data structures immutable so that no locking operations are required to access them.

This is going to require some thinking.

May. 17th, 2008 05:57 pm (UTC)
The latency is a good point, but maybe not the way you seem to think. Latency is actually a common problem for event-driven systems: in a thread-per-connection design, a thread that blocks simply lets other threads take over the CPU, while an event-driven system has to exercise extreme discipline to avoid any delays for the other connections.

For the cache, that's another good point to bring up, but it is also in favour of threads now. Good schedulers already take CPU affinity into account, and each thread (should!) operate on its own data as much as possible, whereas a multi-threaded event-driven system has each thread operating on many unrelated data structures. You can then have a "thread affinity" for each "task", so that they're hopefully scheduled on the same processor, but that's just replicating what the kernel already does for threads themselves.

I was certainly already considering "some level of threading", but still thinking that an overall event-driven system is better, with a small number of threads (some number derived from the number of cores). But now I'm thinking lots of threads with blocking I/O, and maybe keeping an event-driven part just for dealing with idle connections, no more.

I find the history of data structures in Java interesting: they started a bit naively with data structures that had a lot of locking (every method being synchronized, for example, heralded as the future back then!), and things evolved over time, now with some interesting data structures geared specifically toward multi-threading, such as concurrent hash maps.

Immutable data structures have had a good influence, indeed. IO-Lite, for example, maybe not directly inspired by that, shares the same principles. I remember seeing something where there was a reliance on garbage collection to extract the best out of this, but thankfully, reference counting (as provided by shared_ptr, say) is apparently sufficient. I need to look this up...
May. 17th, 2008 07:48 pm (UTC)

I was thinking that having a 'thread per context' instead of a 'thread per request' might be a good idea, to take maximum advantage of the affinity scheduling of most modern processors. A thread per request doesn't give the scheduler enough information, and each request in the same context may well be handled on a different CPU.

It would be nice if thread schedulers could use information about the contents of CPU caches to assign CPU affinity. Maybe some kind of special table mapping CPU number to physical memory ranges.

Though for a request of more than 10-20 milliseconds the overhead of pushing context between CPUs on the same box is likely negligible. For a request of 100-200 milliseconds the overhead of pushing context between CPUs on different boxes in the same datacenter is likely negligible.

May. 17th, 2008 10:22 pm (UTC)
What's the difference between "context" and "request" in this situation? In my example, I was thinking of something like an HTTP server that has a low amount of shared data between requests, serving static files or proxying to backend HTTP servers, for example.

Now, what you're asking about, mapping the contents of the CPU caches, is basically what having a 1:1 mapping of "tasks" (what you call a context?) to threads does. You use the same data in a thread, and the scheduler's CPU affinity mechanism tries to get you back to the CPU with the cache content it needs. Multiplexing multiple tasks in a single thread hides this information from the scheduler, unfortunately, so you end up having to do it in userspace.

I've been wishing for all sorts of things, like a non-blocking mlock() that sends an event when the pages are in core, but in the end, fixing threads to not suck is another option I should have considered, and in a way it's much more general, if done right. Maybe dangerously so, but what can you do...

About migrating between boxes in a datacenter: the kinds of loads I've seen that could use that often can't, because they also tend to operate on large datasets. Being bound by the memory bus is coming up more and more often, at the very edge of what we can do, it seems...

I'll follow-up with another post soon, keep an eye out!
May. 17th, 2008 04:35 pm (UTC)
We've been discussing this exact topic on the Gluster mailing list lately. I've been running some benchmarks and tests with a couple different blocking / non-blocking translators (plugins that add functionality, in the vernacular), and have been seeing great results with high thread-count blocking I/O scenarios.

May. 17th, 2008 05:05 pm (UTC)
If you're not already, link your application with a memory allocator that's better tuned for threads, like tcmalloc, and it might go even faster! The default glibc allocator isn't so hot with multiple threads; in some applications this makes a significant difference (see this article).