He's unfortunately correct about a shared-everything concurrency model being too hard for most people, mainly because the average programmer has a lizard's brain. There's not much I can do about that, unfortunately. We might be having an issue of operating systems here, rather than languages, for that aspect. We can fake it in our Erlang and Newsqueak runtimes, but really, we can only pile so many schedulers up on each others and convince ourselves that we still make sense. That theme comes back later in this post...
His analogy of memcached with NUMA is also to the point. While memcached is at the cluster end of the spectrum, at the other end, there is a similar phenomenon with SMP systems that aren't all that symmetrical, multi-cores add another layer, and hyper-threading yet another. All of this should emphasize how complicated writing a scheduler that will do a good job of using this properly is, and that I'm not particularly thrilled at the idea of having to do it myself, when there's a number of rather clever people trying to do it in the kernel.
What really won me over to threading is the implicit I/O. I got screwed over by paging, so I fought back (wasn't going to let myself be pushed around like that!), summoning the evil powers of mlockall(). That's where it struck me that I was forfeiting virtual memory, at this point, and figured that there had to be some way that sucked less. To use multiple cores, I was already going to have to use threads (assuming workloads that need a higher level of integration than processes), so I was already exposed to sharing and synchronization, and as I was working things out, it got clearer that this was one of those things where the worst is getting from one thread to more than one. I was already in it, why not go all the way?
One of the things that didn't appeal to me in threads was getting preempted. It turns out that when you're not too greedy, you get rewarded! A single-threaded, event-driven program is very busy, because it always finds something interesting to do, and when it's really busy, it tends to exhaust its time slice. With a blocking I/O, thread-per-request design, most servers do not overrun their time slice before running into another blocking point. So in practice, the state machine that I tried so hard to implement in user-space works itself out, if I don't eat all the virtual memory space with huge stacks. With futexes, synchronization is really only expensive in case of contention, so that on a single-processor machine, it's actually just fine too! Seems ironic, but none of it would be useful without futexes and a good scheduler, both of which we only recently got.
There's still the case of CPU intensive work, which could introduce trashing between threads and reduced throughput. I haven't figured out the best way to do this yet, but it could be kept under control with something like a semaphore, perhaps? Have it set to the maximum number of CPU intensive tasks you want going, have them wait on it before doing work, post it when they're done (or when there's a good moment to yield)...
One of the great ironies of this, in my opinion, is that Java got NIO almost just in time for it to it to be obsolete, while we were doing this in C and C++ since, well, almost forever. Sun has this trick for being right, yet do it wrong, it's amazing!