I was wrong on two points.
Number 1 Mistake:
I thought that there would be a slow query in the mysql slow query log that, with an index, would magically make Drupal 6 faster.
There were a few slow queries, but they had perfectly good indexes, and they were slow, as far as I can tell from the symptoms, due to thread contention, as described here:
Facebook Doubles MySQL Throughput with PMP
Unfortunately, there are three problems with the solution there.
1) It would take me forever to figure out what those guys are really doing
2) We'd never be able to run un-approved MySQL patches like that
3) There's no way my VP of Marketing is pounding our server hard enough for this to be the problem I'm trying to solve.
So as much as I'd love to dive in and follow in Domas' and Mark's footsteps, that's not going to happen.
Number 2 Mistake:
In my previous article, I stated that I had attempted to eliminate some obvious candidates, and one of those was:
"It only fails after a cache flush? No."
I was wrong.
So very very wrong.
I had eliminated the manual cache flushes done by our team as a source of the problem.
What I didn't know was that Drupal 6 (and 7) has a brain-dead hard-wired notion that a "Page" in cache_page should be flushed every time cron runs.
Yes, there is a setting for cache_lifetime on your admin "Performance" page, but page_set_cache ignores that silly little administrative option.
(What actually honors that setting is perhaps an even more interesting architectural decision than flushing the Page cache every cron run.)
This is apparently a known issue as described here:
Drupals cron kills you
Note that a couple high-profile Drupal community members who should know better make blatantly wrong statements in the Comments.
I have broken the rules and patched Drupal core, and probably broken polls (we don't use them).
In your includes/common.inc file, you can apply a patch like this:
cat common.inc.patch_cache_page_honor_cache_lifetime --- common.inc.original 2011-09-13 09:37:30.000000000 -0500 +++ common.inc 2011-09-13 09:39:04.000000000 -0500 @@ -2701,7 +2701,7 @@ $data = gzencode($data, 9, FORCE_GZIP); } ob_end_flush(); - cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_TEMPORARY, drupal_get_headers()); + cache_set($base_root . request_uri(), $data, 'cache_page', variable_get('cache_lifetime', CACHE_TEMPORARY), drupal_get_headers()); } } }
Our Drupal pages now are cached honoring the cache_lifetime setting in Performance (as I expected for the past 18+ months) and we got consistent 100 requests per second across 15 multisites in loadtests last night, as we should.
I probably should have run more nights of loadtests before posting this, but I wanted to get this typed up while it was fresh, and, really, now that I know what's going on, I'm pretty confident of this solution.
Actually, I might just change that again to CACHE_PERMANENT, as caching a page only for a day when they change very seldom is just silly. But that's probably just us. Most Drupal users probably expect the behaviour of the patch above, based on the UI.
Here's one site's zoomed in graph comparing the 13th (with Drupal pages at 0 for failing to respond at all to 90% of my ab -n 1000 -c 100) but on the 14th, similar pages get ~100 requests per second.
I'm expecting my loadtest graphs to get much smoother now, with no 0s littered everywhere.