I was wrong on two points.
Number 1 Mistake:
I thought there would be a slow query in the MySQL slow query log that, with an index added, would magically make Drupal 6 faster.
There wasn't.
There were a few slow queries, but they had perfectly good indexes; as far as I can tell from the symptoms, they were slow due to thread contention, as described here:
Facebook Doubles MySQL Throughput with PMP
Unfortunately, there are three problems with the solution there.
1) It would take me forever to figure out what those guys are really doing.
2) We'd never be able to run un-approved MySQL patches like that.
3) There's no way my VP of Marketing is pounding our server hard enough for this to be the problem I'm trying to solve.
So as much as I'd love to dive in and follow in Domas' and Mark's footsteps, that's not going to happen.
Number 2 Mistake:
In my previous article, I stated that I had attempted to eliminate some obvious candidates, and one of those was:
"It only fails after a cache flush? No."
I was wrong.
So very very wrong.
I had eliminated the manual cache flushes done by our team as a source of the problem.
What I didn't know was that Drupal 6 (and 7) has a brain-dead hard-wired notion that a "Page" in cache_page should be flushed every time cron runs.
Yes, there is a setting for cache_lifetime on your admin "Performance" page, but page_set_cache ignores that silly little administrative option.
(What actually honors that setting is perhaps an even more interesting architectural decision than flushing the Page cache every cron run.)
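For the curious, the flush happens because cron clears every core cache table wholesale. Here is a rough sketch, from memory, of the relevant path in Drupal 6's system.module and includes/cache.inc; it is paraphrased, not a verbatim copy of core, so check your own tree before trusting the details:
// system_cron() clears every core cache table, including cache_page,
// on every single cron run.
function system_cron() {
  // ... flood / batch / temporary-file cleanup omitted ...
  $core = array('cache', 'cache_block', 'cache_filter', 'cache_page', 'cache_form');
  $cache_tables = array_merge(module_invoke_all('flush_caches'), $core);
  foreach ($cache_tables as $table) {
    cache_clear_all(NULL, $table);
  }
}
// cache_clear_all() with an empty $cid is the wholesale flush path: it
// removes rows whose expire column is not CACHE_PERMANENT, and it is also
// where the "minimum cache lifetime" timer logic lives (the interesting
// architectural decision mentioned above). Pages stored by page_set_cache()
// with CACHE_TEMPORARY get swept along with everything else.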
This is apparently a known issue as described here:
Drupals cron kills you
Note that a couple of high-profile Drupal community members who should know better make blatantly wrong statements in the comments.
I have broken the rules and patched Drupal core, and probably broken polls (we don't use them).
In your includes/common.inc file, you can apply a patch like this:
cat common.inc.patch_cache_page_honor_cache_lifetime
--- common.inc.original 2011-09-13 09:37:30.000000000 -0500
+++ common.inc 2011-09-13 09:39:04.000000000 -0500
@@ -2701,7 +2701,7 @@
         $data = gzencode($data, 9, FORCE_GZIP);
       }
       ob_end_flush();
-      cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_TEMPORARY, drupal_get_headers());
+      cache_set($base_root . request_uri(), $data, 'cache_page', variable_get('cache_lifetime', CACHE_TEMPORARY), drupal_get_headers());
     }
   }
 }
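For context on what that fourth argument means: cache_set()'s $expire parameter takes CACHE_PERMANENT, CACHE_TEMPORARY, or a Unix timestamp. From memory, the Drupal 6 signature looks roughly like this; check your own includes/cache.inc before relying on it:
// Paraphrased from Drupal 6's includes/cache.inc -- not a verbatim copy.
// $expire: CACHE_PERMANENT (never removed by wholesale flushes),
//          CACHE_TEMPORARY (removed on the next wholesale flush), or
//          a Unix timestamp after which the entry may be removed.
function cache_set($cid, $data, $table = 'cache', $expire = CACHE_PERMANENT, $headers = NULL) {
  // ... serializes $data and writes the row into the given cache table ...
}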
Our Drupal pages are now cached honoring the cache_lifetime setting on the Performance page (as I had expected for the past 18+ months), and in last night's loadtests we got a consistent 100 requests per second across 15 multisites, as we should.
I probably should have run a few more nights of loadtests before posting this, but I wanted to get it typed up while it was fresh, and, really, now that I know what's going on, I'm pretty confident in this solution.
Actually, I might just change that again to CACHE_PERMANENT, as caching pages for only a day when they change very seldom is just silly. But that's probably just us. Most Drupal users probably expect the behaviour of the patch above, based on the UI.
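If we do flip it to CACHE_PERMANENT, the patched line in page_set_cache() would just become something like this (untested here so far, so treat it as a sketch):
// Hypothetical variant: keep the page cached until something explicitly
// clears the cache_page table, instead of honoring cache_lifetime.
cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_PERMANENT, drupal_get_headers());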
Here's one site's zoomed-in graph comparing the 13th (with Drupal pages at 0 because they failed to respond at all to 90% of my ab -n 1000 -c 100 requests) to the 14th, when similar pages get ~100 requests per second.
Note that the highest peaks in the above graph are static images and JS/CSS files, not Drupal pages.
I'm expecting my loadtest graphs to get much smoother now, with no 0s littered everywhere.
2 comments:
To add some perspective:
With 1000 requests, 100 concurrent.
Before:
1% requests HTTP 2xx
99% requests timed out
After:
100% requests HTTP 2xx
In both cases, the requests that actually succeeded were around 100 requests per second.
But a 99% failure rate for a heavy spike in HTTP requests isn't really acceptable...
Heh. I had a similar thing with the Tapestry Web UI framework for Java. It had a setting called "development mode" that someone set when they were first creating the configuration file. That mode caused Tapestry to rescan the entire set of web page definitions on disk every second (yes, that's every 1 second). We had customers complaining that they would occasionally get out of memory exceptions from Java when there was plenty of available heap. Turns out that every time Tapestry was scanning the directories, it was creating Java objects for file I/O that used native memory. Those objects could not be reclaimed immediately by garbage collection (you had to wait for their "finalize" methods to run). Because we had a huge application with 1000's of web page definitions, we were running out of native memory. All because someone set a setting without telling anyone else...