Wherein I occasionally rant on various topics including, but not limited to, PHP, music, whatever else I find interesting at the moment, and my brain tumor surgery of August 2010.

Thursday, September 08, 2011

Drupal 6 Performance

There are a lot of things I like about Drupal.

Performance, however, is not one of them.

Yes, I turned caching on.
Yes, CSS and JS Optimization are on.
Yes, I used YSlow.
Yes, the GA code is at the bottom.
Yes, the images are small.
Yes, our custom HTML templates validate.
Yes, I have fought off the business unit's demands to install modules recommended by SEO "experts" (cough, cough) which are completely inapplicable to our situation. (Example: when all your content is input by staff, you really don't need that cloud-based spam-detection service module. If our staff start putting in Viagra ads, we'll just fire them.)

And, yes, I have a problem.

Well, several problems, actually. But if the preceding example didn't already clue you in that the business process is quite dysfunctional (and that was only the tip of the iceberg), you haven't worked for enough large corporations, and there is nowhere near enough room in this blog post to pull that iceberg into view. I'll get back on focus now.

Here's the thing: most of the time, Drupal serves an anonymous user's entire HTML page out of its built-in cache with one indexed query.
This is, naturally, quite acceptable: it bootstraps a few too many PHP files (minor annoyance), runs a single indexed query on a URL->HTML mapping, spits out the page, and exits.
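
For the curious, here is roughly that fast path, paraphrased from memory out of D6's includes/bootstrap.inc and includes/cache.inc (simplified, so don't quote me on the exact column list):

// Anonymous request, page cache enabled:
$cache = cache_get($base_root . request_uri(), 'cache_page');
if ($cache) {
  drupal_page_cache_header($cache);  // sends the cached headers + HTML
  exit;
}
// cache_get() itself is essentially one indexed lookup:
//   SELECT data, created, headers, expire, serialized
//   FROM cache_page WHERE cid = 'http://example.com/some/path';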

Unfortunately, my ad hoc performance process ("pre-emptive monitoring, quasi-loadtest, let's make sure none of these sites will make PRODUCTION fall over, and let's catch when the Drupal guy didn't turn caching on") indicates that occasionally, some URL performs abysmally [viz].

I am not an expert in testing, Drupal, or MySQL. But I have spent 25+ years specializing in PHP applications, so I know a little bit about all those things. If my methodology is a bit "off", I am pretty confident the results are still "close enough".

Specifics:

I have 15 same-codebase, different-database D6 multisites on a STAGING box, where business units QA (sort of) and approve (or not, or shouldn't have but did) before launch to PRODUCTION. Staging is a 2-tier setup with 1 www node and 1 mysql node. We're running stock RHEL plus EPEL builds for the most part: PHP 5.3, MySQL 5.1. If you've done Drupal multisite, I haven't done anything particularly weird or interesting here.
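
For the record, the layout is the standard D6 multisite arrangement: one directory per host under sites/, each with its own settings.php pointing $db_url at its own database. Hostnames and credentials below are placeholders:

// sites/site1.example.com/settings.php
$db_url = 'mysqli://someuser:somepass@mysql-node/site1_db';

// sites/site2.example.com/settings.php
$db_url = 'mysqli://someuser:somepass@mysql-node/site2_db';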

There are also a couple of "baseline" sites with just static HTML pages, a "phpinfo();", and a "mysql_connect(); mysql_query('SELECT VERSION()');" for comparison with the real sites, just so I know there's nothing inherently wrong with Apache, PHP, or PHP-to-MySQL.
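
The MySQL baseline is literally nothing more than this (host and credentials are placeholders):

<?php
// Baseline: prove Apache + PHP + mysql_* hold up under load by themselves.
$conn = mysql_connect('mysql-node', 'someuser', 'somepass')
  or die('connect failed: ' . mysql_error());
$result = mysql_query('SELECT VERSION()')
  or die('query failed: ' . mysql_error());
$row = mysql_fetch_row($result);
echo $row[0];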

At night, while real people sleep, I spider each site on a configurable schedule, mostly weekly, and store all the URLs in a flat file per site. In addition, every 5 minutes, an ab loadtest fires for the "next" site in a queue, against a randomly-selected URL from that site:

ab -n 1000 -c 100 $random_url_here
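
The cron-fired runner is not much more than this sketch; the paths are placeholders, and my real script is uglier:

<?php
// One site's spider file; the real script rotates through a queue of sites.
$urls = file('/var/spool/loadtest/site1.example.com.urls',
  FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$random_url = $urls[mt_rand(0, count($urls) - 1)];
$outfile = '/var/spool/loadtest/results/' . date('YmdHis') . '.txt';
// Prepend a line naming the URL so results can be matched up later.
file_put_contents($outfile, "URL: $random_url\n");
exec('ab -n 1000 -c 100 ' . escapeshellarg($random_url) . ' >> ' . $outfile);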


Some of the URLs are images. Some are CSS/JS. Some are actual Drupal nodes.

I record the ab result files, prepending a line to tell me which URL was requested.

I also log the requests per second in a database table, counting a big fat 0 for any errors I can detect. For example, if there were any non-2xx HTTP response codes, it's counted as a complete failure in my book. If any of the 1000 requests never came back, again, big fat 0. (This is where testing experts are probably crying in their coffee. Sorry.)
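
Scoring an ab result file is crude regex work. "Failed requests", "Non-2xx responses", and "Requests per second" are the actual labels ab prints (the "Non-2xx" line only appears when there were some); the file name and the loadtest_results table are my own placeholders:

<?php
// Score one ab run: any failure or any non-2xx means a big fat 0.
$output = file_get_contents('/var/spool/loadtest/results/20110908020500.txt');
list($url_line) = explode("\n", $output);  // the "URL: ..." line prepended above
$rps = 0.0;
if (preg_match('/^Failed requests:\s+0\s*$/m', $output)
    && !preg_match('/^Non-2xx responses:/m', $output)
    && preg_match('/^Requests per second:\s+([\d.]+)/m', $output, $m)) {
  $rps = (float) $m[1];
}
mysql_query(sprintf("INSERT INTO loadtest_results (url, rps) VALUES ('%s', %F)",
  mysql_real_escape_string(substr($url_line, 5)), $rps));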

I have some kludgy hand-made PHP/GD graphs.
(Yes, I'm old-school and can visualize and draw a halfway decent graph in PHP code faster than I can fight your fancy new-fangled tools to get a really nice graph. Sorry.)
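
If you've never done the old-school thing, a bare-bones PHP/GD bar graph really is just a handful of lines; the sample data here is made up:

<?php
// $points: requests-per-second values, oldest run first (sample data).
$points = array(265, 310, 0, 180, 1010, 95);
$img = imagecreatetruecolor(400, 200);
$white = imagecolorallocate($img, 255, 255, 255);
$green = imagecolorallocate($img, 0, 128, 0);
imagefill($img, 0, 0, $white);
$max = max(max($points), 1);  // avoid division by zero on an all-0 day
foreach ($points as $i => $rps) {
  $x = (int) ($i * 400 / count($points));
  $y = 199 - (int) ($rps / $max * 199);
  imageline($img, $x, 199, $x, $y, $green);  // one vertical bar per ab run
}
header('Content-Type: image/png');
imagepng($img);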

Here is what a sample site graph looks like:

(The density on the left is because I used to run the loadtests all day every day, but scaled back to night-time in the middle of this process. Yes, testing experts everywhere are crying in their coffee. Sorry.)

The first thing to look at is the bottom-center stats, min/avg/max: 0/265/1010.
That means this site averages a very acceptable 265 requests per second across all the ab runs you see in the graph.
The 1010 means that one URL excelled, hitting 1,010 requests per second in a single ab run.
Alas, there is a 0 for that min, and that's the problem.

In fact, looking at the green lines of the graph itself, you see a fair number of 0 data points.

I won't bore you with hundreds of ab text outputs, nor the tedious process of matching them up with the dots on the graph, but here's the basic trend:
Static CSS/JS and even images come in first place with the peaks in the hundreds of requests per second.
Most Drupal URLs come in acceptable levels around 100 requests per second.
Occasional Drupal URLs just plain bomb out with, for example, only 8 successful requests and 992 failures, and get the big fat 0.

I have also attempted to eliminate the following:
  • The same URL fails all the time? No.
  • It only fails after a cache flush? No.
  • It only fails when a loadtest runs so long it overlaps with another? No, I added code to ensure only one loadtest at a time; see the sketch after this list. (Testing experts crying over coffee. Sorry.)
  • Only certain sites? No.
  • MySQL hasn't been tuned? No. We tuned it.
  • APC isn't working? No. It's fine.
  • Uncached Drupal performance isn't that bad? Yes, it is. I measured it with the same (sloppy) metrics above. A simple minimal setup with stock D6 and a few pages/modules falls over after 8 requests under heavy load, without caching.
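
The one-loadtest-at-a-time guard mentioned above is just a non-blocking lock at the top of the runner, something like this (the lock path is arbitrary):

<?php
// Bail out if a previous loadtest is still in flight.
$lock = fopen('/var/tmp/loadtest.lock', 'c');
if (!$lock || !flock($lock, LOCK_EX | LOCK_NB)) {
  exit(0);  // skip this 5-minute slot
}
// ... spider/ab work happens here ...
flock($lock, LOCK_UN);
fclose($lock);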

So my theory, and I admit it's not really proven since I don't have the mad Drupal skillz to log the info to prove it, is that the Performance setting "Minimum cache lifetime: <none>" does not, in fact, keep URLs in the cache indefinitely; something in Drupal purges them, for reasons beyond my ken.
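
If I'm reading D6's includes/cache.inc right (heavily simplified below, so verify before quoting me), the smoking gun may be that with no minimum lifetime, any cache_clear_all() on the table, e.g. from a node save, wipes every temporary entry:

function cache_clear_all($cid = NULL, $table = NULL, $wildcard = FALSE) {
  if (empty($cid)) {
    if (variable_get('cache_lifetime', 0)) {
      // With a minimum lifetime set, entries are only marked to expire.
    }
    else {
      // "Minimum cache lifetime: <none>": delete EVERY temporary entry,
      // which includes every anonymous page in cache_page.
      db_query("DELETE FROM {" . $table . "} WHERE expire != %d", CACHE_PERMANENT);
    }
  }
  // (single-$cid and $wildcard cases omitted)
}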

NOTE:
These are brochureware sites with a hundred nodes each and a couple of webforms. There are the usual Drupal modules one would expect for that, plus all the modules the SEO "experts" (cough, cough) told them they need, which I was forced to add, and which they never actually figured out how to configure and use.

A lot of Drupal experts have recommended PressFlow and Varnish and so on. We are considering that. But even a drop-in replacement isn't that easy to manage when we are rapid-fire launching a new multisite each week. We simply don't have the bandwidth for that right now.

And, ultimately, I find it offensive that uncached Drupal falls over after 8 requests on a 2-tier PHP/MySQL setup with 6 GB of RAM. (Assuming my theory is correct.)

It's like somebody bolted a billboard to the back of our Maserati.

Then tossed some kind of broadcast instant replay device named "cache" in front of the billboard to mask the problem.

And are now recommending another broadcast replay device in front of the "cache" called "varnish".

So am I just going to whine about it? No. I'm going to start analyzing queries from MySQL's slow query log (log_slow_queries) and whatever else I can find, and try to see if I can add some indexes to make uncached Drupal reasonably performant.
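
Concretely, the plan looks something like this. The table and column names below are placeholders until the slow log fingers the real offenders; log_slow_queries and long_query_time are the actual my.cnf settings:

<?php
// Hypothetical workflow once the slow log flags a query:
// 1. EXPLAIN it; type=ALL means a full table scan.
$result = mysql_query("EXPLAIN SELECT * FROM some_drupal_table WHERE some_column = 'foo'");
// 2. If so, try an index and re-measure with the same ab process above.
mysql_query("ALTER TABLE some_drupal_table ADD INDEX some_column_idx (some_column)");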

Of course, real Drupal developers are all madly coding Drupal 8, since Drupal 7 was released a few months ago, and Drupal 6, to them, is just not worth patching, so my efforts will probably be largely irrelevant to the Drupal community at large. Sorry.

You'll have to wait for the next blog post to see if that succeeds or not. I may even have to eat some crow if I can't make any improvement at all.

Again, I do like a lot of things about Drupal. Performance, however, and the architecture to make it performant, is not one of them.

2 comments:

nathan said...

I wonder if it isn't long running queries so much as a gratuitous quantity of queries ... I've submitted my prediction sir!

Richard Lynch said...

@nathan
You are half-right.
Drupal uses about 100 queries to build the simplest page the first time.
That's just bound to be problematic without the cache.
But PHP/MySQL can keep up fairly well, and fail more like 5/1000 hits, rather than failing 992/1000.
But see my follow-up article to find out where the real problem lies.
