Wherein I occasionally rant on various topics including, but not limited to, PHP, Music, and whatever other Topics I find interesting at the moment, including my brain tumor surgery of August 2010.

Tuesday, October 11, 2011

Drupal 6 Performance, Part 3

After weeks of no problems in STAGING and loadtests looking great, it turns out there's just one teensy little flaw in caching Drupal pages longer than a couple days...

Drupal nukes the CSS/JS optimized/consolidated temp files every couple days.

So the monitoring sees a nice valid HTML from an HTTP 200, but humans see a theme-less site.

Worse, the thing corrects itself somehow, or our guys jump on it and flush the cache, and it's all good...

For a couple days.

Rinse, repeat.
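
In the meantime, the monitoring needs to check more than the HTTP status code. A minimal sketch of a check that would actually notice the theme-less page (not our actual check; the URL and the regex are made up):

<?php
// Hypothetical monitoring check: the page itself can be a perfectly valid 200
// while the aggregated CSS file it references has been deleted out from under it.
$page_url = 'http://staging.example.com/';
$html = @file_get_contents($page_url);
if ($html === false) {
  exit("FAIL: could not fetch $page_url\n");
}
// D6 drops aggregated CSS under files/css/ with a hash for a name.
if (!preg_match('#href="([^"]*files/css/[^"]+\.css)"#', $html, $m)) {
  exit("FAIL: no aggregated CSS link found on $page_url\n");
}
$css_url = $m[1];
if (strpos($css_url, 'http') !== 0) {
  $css_url = rtrim($page_url, '/') . '/' . ltrim($css_url, '/');   // crude relative-to-absolute
}
$css = @file_get_contents($css_url);
if ($css === false || strlen($css) === 0) {
  exit("FAIL: page is a 200 but $css_url is missing or empty\n");
}
echo "OK\n";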

Digging even deeper into the Drupal caching source code, I'm convinced: this thing wasn't architected; it just grew.

It's a spaghetti code mess.

I defy any Drupal core dev to correctly describe the caching "architecture" of D6 pages, CSS, JS, blocks, views, etc in any coherent way.

I officially give up now.

When the VP of Marketing complains the site is slow, I'll just say "Yes, it is. That's Drupal."

Actually, instead of doing that, we'll have to spider the silly thing after every cron job to re-warm the cache. Which sucks, but there it is.
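
The re-warming itself doesn't have to be fancy. Something along these lines, run right after cron, does the job (a minimal sketch, assuming we reuse the per-site URL flat files from the nightly spider; the path is made up):

<?php
// Hypothetical cache warmer: request every known URL once so the next
// human visitor gets a warm page cache instead of a cold Drupal bootstrap.
$urls = file('/var/loadtest/urls/site1.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url) {
  $start = microtime(true);
  @file_get_contents($url);
  printf("%7.3fs  %s\n", microtime(true) - $start, $url);
}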

Wednesday, September 14, 2011

Drupal 6 Performance, Part 2

First, I was wrong.

I was wrong on two points.

Number 1 Mistake:

I thought that there would be a slow query in the mysql slow query log that, with an index, would magically make Drupal 6 faster.

There wasn't.

There were a few slow queries, but they had perfectly good indexes, and they were slow, as far as I can tell from the symptoms, due to thread contention, as described here:
Facebook Doubles MySQL Throughput with PMP

Unfortunately, there are three problems with the solution there.
1) It would take me forever to figure out what those guys are really doing.
2) We'd never be able to run un-approved MySQL patches like that.
3) There's no way my VP of Marketing is pounding our server hard enough for this to be the problem I'm trying to solve.

So as much as I'd love to dive in and follow in Domas' and Mark's footsteps, that's not going to happen.

Number 2 Mistake:

In my previous article, I stated that I had attempted to eliminate some obvious candidates, and one of those was:
"It only fails after a cache flush? No."

I was wrong.

So very very wrong.

I had eliminated the manual cache flushes done by our team as a source of the problem.

What I didn't know was that Drupal 6 (and 7) has a brain-dead hard-wired notion that a "Page" in cache_page should be flushed every time cron runs.

Yes, there is a setting for cache_lifetime on your admin "Performance" page, but page_set_cache ignores that silly little administrative option.
(What actually honors that setting is perhaps an even more interesting architectural decision than flushing the Page cache every cron run.)
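
Paraphrasing from memory (this is the gist, not verbatim core code): on every cron run, system_cron() wipes the core cache bins, including cache_page, and page_set_cache() stores pages with CACHE_TEMPORARY, which is exactly the kind of entry that wipe removes, no matter what cache_lifetime says.

// Roughly what Drupal 6 does on cron, paraphrased -- not verbatim core code.
$core = array('cache', 'cache_block', 'cache_filter', 'cache_page');
foreach (array_merge(module_invoke_all('flush_caches'), $core) as $table) {
  cache_clear_all(NULL, $table);   // clears out the temporary entries in each bin
}

// Meanwhile, page_set_cache() in includes/common.inc stores every page like this:
cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_TEMPORARY, drupal_get_headers());
// ^ hard-wired CACHE_TEMPORARY, not the cache_lifetime variable; hence the patch below.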

This is apparently a known issue as described here:
Drupals cron kills you

Note that a couple high-profile Drupal community members who should know better make blatantly wrong statements in the Comments.

I have broken the rules and patched Drupal core, and probably broken polls (we don't use them).

In your includes/common.inc file, you can apply a patch like this:

cat common.inc.patch_cache_page_honor_cache_lifetime 
--- common.inc.original 2011-09-13 09:37:30.000000000 -0500
+++ common.inc  2011-09-13 09:39:04.000000000 -0500
@@ -2701,7 +2701,7 @@
         $data = gzencode($data, 9, FORCE_GZIP);
       }
       ob_end_flush();
-      cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_TEMPORARY, drupal_get_headers());
+      cache_set($base_root . request_uri(), $data, 'cache_page', variable_get('cache_lifetime', CACHE_TEMPORARY), drupal_get_headers());
     }
   }
 }


Our Drupal pages are now cached honoring the cache_lifetime setting in Performance (as I had expected for the past 18+ months), and we got a consistent 100 requests per second across 15 multisites in loadtests last night, as we should.

I probably should have run more nights of loadtests before posting this, but I wanted to get this typed up while it was fresh, and, really, now that I know what's going on, I'm pretty confident of this solution.

Actually, I might just change that again to CACHE_PERMANENT, as caching a page for only a day when the pages change very seldom is just silly. But that's probably just us. Most Drupal users probably expect the behaviour of the patch above, based on the UI.
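
For the record, that would be the same one-line change with a different constant (not something I've actually deployed yet):

cache_set($base_root . request_uri(), $data, 'cache_page', CACHE_PERMANENT, drupal_get_headers());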

Here's one site's zoomed-in graph comparing the 13th (with Drupal pages at 0 for failing to respond at all to 90% of my ab -n 1000 -c 100 runs) against the 14th, when similar pages get ~100 requests per second.
Note that the highest peaks in the above graph are static images and JS/CSS files, not Drupal pages.

I'm expecting my loadtest graphs to get much smoother now, with no 0s littered everywhere.

Thursday, September 08, 2011

Drupal 6 Performance

There are a lot of things I like about Drupal.

Performance, however, is not one of them.

Yes, I turned caching on.
Yes, CSS and JS Optimization are on.
Yes, I used YSlow.
Yes, the GA code is at the bottom.
Yes, the images are small.
Yes, our custom HTML templates validate.
Yes, I have fought off the business unit's demands to install modules recommended by SEO "experts" (cough, cough) which are completely inapplicable to our situation. (Example: When all your content is input by staff, you really don't need that spam detector in the cloud service module. If our staff start putting in Viagra ads, we'll just fire them.)

And, yes, I have a problem.

Well, several problems, actually. But if the preceding example didn't already clue you in that the business process is quite dysfunctional (and that was only the tip of the iceberg), you haven't worked for enough large corporations, and there is nowhere near enough room in this blog post to pull that iceberg into view. I'll get back on focus now.

Here's the thing: Most of the time, Drupal is hitting its built-in cache with an indexed query that returns the entire HTML page for an anonymous user.
This is, naturally, quite acceptable, as it simply bootstraps a few too many PHP files (minor annoyance), runs a single indexed query on a URL->HTML relationship, spits out the page, and exits.
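
For the curious, that fast path looks roughly like this (paraphrased, not verbatim D6 core):

// Paraphrased: the anonymous-user fast path, early in the bootstrap.
$cid = $base_root . request_uri();
$cache = cache_get($cid, 'cache_page');   // one indexed SELECT ... WHERE cid = '...'
if ($cache) {
  // Send the stored headers, print the stored HTML blob, and we're done.
  print $cache->data;
  exit;
}
// Otherwise: finish the full bootstrap, build the page, cache_set() it on the way out.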

Unfortunately, my ad hoc "pre-emptive monitoring, quasi-loadtest, let's make sure none of these sites will make PRODUCTION fall over, and let's catch when the Drupal guy didn't turn caching on" process indicates that, occasionally, some URL performs abysmally [viz].

I am not an expert in testing, nor Drupal, nor mysql. But I have spent 25+ years specializing in PHP applications, so I know a little bit about all those things. Even if my methodology is a bit "off", I am pretty confident the results are still "close enough".

Specifics:

I have 15 multisites same-codebase different-database D6 sites on a STAGING box, where business units QA (sort of) and approve (or not, or shouldn't have but did) before launch to PRODUCTION. Staging is a 2-tier setup, with 1 www node and 1 mysql node. We're running stock RHEL plus EPEL builds for the most part. PHP 5.3, MySQL 5.1. If you've done Drupal multisite, I haven't done anything particularly weird or interesting here.

There are also a couple of "baseline" sites, with just static HTML pages, a "phpinfo();" page, and a "mysql_connect(); mysql_query('SELECT VERSION()');" page, for comparison purposes with the real sites, just so I know there's nothing inherently wrong with Apache, PHP, or PHP-to-MySQL.

At night, while real people sleep, I spider each site on a configurable schedule, mostly weekly, and store all the URLs in a flat file per site. In addition, every 5 minutes, an ab loadtest fires for the "next" site in a queue for a randomly-selected URL from that site:

ab -n 1000 -c 100 $random_url_here
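
The driver behind that is nothing fancy. Here's a sketch of the sort of thing cron fires every 5 minutes (the paths and the site queue file are made up for illustration):

<?php
// Hypothetical 5-minute cron job: pick the "next" site, pick a random
// spidered URL from its flat file, and fire one ab run at it.
$sites = file('/var/loadtest/sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$pointer_file = '/var/loadtest/next_site';
$i = (int) @file_get_contents($pointer_file);
$site = $sites[$i % count($sites)];
file_put_contents($pointer_file, ($i + 1) % count($sites));

$urls = file("/var/loadtest/urls/$site.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$url  = $urls[mt_rand(0, count($urls) - 1)];

$out = "/var/loadtest/results/$site." . date('YmdHis') . '.txt';
file_put_contents($out, "URL: $url\n");   // prepend a line saying which URL was requested
shell_exec('ab -n 1000 -c 100 ' . escapeshellarg($url) . ' >> ' . escapeshellarg($out) . ' 2>&1');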


Some of the URLs are images. Some are CSS/JS. Some are actual Drupal nodes.

I record the ab result files, prepending a line to tell me which URL was requested.

I also log the requests per second in a database table, counting a big fat 0 for any errors I can detect. For example, if there were any non-2xx HTTP response codes, it's counted as a complete failure in my book. If any of the 1000 requests never came back, again, big fat 0. (This is where testing experts are probably crying in their coffee. Sorry.)
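
Scoring one ab run is just a couple of regular expressions against its text output. A sketch of that policy, with "any detectable error" being exactly as blunt as it sounds:

<?php
// Hypothetical scorer for one ab output file: any detectable failure => a big fat 0.
function score_ab_run($file) {
  $text = file_get_contents($file);
  $complete = preg_match('/^Complete requests:\s+(\d+)/m', $text, $m) ? (int) $m[1] : 0;
  $failed   = preg_match('/^Failed requests:\s+(\d+)/m', $text, $f) ? (int) $f[1] : 0;
  $non2xx   = preg_match('/^Non-2xx responses:\s+(\d+)/m', $text, $n) ? (int) $n[1] : 0;
  if ($complete < 1000 || $failed > 0 || $non2xx > 0) {
    return 0;
  }
  preg_match('/^Requests per second:\s+([\d.]+)/m', $text, $r);
  return (float) $r[1];
}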

I have some kludgy hand-made PHP|GD graphs.
(Yes, I'm old-school and can visualize and draw a halfway decent graph in PHP code faster than I can fight your fancy new-fangled tools to get a really nice graph. Sorry.)
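
For the curious, "kludgy hand-made PHP|GD graph" means nothing more exotic than this sort of thing (a stripped-down sketch with fake data, not the real script):

<?php
// Minimal sketch: plot requests/second over time as a green line on black.
// In real life, $points comes from the database table of logged ab results.
$points = array(210, 180, 0, 265, 240, 0, 300, 290);   // fake data for illustration
$w = 400; $h = 200;
$max = max(1, max($points));

$img   = imagecreatetruecolor($w, $h);
$green = imagecolorallocate($img, 0, 200, 0);
foreach ($points as $i => $rps) {
  $x = (int) ($i * ($w - 1) / max(1, count($points) - 1));
  $y = ($h - 1) - (int) ($rps / $max * ($h - 1));
  if ($i > 0) {
    imageline($img, $prev_x, $prev_y, $x, $y, $green);
  }
  $prev_x = $x;
  $prev_y = $y;
}
header('Content-Type: image/png');
imagepng($img);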

Here is what a sample site graph looks like:

(The density on the left is because I used to run the loadtests all day every day, but scaled back to night-time in the middle of this process. Yes, testing experts everywhere are crying in their coffee. Sorry.)

First thing to look at is the bottom center stats min/avg/max: 0/265/1010
That means this site is getting a very acceptable "average" 265 requests per second across all the ab runs you see in the graph.
And that 1010 means that one URL excelled and had 1,010 requests per second in an ab run.
Alas, there is a 0 for that min, and that's the problem.

In fact, looking at the green lines of the graph itself, you see a fair number of 0 data points.

I won't bore you with hundreds of ab text outputs, nor the tedious process of matching them up with the dots on the graph, but here's the basic trend:
Static CSS/JS and even images come in first place with the peaks in the hundreds of requests per second.
Most Drupal URLs come in acceptable levels around 100 requests per second.
Occasional Drupal URLs just plain bomb out with, for example, only 8 successful requests and 992 failures, and get the big fat 0.

I have also attempted to eliminate the following:
  • The same URL fails all the time? No.
  • It only fails after a cache flush? No.
  • It only fails when a loadtest runs so long it overlaps with another? No, I added code to ensure only one loadtest runs at a time (see the lock-file sketch after this list). (Testing experts crying over coffee. Sorry.)
  • Only certain sites? No.
  • MySQL hasn't been tuned? No. We tuned it.
  • APC isn't working? No. It's fine.
  • Uncached Drupal performance isn't that bad? Yes, it is. I measured it with the same (sloppy) metrics above. A simple minimal setup with stock D6 and a few pages / modules falls over after 8 requests under heavy load, without caching.
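
The "only one loadtest at a time" guard mentioned above is just a lock file (a sketch; flock does the heavy lifting, and the path is made up):

<?php
// Hypothetical guard: bail out if another ab loadtest is still running.
$lock = fopen('/var/loadtest/loadtest.lock', 'c');
if (!$lock || !flock($lock, LOCK_EX | LOCK_NB)) {
  exit("Another loadtest is still running; skipping this slot.\n");
}
// ... kick off the ab run here ...
flock($lock, LOCK_UN);
fclose($lock);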

So my theory, and I admit it's not really proven, since I don't have the mad Drupal skillz to log the info to prove it, is that the setting in Performance "Minimum cache lifetime: <none>" does not, in fact, keep URLs in the cache indefinitely, and something in Drupal purges them, for reasons beyond my ken.

NOTE:
These are brochureware sites with a hundred nodes each, and a couple of webforms. There are the usual Drupal modules one would expect for that. Plus all the modules the SEO "experts" (cough, cough) told them they need, which I was forced to add, and which they never actually figured out how to configure and use.

A lot of Drupal experts have recommended PressFlow and varnish and so on. We are considering that. But even a drop-in replacement isn't that easy to manage when we are rapid-fire launching a new multisite each week. We simply don't have the bandwidth for that right now.

And, ultimately, I find it offensive that uncached Drupal falls over after 8 requests on a 2-tier php/mysql 6G RAM setup. (Assuming my theory is correct.)

It's like somebody bolted a billboard to the back of our Maserati.

Then tossed some kind of broadcast instant replay device named "cache" in front of the billboard to mask the problem.

And are now recommending another broadcast replay device in front of the "cache" called "varnish".

So am I just going to whine about it? No. I'm going to start analyzing queries in log_slow_queries from MySQL and whatever else I can find, and try to see if I can add some indexes to make uncached Drupal reasonably performant.
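
The plan is low-tech: pull candidate queries out of the slow query log and run EXPLAIN on them before and after adding an index. Something like this (a sketch; the query and credentials are made up, and yes, it's the old mysql_* API, this is 2011):

<?php
// Hypothetical index check: EXPLAIN a query copied out of the slow query log.
$conn = mysql_connect('db.staging.example.com', 'drupal', 'secret');
mysql_select_db('drupal_site1', $conn);

$query = "SELECT nid FROM node WHERE type = 'webform' AND status = 1";   // example only
$result = mysql_query('EXPLAIN ' . $query, $conn);
while ($row = mysql_fetch_assoc($result)) {
  // "key" is the index MySQL chose (if any); "rows" is roughly how much it scans.
  printf("table=%s  key=%s  rows=%s\n", $row['table'], $row['key'], $row['rows']);
}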

Of course, real Drupal developers are all madly coding Drupal 8, since Drupal 7 was released a few months ago, and Drupal 6, to them, is just not worth patching, so my efforts will probably be largely irrelevant to the Drupal community at large. Sorry.

You'll have to wait for the next blog post to see if that succeeds or not. I may even have to eat some crow if I can't make any improvement at all.

Again, I do like a lot of things about Drupal. Performance, however, and the architecture to make it performant, is not one of them.

Tuesday, August 16, 2011

Xdebug and RHEL 5.7

Much ado and much misinformation about how to really profile or performance-analyze PHP is all over the Internet.

Rebuffs also exist to much of that misinformation all over the Internet.

How to actually really set yourself up to properly profile PHP, not so much...

So, here's an attempt to provide a step-by-step process, at least for RHEL 5.7.

If you run something else, you'll have to improvise a bit.

It's not like I really picked RHEL, so that probably deserves some background...

I'm mostly an old school compile from source guy for Apache, MySQL and PHP to get what I want.

I used to use RedHat/Fedora for the base Linux / desktop, and then download source for Apache, PHP, MySQL.

I switched to gentoo a few years ago, and admit I pretty much fail to update very often, but whatever.

I only have to maintain a handful of boxes, and update when there is a real need.

In my workplace, the sysadmin is a stock RHEL guy, but then he has to maintain hundreds of servers with puppet and monitor them with Nagios and...

Well, let's just say I completely understand that need.

And he HAS opened up to Dag's repos and a UIC or ULC or somesuch to get something I needed.

Alas, when you want to do serious debugging, profiling, or performance analysis, that's not good enough.

So I recently embarked on a task to build a "debug" desktop with the following goals:
  • have debug builds of php, apache, and mysql
  • have xdebug
  • have kCacheGrind / valgrind / callgrind to visualize bottlenecks
  • this box is used for debug build, nothing else, no other users but me
  • stray no further from our stock build than I have to
Of course, just installing the GUI is a pretty far straying, and I let RHEL push me into gnome rather than KDE. Not sure that was wise, but there it is.

Still, I tried to have minimal impact on the build.

Testing Experts/Pundits:
Note that I'll exit X windows once I'm ready to really test, and then fire it up again to examine the results. If having the X binaries on the box screws up my tests, so be it. I'm already replacing the REAL binaries of the code with these debug-enhanced ones anyway, so at that level, it's not "the same" anyway. Having a separate desktop with all these tools on it just didn't seem practical in my case. Your needs may vary.

I first attempted to get the debug builds of at least PHP, and hopefully Apache and MySQL, just in case.

Everything seemed to work with yum, but I spun my wheels for hours checking the phpinfo() configure line for --disable-debug and trusting its output about whether I had a debug build.

It turns out that the RHEL debug builds are actually not built into the same binary files that ./configure --enable-debug would produce. Instead, the debugging symbols are installed as separate binaries in /usr/lib/debug, and tools like gdb and strace and the all-important (to me) "httpd -X" just "know" to look in there and load those debug symbols on top of the non-debug binaries in the usual places.

That kind of blew my mind, but whatever, you know?

And why the httpd startup scripts and PHP don't know to look there is beyond me, but I suppose if I really need to force them to do it, I could find a way... I doubt I'll need that, as you'll see shortly.

Anyway, in order to actually get these debug builds, you have to do something like this:

yum install --enablerepo=rhel-debuginfo httpd-debuginfo mysql-debuginfo php53-debuginfo

Note that this installs only the debugging symbol binaries into /usr/lib/debug. If you actually want to install the software itself, you still need to do these if you didn't already:

yum install httpd mysql php53


You also want to install valgrind, which basically allows you to run your binaries under debugging conditions with a whole suite of tools. We'll look at just one tool, callgrind, because a) it's the only one I know so far and b) it's the one that makes pretty pictures that make it blindingly obvious where your bottlenecks are in your code.

yum install valgrind


Just to be sure you installed Apache and PHP, set up your usual PHP script for testing purposes:

echo "<?php phpinfo();?>" > /var/www/html/index.php

(Or edit /var/www/html/index.php with your favorite editor.)

Surf to http://localhost/ and see the PHP status page with lots of blue tables telling you everything you ever needed to know about your PHP install.

Now to prove you can generate a callgrind script, to generate a visual representation of where your code spends all its time:

/etc/init.d/httpd stop
valgrind --tool=callgrind --dump-instr=yes -v /usr/sbin/httpd -X

(Someday, RHEL will catch up to the rest of the world and that httpd will change to apache or apache2...)

Note that you are now running a single httpd instance/child/thread with debugging on full blast, instead of your usual high-performance httpd process.

You definitely do not want to do this on a production box, or even a shared dev box.

You need to be doing this on a sandbox all your own.
Really.
If you can't figure out why, stop reading now and go do something else.

Now, reload that phpinfo page in your browser.

Finally, stop that httpd process. Open another shell and do:

callgrind_control -k

You might be tempted to just killall -9 httpd instead, but if callgrind has built up a HUGE file of data, the process just dies before it writes the data out. Don't do that.

Back in the shell where you did that valgrind command, it will have exited. It will also have dumped a file whose name starts with callgrind.out. and ends with the PID (process ID) of the httpd process that was started.

You can just safely assume it's a random number, if you are not familiar with process IDs. Actually, if you're not familiar with process IDs, this article is way too advanced for you. Oh well.

You can open that file up in an editor if you like.

It won't make much sense, really, but it's kind of cool to see it.

Don't change anything in that file, for goodness' sake. We've worked very hard to build it.

Since you can't read the silly thing, let's install some software that can: kCacheGrind:

yum install kdesdk

Don't ask me how you're supposed to figure out that kCacheGrind comes as a part of the KDE SDK tool-suite. I used Google. A lot.

Now you can run that tool to see how phpinfo() spends all its time:

kcachegrind callgrind.out.*

You have to do this in the same shell/directory where you did the valgrind, just in case you didn't figure that out on your own.

You ought to see something not unlike this:




Whoa!
What's all that, you ask?

Focus on the big splotches of color. The bigger the rectangle, the more time PHP spent in that function.

As you can see, PHP spent a lot of time calling various _dl_* functions, presumably loading up code binaries, or double-secret internal functions you should ignore for now.

But you can also see a big chunk for good old strlen

Now, obviously, you are not likely to jump in and edit PHP's strlen function and make it faster, but the point is that you now know where PHP is spending all its time in your PHP script.

Of course, with phpinfo() as the only function call, that's pretty boring.

Install your favorite framework or large application project (Drupal, WordPress, Zend Framework) or your own home-brew MVC, and load a few test pages.

Heck, just copy a whole site that you think is "too slow" onto this box in /var/www/html/ and fire it up and find out!

But wait, you say, I'm just a PHP developer. I'm not going to be re-writing PHP internals!

Well, guess what?

This genius Derick Rethans has a tool called Xdebug that you can use to do this same thing for your PHP code.

You'll have to compile it from source, because there are no packages for Xdebug yet.

Or use PECL, if you've managed to install that.

I have not installed PECL yet, so I just downloaded the "source" and did the usual for a PHP extension:

phpize
./configure
make
make install


Oh yeah. Something should be said here about libedit if you want readline-like abilities in the debugclient.

I pretty much said "Meh." and went on with life, once I realized that I'd have to fight RHEL for another hour or two to get that.

If you need that feature in your PHP debugger, you're on your own. Sorry.

Now, add a few settings anywhere you like in /etc/php.ini:

; Added xdebug settings
zend_extension="/root/installers/xdebug-2.1.2/modules/xdebug.so"

; Disallow usage of @ to suppress errors
xdebug.scream = 1

; Turn on profiling
xdebug.profiler_enable = 1

; Dump the enormous files here
xdebug.profiler_output_dir = "/tmp/xdebug"


And, so Apache can dump the files there, a couple of shell commands:

mkdir /tmp/xdebug
chown apache /tmp/xdebug
chmod 775 /tmp/xdebug


You have to restart apache for the php.ini changes to take effect:

/etc/init.d/httpd restart

Oh. It wasn't started because we killed it and ran it with valgrind before.
Oh well.

Anyway, reload your phpinfo() page (or other) from http://localhost/index.php

Now go look in your /tmp/xdebug directory for some callgrind.out.* files, and load them up into kCacheGrind.
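
If that directory stays empty, a quick sanity check tells you whether the extension and settings actually took effect (a throwaway sketch; drop it next to index.php and load it in the browser):

<?php
// Throwaway check: is Xdebug loaded, and is the profiler pointed where we think?
if (!extension_loaded('xdebug')) {
  exit("Xdebug is NOT loaded; check the zend_extension path in php.ini.\n");
}
echo 'xdebug.profiler_enable     = ', ini_get('xdebug.profiler_enable'), "\n";
echo 'xdebug.profiler_output_dir = ', ini_get('xdebug.profiler_output_dir'), "\n";
echo is_writable(ini_get('xdebug.profiler_output_dir'))
  ? "Output dir is writable.\n"
  : "Output dir is NOT writable by the apache user.\n";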

Voila! You now know where your PHP code is spending all its time, just by looking for the "big" rectangles.

kCacheGrind has a zillion options and tools to slice and dice your data, and different ways to show it off.

It also can export the pretty pictures, if you have graphviz and kghostview.

GraphViz is pretty easy. Snag the repo definitions from GraphViz RHEL Repo and put it in /etc/yum.repos.d/graphviz-rhel.repo file.


yum install graphviz


Finding kGhostView was quite frustrating. It seems like RHEL is intent on stamping out KDE, and kGhostView in particular since 2002 or so. I found all kinds of rpms from that era, and not much after.

So we turn to our friends at EPEL, and use their rpm which installs and maintains their repo definitions into /etc/yum.repos.d. Did you follow that?

Just download the appropriate rpm from here:
EPEL rpm to expand yum

Now at this point, with EPEL installed, I pretty much opened myself up to straying very far from RHEL, but all I want is kghostview so kCacheGrind can export pretty images:

yum install kghostview


Woof.


I'm pondering if I should now uninstall the EPEL stuff, just to stay on the straight and narrow, and let kghostview lag, or try to remember to only use EPEL in dire circumstances...

Well, anyway, you probably don't care about that bit.

You now should have a very powerful tool suite to really find the bottlenecks in your PHP code, instead of trying some Voodoo Analysis to "optimize" things like changing all "" to '' or using a single (un-indexed) query instead of two (indexed) queries, because running more queries is slower, or...

I know, I promised not to go there in the first paragraph, so I'll shut up now.