Wherein I occasionally rant on various topics including, but not limited to, PHP, Music, and whatever other Topics I find interesting at the moment, including my brain tumor surgery of August 2010.

Thursday, June 22, 2006

PHP Downloads, Content-Disposition, Content-type, and other arcana.

Every damn day, some other poor PHP newbie goes off and finds "advice" in Google about how to force browsers to download things, and how to get the filename they want in the "Save As..." prompt box.

Invariably, these newbies are suckered in by Bad Advice (tm) from people who clearly do not read the HTTP specs.

Now I'm not claiming to have read all the HTTP specs, much less memorized them, but I did fight through this battle in the days of version 3 and 4 browsers, and it still seems to be tripping people up, while my solution has been working for me since nineteen-ninety-mumble.

So I'm gonna let you in on a few little secrets.

Okay, one of the secrets is widely published in the RFCs. The other is hard-won experience about how the 1995 johnny-come-lately Content-disposition header (borrowed from the MIME mail world of RFC 1806, not from the HTTP spec proper) isn't really very widely supported, and how to make 100% certain that your downloads always end up prompting the user for the right filename, with a crude but effective hack that leaves the browser with no other reasonable choice.

First, if you Google for this topic, you're going to find a lot of people suggesting a lot of different MIME-types to use for the "Content-type" header:

application/octet-stream
application/download
application/force-download
.
.
.


Note that this is to force a download -- If you actually want a browser to display (or attempt to display) some content, the Content-type: should always be the correct Content-type for that type of document. E.g. text/html for HTML, image/gif for a GIF, image/jpeg for a JPEG (image/jpg will not work on some browsers), image/png for a PNG, application/pdf for a PDF, and so on.
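A minimal sketch of that rule for the display case, assuming a hypothetical helper named displayContentType() and covering only the types named above:

```php
<?php
//Hypothetical helper, not part of the original scripts:
//map a file extension to the Content-type to send for in-browser display.
function displayContentType($filename)
{
    $types = array(
        'html' => 'text/html',
        'gif'  => 'image/gif',
        'jpg'  => 'image/jpeg', //image/jpeg, never image/jpg
        'jpeg' => 'image/jpeg',
        'png'  => 'image/png',
        'pdf'  => 'application/pdf',
    );
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    //Unknown extension: return null so the caller has to decide, rather than guess
    return isset($types[$extension]) ? $types[$extension] : null;
}
```

A caller would send header('Content-type: ' . $type) only when the helper returns something other than null.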

If you are ever in doubt about the correct MIME-type for browser display, find a static URL that works, preferably on the corporate site driving that technology, and find out what Content-type: is sent by their site for their sample content.
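One way to do that check from PHP itself is sketched below. The helper name contentTypeFromHeaders() and the sample URL are assumptions for illustration; get_headers() is a standard PHP function that returns the raw response header lines.

```php
<?php
//Hypothetical helper: pull the Content-type value out of a list of raw
//response header lines, such as the list PHP's get_headers() returns.
function contentTypeFromHeaders(array $headerLines)
{
    $found = null;
    foreach ($headerLines as $line) {
        //Header names are case-insensitive, hence stripos()
        if (stripos($line, 'Content-type:') === 0) {
            //Keep the last one seen, so the final response wins if there were redirects
            $found = trim(substr($line, strlen('Content-type:')));
        }
    }
    return $found;
}

//Usage (needs network access; the URL is a placeholder):
//$lines = get_headers('http://example.com/sample.pdf');
//if ($lines !== false) echo contentTypeFromHeaders($lines);
```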

For a download, application/octet-stream works because it's a part of the HTTP specification, and has been part of the HTTP spec, from the very beginning of HTTP specifications.

The others "work" only because they are made-up content-types, and the browser currently has no idea what to do with them.

Some equally valid Content-type would be:
asdf/asdf
abc/abc
you-can-put/anything-you-want-here
microsoft/sucks
asdfswetrkhkhkvnknsdbknilghwerthiehl/wilerywnfaksvnklgndiglkghadlgha


Only problem is, tomorrow Microsoft can choose, at their discretion, that "application/download", or any of the other made-up MIME-types means "Put it in the My documents directory", because MS knows much better than you that that is what all their users really want, and your download doesn't do what you want any more.

But they cannot change the meaning of application/octet-stream, because that is specifically reserved, in the HTTP specification, for forcing a download.

So, all of the above boils down to the following question:

Do you want to use a MIME type that happens, by sheer coincidence, to not be "taken" yet and will work today, but tomorrow might be re-defined?

Or would you rather use the documented feature that application/octet-stream will always work?

For anybody not well-versed in Tech-Talk, the correct answer is the latter, and definitely not the former.

So to force the browser to download a document, the only correct solution is:
Content-type: application/octet-stream


Anything else is a game of Russian Roulette.

With Microsoft pulling the trigger.


Now we come to the somewhat more complicated issue of getting the "Save As..." window to provide the filename you like as a default.

Let's assume, for our purposes, that you want this filename:
iwant.xyz


In an ideal world, the "Content-disposition: ... ;filename=iwant.xyz" would work perfectly for this.

... represents the disposition type (typically attachment or inline), which is largely irrelevant for the purposes of this article about forced downloads.
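For reference, here is a sketch of what building that header looks like in PHP. The 'attachment' value, the quoting of the filename, and the helper name contentDispositionHeader() are all assumptions for illustration, since the "..." above is deliberately left unspecified.

```php
<?php
//Hypothetical helper: build the header line without sending it
function contentDispositionHeader($filename)
{
    //Strip any directory part, then drop characters that could break the header line
    $safe = str_replace(array('"', '\\'), '', basename($filename));
    $safe = preg_replace('/[\x00-\x1F\x7F]/', '', $safe);
    return 'Content-disposition: attachment; filename="' . $safe . '"';
}

//In a real script: header(contentDispositionHeader('iwant.xyz'));
```

Keep in mind this is the ideal-world version; it only helps in the browsers that honor the header.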

Unfortunately, with some versions of some browsers, this Content-disposition: simply will not work.

In fact, no matter what combinations of headers you try to use, there is some minute version, such as x.y.z.37, that doesn't work even though x.y.z.36 and x.y.z.38 do work.

This is complicated by the presence or absence, and/or changes to the Content-type: and the "..." part of Content-disposition: that we're ignoring.

To document exactly which versions of which browsers do/don't work for which combination of headers, mime-types and URLs is well beyond the scope of a blog post. Perhaps a Ph.D. Thesis would be more appropriate, if somebody is desperate for a Thesis Topic and has a lot of time to spare.

And for the love of [insert deity of choice here] do not ask me to tell you which browser/version won't work with your particular solution. I have neither the time nor the inclination to do your browser-testing QA for you. I don't even like doing my own QA, much less yours.

I can only guarantee you that if you test on every minor version of every browser ever released, you will find one that does not work.

But I do have a solution for you:
Provide a URL which the browser cannot possibly mess up.

For example, this URL will mess up some browsers:
http://example.com/download.php?filename=iwant.xyz

This URL, however, the browser cannot possibly mess up:
http://example.com/download/iwant.xyz
because it's too damn simple to mess up.
The K.I.S.S. principle is the watchword here.

Don't give the browser any opportunity to screw up. Because if you do, some browser somewhere will screw up.

"But wait!", you say, "I can't do that! I need my PHP script to do the download work"

Yes, you can.

Follow the bouncing ball:

First, 'download' above may look like a directory, but it's not. It's a PHP script. It just doesn't happen to have .php on the end of it.

And iwant.xyz doesn't have to be in any particular location just because it looks like a boring static URL component.

There are a variety of ways to make this work. Apache's mod_rewrite springs to mind for Apache experts, and there are many articles online telling you how to do PHP mod_rewrite.

But, truth to tell, mod_rewrite is a real PITA to mess with. And Apache/PHP has a much easier technique available which I'll detail below.

So, there are two tasks here for this URL:
http://example.com/download/iwant.xyz

The first is to somehow get 'download' to be a PHP script, even without '.php' on the end.
And the second is to somehow make 'iwant.xyz' from the URL available to PHP.

Fortunately, the mechanics of these are very easy tasks.

It seems to be difficult for newbies to wrap their brains around the concepts, but the actual mechanics are trivial, and I'm hoping this How-To will ameliorate the difficulty of the concepts.

Let us begin by assuming your not quite working download.php PHP script looks something like this:

<?php
//register_globals is off, of course
$filename = $_GET['filename'];

//Crude cleansing to avoid ../../etc/passwd hacks
$filename = basename($filename);

//In an ideal world, you would have a specific range of legal values for $filename
//And your cleansing would test positive only for valid input
//The following line is far too restrictive in anything but this sample application
//But it's definitely the Right Way (tm) for THIS sample application
//Security can't be bought off-the-rack. It's a custom job like this
if ($filename != 'iwant.xyz') die("Did you really think I wouldn't add filename validation here?");

//I can virtually guarantee that the next line is not correct. Fix it.
//I personally would recommend that it NOT be in your webtree,
//so Bad Guys (tm) cannot bypass your application and just get it direct.
//If your webhost does not provide a non-web-tree directory, find a new host.
//This should be the complete full path to the "real" iwant.xyz file.
$basename = '/some/path/to/the/real/files/';

//Compose the actual full file path:
//If you didn't put / at the end of $basename, add it here
//Or do some fancy footwork to be sure you have the proper number of '/'s you need
//Or not, as Un*x systems ignore bogus extra '/' in a pathname anyway.
$fullname = "$basename$filename";

//For larger files, a decent browser will provide a progress meter, if you do this:
$filesize = filesize($fullname);
header("Content-length: $filesize");

//As discussed above, the only Documented Feature,
//sure-fire guaranteed way to force a download every time is:
header("Content-type: application/octet-stream");

//Now just read the file and spit it out:
readfile($fullname);
//For large files, an fopen/fread loop using feof may be more appropriate
?>
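The closing comment in that script mentions an fopen/fread loop for large files. A minimal sketch of such a loop follows; the function name streamFile() and the 8192-byte chunk size are assumptions, not something from the original script.

```php
<?php
//Hypothetical helper sketching the fopen/fread/feof loop.
//Only one chunk is held in memory at a time (provided output buffering is off).
function streamFile($fullname, $chunkSize = 8192)
{
    $handle = fopen($fullname, 'rb'); //'b' keeps binary data intact on Windows
    if ($handle === false) return false;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) break; //read error: stop rather than loop forever
        echo $chunk;
    }
    fclose($handle);
    return true;
}
```

In the script above, streamFile($fullname) would take the place of readfile($fullname).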


Now, instead of the usual 'download.php' you might expect, name this script 'download' without the '.php'

In order to convince Apache that this script really is a PHP script, even without the .php on it, create a file named '.htaccess' in the same directory as 'download' (or a 'higher' directory) and put this in it:

<Files download>
ForceType application/x-httpd-php
</Files>


The above three lines of magic force Apache to think of 'download' as a PHP script, even though .php is not part of the script name.

This assumes that your webserver has been configured with .htaccess "on" in httpd.conf. If that's not the case, then you would probably want to put those three lines directly in httpd.conf, or in a file that httpd.conf Includes.

If your Apache webserver host provides neither httpd.conf nor .htaccess to you, then I feel truly sorry for you, but cannot help you, other than to suggest finding a better host.

Your URL would then look something like:
http://example.com/download?filename=iwant.xyz

You should go ahead and build this example application now, and then we can move on to our second task of getting rid of the ?filename= part.

Here is a sample application using the above code:
http://l-i-e.com/blogger/download.php?filename=iwant.xyz

Note that you will, depending on your browser make and model, probably be prompted to "Save As..." with the filename 'download.php'

We'll be fixing that in our next task, so just change the name to 'iwant.xyz' by hand when prompted to download.

But if you can find a browser that does not treat that file as a download, I'll send you a Cookie.

Note that you can configure some browsers to just auto-save all downloads in some directory or, blech, on your desktop. That's a user-configuration choice which nothing in the world is going to "fix". Sorry. Educate the user, or live with their freedom of choice, whichever way you want to look at it.

But the browser itself is still treating the output as a 'download' even if it has been [mis-]configured to just dump the file in some random directory.

Now, on to the task of getting the URL to end in /iwant.xyz so that the browser is "fooled" into thinking it's a static URL and it will use 'iwant.xyz' as the default filename for the download window.

As you can see, your script 'download' is going to have some extra stuff at the end of it.

Apache and PHP collaborate to mostly ignore anything extra tacked onto the end of a URL, except for one crucial input they provide:
$_SERVER['PATH_INFO']


This variable is set by Apache/PHP to contain everything after your script name that is in the URL, no matter what is there.
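A small sketch of reading that variable defensively. The helper name pathInfoFrom() is hypothetical, and it takes the $_SERVER array as an argument only so it can be tried with sample data.

```php
<?php
//Hypothetical helper: read PATH_INFO without tripping over a missing key
function pathInfoFrom(array $server)
{
    //PATH_INFO is typically not set at all when nothing follows the script name
    return isset($server['PATH_INFO']) ? $server['PATH_INFO'] : '';
}

//For a request to http://example.com/download/iwant.xyz one would expect
//pathInfoFrom($_SERVER) to give '/iwant.xyz'
```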

Now, because your real application might need more input than just 'iwant.xyz' I'm going to go above and beyond here, and provide an include file that will give a lot of flexibility.

Here are some URLs the normal way, and some done my recommended way compared side-by-side:

Normal:      http://example.com/download.php?filename=iwant.xyz
Recommended: http://example.com/download/iwant.xyz

Normal:      http://example.com/download?filename=subdirectory/iwant.xyz
Recommended: http://example.com/download/subdirectory/iwant.xyz

Normal:      http://example.com/download?page=42&line=20&filename=iwant.xyz
Recommended: http://example.com/download/page=42/line=20/iwant.xyz


Keep in mind on that last one, that the browser cannot know that you don't have directories named 'page=42' and 'line=20' no matter how odd that may seem for directory names.

Those are perfectly valid directory names, and the browser has to assume that's what you have.

Only you and I will know that 'download' isn't a directory but a PHP script, and those 'extra' bits are really just inputs to this PHP include:

<?php
//Consider a URL such as:
// http://example.com/scriptname/var1=val1/var2=val2/path/to/filename.xyz
//Transform it into:
// $PATH = '/path/to/filename.xyz'
// $PATH_VARS['var1'] = 'val1';
// $PATH_VARS['var2'] = 'val2';
$PATH = '';
//Start with an empty array so $PATH_VARS exists even when the URL has no var=val parts
$PATH_VARS = array();
//PATH_INFO may be absent when nothing follows the script name in the URL
$path_info = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '';
$parts = explode('/', $path_info);
foreach ($parts as $part){
  $pieces = explode('=', $part);
  switch(count($pieces)){
    case 1: /* tack it on as part of a pathname */
      //Also ignore the leading '/' of PATH_INFO which turns into an empty '' from explode()
      if ($pieces[0] !== '') $PATH .= "/" . $pieces[0];
    break;
    default: /* Set up something like $_GET only with $PATH_VARS */
      $var = $pieces[0];
      // value might have = within it...
      unset($pieces[0]);
      $val = implode('=', $pieces);
      $PATH_VARS[$var] = $val;
    break;
  }
}
?>


I've commented the above script heavily, and all it does is transform the PATH_INFO that Apache and PHP provide into a couple of convenient variables:
$PATH_VARS will contain any /var=val/ in the URL as $PATH_VARS['var'] = 'val';
$PATH will contain anything else in the path as '/subdir1/subdir2/filename.xyz';
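To experiment with that transformation without a web server, the same parsing can be wrapped in a function. This is only a sketch: the name parsePathInfo() and its return shape are assumptions, not part of pathinfo.inc.

```php
<?php
//Hypothetical function form of the same parsing.
//Returns array($path, $pathVars).
function parsePathInfo($pathInfo)
{
    $path = '';
    $pathVars = array();
    foreach (explode('/', $pathInfo) as $part) {
        if ($part === '') continue; //skip the empty piece before the leading '/'
        $pieces = explode('=', $part, 2); //limit of 2 keeps any later '=' in the value
        if (count($pieces) === 1) {
            $path .= '/' . $pieces[0]; //plain path component
        } else {
            $pathVars[$pieces[0]] = $pieces[1]; //var=val component
        }
    }
    return array($path, $pathVars);
}

list($path, $vars) = parsePathInfo('/page=42/line=20/iwant.xyz');
//$path is '/iwant.xyz'; $vars is array('page' => '42', 'line' => '20')
```

Note that every value comes back as a string, exactly as it would from $_GET.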

Save that script above as 'pathinfo.inc' and change the top of your 'download' script from $filename = $_GET['filename'] into this:
require 'pathinfo.inc';
$filename = $PATH; //e.g. '/iwant.xyz', which the basename() call that follows reduces to 'iwant.xyz'


Now, you can surf to a URL like this:
http://l-i-e.com/blogger/download/iwant.xyz
and get a download window with the only reasonable choice for a default filename to "Save As..." that a browser could possibly infer from that static-looking URL: 'iwant.xyz'

It would be nice if this was a Documented Feature or if something like Content-disposition actually worked in all the minor versions of all the browsers ever released.

But, in this case, consider what else a browser could possibly do with the download window it must provide.

If you really think about this, I believe you'll come to the same conclusion I did, many years ago: This is a hack, but a reasonably safe hack, because what else can a browser do with such a simple URL, given that it must prompt the user for a file download (or auto-save the download by user choice) to remain compliant with the HTTP Spec.

Here is the final source for our download script:
http://l-i-e.com/blogger/download.phps

And the pathinfo.inc file:
http://l-i-e.com/blogger/pathinfo.phps

There are a few caveats worth mentioning here:

Unlike $_GET, $PATH_VARS cannot be made into a "super-global" so you'll have to declare it global within your functions/methods.
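A short sketch of what that declaration looks like. The function currentPage() and the 'page' variable are made up for illustration, and the first line merely stands in for what pathinfo.inc would have set.

```php
<?php
//Stand-in for what pathinfo.inc would have set at the top level of the script
$GLOBALS['PATH_VARS'] = array('page' => '42');

//Hypothetical function that needs one of the path variables
function currentPage()
{
    global $PATH_VARS; //without this line, $PATH_VARS is an undefined local variable here
    return isset($PATH_VARS['page']) ? (int) $PATH_VARS['page'] : 1;
}
```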

Actually, technically, you could use PHP's RunKit extension to force $PATH_VARS to be a "super-global" but if you've got RunKit installed on your server, you probably already know everything in this article anyway.

If you don't know what RunKit is, you should just take my word for it that you don't want it installed, but if you need convincing, let me just point out that the purpose of RunKit is to be able to re-define something like:

if ($whatever) echo $something;

so that the 'if' and the 'echo' don't do what you expect them to do anymore.

I.e., RunKit lets you re-define the actual PHP language on-the-fly, for developing a new version of the PHP language.

Let me also point out, in case it's not blatantly obvious, that savvy users can still put any damn thing they want into the URL in attempts to break your script, take over your server, and otherwise cause you much grief.

This URL munging should not be considered primarily as a "Security Measure" though there may, or may not, be some relative increased security in that finding a ? in a URL and then trying variants is probably a very common Bad Guy technique, but cramming more things onto the end of a static URL generally doesn't do anything at all, so most Bad Guys probably don't do a lot of that.

This falls into "Security through Obscurity" though, which is generally a very weak security measure, and only useful, if at all, when layered in with other, more robust, Security Measures.

Or, to make a long story short, you must still validate and cleanse any data coming from $PATH and $PATH_VARS, exactly as you would for $_GET.
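As a minimal sketch of that kind of cleansing, assuming a hypothetical helper validatedFilename() and a made-up list of allowed names:

```php
<?php
//Hypothetical whitelist check; the allowed names are sample data
function validatedFilename($path, array $allowed)
{
    $filename = basename($path); //drops directories, including any ../ tricks
    //Strict comparison against known-good names: test positive for valid input only
    return in_array($filename, $allowed, true) ? $filename : null;
}

//In the download script this might look like:
//$filename = validatedFilename($PATH, array('iwant.xyz'));
//if ($filename === null) die('Invalid filename');
```

What belongs in the allowed list depends entirely on your application; a database lookup or a directory listing would serve the same purpose.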

Required Reading:
http://phpsec.org/


I already know that at least one other PHP Developer thinks I'm daft to put /var=val/ into the URL, and that I should just use 'positional' elements.

Unfortunately, I do not find that very convenient, as some of the elements in my scripts are optional, so the URL would end up needing too many '/////' in it and my eyes are too old and worn-out to attempt to count those correctly.


I can also safely predict some bloggers will insist that "Content-disposition:" works just fine in all browsers, or maybe they'll be smart and qualify it as "all modern browsers".

My only possible responses to that are:

  1. You haven't tested enough minor release versions

  2. I believe backwards-compatible legacy support for ancient browsers is important



If you do not like this particular solution, just don't use it.

I happen to believe, based on my experience fixing far too many bug reports from iconoclastic users of niche browsers you may have never even heard of, that it's the only correct solution to browser insanity, particularly if you use PHP to output dynamic rich media such as Images (GIF, JPEG, PNG), PDF, FDF, Flash/Ming, and so on, as I have done.

The pathinfo.inc file above works wonderfully for a URL such as:
http://example.com/thumbnail/max_width=100/photographer_id=7/artist_id=15/rockstar.jpg
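The values in a URL like that arrive as untrusted strings, so a thumbnail script would want to bound them before use. A sketch, where the helper name thumbnailWidth(), the default of 100, and the 1 to 1000 range are all made-up choices for illustration:

```php
<?php
//Hypothetical helper: turn the raw max_width string from $PATH_VARS into a bounded integer
function thumbnailWidth(array $pathVars)
{
    $raw = isset($pathVars['max_width']) ? $pathVars['max_width'] : '';
    //Accept one to four digits only; anything else falls back to the default
    if (!preg_match('/^[0-9]{1,4}$/D', $raw)) return 100;
    return max(1, min(1000, (int) $raw));
}
```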

It also works for the PDF URL embedded in an FDF, which tends to drive Netscape/IE crazy if you start adding dynamic elements.

This rant was actually referenced by none other than PHP Security Expert Chris Shiflett in The Adobe PDF XSS Vulnerability