I don't usually like writing articles, because inevitably someone you know is going to stumble upon your work and realise that you don't know as much as they thought you did. With regards to mod_rewrite and mod_alias, however, I am regularly underwhelmed by the examples floating around out there whenever I go looking.
Sure, the official docs at apache.org are okay, but they really only go through the available options and don't give very many real-world examples. I hope to offer some simple tips and tricks which will solve many common problems.
.htaccess And You
If your website is hosted on Apache, you are able to edit your .htaccess file(s), and the server is set up to read them, I have one big piece of advice for you: USE YOUR .HTACCESS FILE! Your .htaccess file gives you quick access to extremely valuable site-wide changes to the way your webserver behaves. It can provide for HTTP redirects (3xx), mark documents as gone (HTTP 410), pre-set PHP environment variables, add custom MIME types, block IPs, adjust headers, set which pages to display on a server error, and much, much, much more!
If you aren't using it yet because you fear it's too complicated, just start with something simple and also very useful: setting a 404 error document. Create a page to display to users if they request a page that doesn't exist in your website. I've called mine "notfound.html".
ErrorDocument 404 /notfound.html
Place the saved .htaccess file in the root public HTML folder. Now instead of the plain "File Not Found" page, your users will be given a much nicer page to look at, ideally one which contains navigation links back to the rest of your site.
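Once the 404 page works, the other capabilities listed earlier follow the same pattern of one directive per line. Here is a minimal sketch, assuming Apache 2.4 with mod_env, mod_mime, mod_headers and mod_authz_core available and permitted in .htaccess files by your host. The file name, variable name and IP address are placeholders I made up for illustration.

```apache
# Custom page for server errors (hypothetical file name)
ErrorDocument 500 /server-error.html

# Environment variable for your scripts to read (hypothetical name and value)
SetEnv APP_ENV production

# Serve .webp files with the right MIME type
AddType image/webp .webp

# Add a response header (requires mod_headers)
Header set X-Content-Type-Options "nosniff"

# Block a single IP address (203.0.113.42 is a documentation-only address)
<RequireAll>
    Require all granted
    Require not ip 203.0.113.42
</RequireAll>
```

If one of these lines triggers an Internal Server Error, the module behind it is probably missing or not allowed in .htaccess files on your server; remove that line and check with your host.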
The Tricks
Okay, now that we've gotten the beginner stuff out of the way, let's move on to the tricks. There is so much you can do with mod_rewrite and mod_alias that it would be impossible to detail all of the sneaky things they can do. Basically I'm going to describe a few tricks which I've picked up over the years, tricks that have gotten me out of some tough situations.
Let's start with the basics. If the functionality of each of these modules were mapped out in area, it would look something like this red and blue image. Yesh, that's right. For all practical purposes mod_rewrite can do everything that mod_alias can do, plus a great deal more. So why use mod_alias at all?
Well, mostly because it's "cleaner". mod_rewrite can get pretty complicated sometimes, and if an action can be performed using mod_alias, it can usually be done with a lot fewer bytes, CPU cycles and headaches. Essentially, if you're doing a "rewrite" which doesn't have any complex conditions attached to it, you should be using mod_alias. Conversely, if you want to redirect requests to files and query strings which you don't want displayed in the browser's address bar, you should be using mod_rewrite. In my own crazy world, I consider mod_alias the brightly dressed traffic cop which you can't miss, while mod_rewrite is the shady character you often miss because he does most of his work behind the scenes.
Moving forward, I'll assume that you know which directives are members of each module. See the links at the top of this post to go to the Apache documentation.
Order of Execution
mod_rewrite rules get executed before mod_alias rules. This is important to note because this is the case regardless of the order in which you place the directives in your .htaccess file. It may not seem like a big deal at first, but it means you may end up performing simple redirects using mod_rewrite simply because you want it to happen before a mod_rewrite directive which happens later.
For example, say you have a PHP file which handles your products, hooked up with the following simple mod_rewrite directives:
RewriteCond %{DOCUMENT_ROOT}/products/$1 !-f
RewriteRule ^products/(.+)$ /products/index.php?$1 [L,QSA]
The rules above check whether a file is missing (!-f) at the requested URI within the /products directory, and if it is, they send the request as a query string to the index.php file. This turns URIs like http://www.example.com/products/COG1 into a "real" request for http://www.example.com/products/index.php?COG1 which the client never sees.
However, you also want to redirect an obsolete product to a non-product page. So you add the following:
Redirect 301 /products/Widget http://www.example.com/new-widgets
However, this doesn't work because the "products/Widget" REQUEST_URI gets caught by the mod_rewrite rule before it ever gets to the Redirect directive. This happens even if the Redirect directive comes before the RewriteRule directive in the .htaccess file! There are a number of ways you can overcome this, but the simplest is to move the Redirect above the RewriteRule by moving it from mod_alias to mod_rewrite:
RewriteRule ^products/Widget /new-widgets [R=301,L]
RewriteCond %{DOCUMENT_ROOT}/products/$1 !-f
RewriteRule ^products/(.+)$ /products/index.php?$1 [L,QSA]
Now /products/Widget will be correctly redirected with a 301 Moved Permanently HTTP status code.
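One of the other ways, if you would rather keep the Redirect in mod_alias, is to make the rewrite step aside for that one URI. This is a sketch I would expect to work given the ordering described above, not something lifted from the Apache docs: the extra RewriteCond stops mod_rewrite from touching /products/Widget, so the request is still unchanged when the Redirect directive gets its turn.

```apache
Redirect 301 /products/Widget http://www.example.com/new-widgets

# Leave /products/Widget (and anything below it) alone so mod_alias can handle it
RewriteCond %{REQUEST_URI} !^/products/Widget(/|$)
RewriteCond %{DOCUMENT_ROOT}/products/$1 !-f
RewriteRule ^products/(.+)$ /products/index.php?$1 [L,QSA]
```

The price is one more condition evaluated on every product request, which is why moving the redirect into mod_rewrite is usually the simpler fix.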
Other Quirks
Both mod_rewrite and mod_alias have a few quirks which keep them from being completely compatible. For instance, mod_alias patterns match against a REQUEST_URI which begins with the first forward slash (/), while mod_rewrite patterns exclude this slash. This is because mod_alias always matches against the full path from the root of the site, even if the .htaccess file is placed in a child directory, while mod_rewrite in an .htaccess file matches against the path relative to the directory which contains that file. Thus, if the .htaccess file is in the root public HTML directory, the following two rules are equivalent:
RewriteRule ^directory/(.*)$ /target/$1 [R=301,L]
RedirectMatch 301 ^/directory/(.*)$ /target/$1
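The difference becomes more visible when the .htaccess file lives inside the directory itself. In this sketch, which assumes the file is saved as /directory/.htaccess, mod_rewrite sees only the part of the path after directory/, while mod_alias still needs the full path from the site root:

```apache
# Saved as /directory/.htaccess (assumed location)

# mod_rewrite: the "directory/" prefix is already stripped before matching
RewriteEngine On
RewriteRule ^(.*)$ /target/$1 [R=301,L]

# mod_alias: the pattern still starts at the first slash of the full path
RedirectMatch 301 ^/directory/(.*)$ /target/$1
```

You would use one or the other, not both; they are shown together only for comparison.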
Another quirk to keep in mind is that a plain Redirect directive requires that you give a complete URI as the redirect target, while using RedirectMatch allows you to specify a relative URI as the target (as above). The mod_alias docs say a relative URI is acceptable for the Redirect directive, but it doesn't work on any server that I've tried; it throws an Internal Server Error instead.
# Valid
RedirectMatch 301 ^/directory/(.*)$ /target/$1
# Not Valid
Redirect 301 /directory/ /target/
# Valid
Redirect 301 /directory/ http://www.example.com/target/
Domain Name Consolidation
If you own a domain name, chances are that both the www. and non-www. versions lead to your website. In order to prevent links of both types from getting spread around in search engines and the rest of the internet, you should decide upon the one you'd like to be the default. It doesn't really matter which one you choose, but once you choose, you shouldn't switch.
The mod_rewrite code below will redirect all requests for http://example.com to http://www.example.com:
RewriteCond %{HTTP_HOST} ^example\.com$
RewriteRule ^(.*) http://www.example.com/$1 [R=301,L]
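If you prefer the shorter hostname as your default, the same idea works in the other direction. A sketch, again using example.com as the stand-in domain:

```apache
# Redirect www.example.com to example.com
# [NC] makes the hostname comparison case-insensitive
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

Use only one direction in a given .htaccess file; with both in place, the two redirects would bounce the client back and forth.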
Redirects - Not Just 3xx
Just because it's called a "Redirect" directive doesn't mean getting sent to a new URI is obligatory. You can also send 4xx-level responses with this directive, in which case you leave out the target URI entirely.
Here are a couple of URIs which are Gone:
Redirect 410 /old/index.html
Redirect 410 /not-here.html
Or we can block some pesky bots with an obscure response and keep them from generating 404s. HTTP 412 is Precondition Failed.
RedirectMatch 412 ^/_vti_bin/
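A related sketch: Redirect also accepts the keyword gone in place of 410, RedirectMatch lets one pattern retire a whole directory, and an ErrorDocument line gives visitors something friendlier than the stock message. The /archive/2005/ path and gone.html are made-up names for illustration.

```apache
# "gone" is the keyword form of status 410
Redirect gone /old/index.html

# Retire everything under a directory with one pattern (hypothetical path)
RedirectMatch 410 ^/archive/2005/

# Custom page sent along with the 410 response (hypothetical file name)
ErrorDocument 410 /gone.html
```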
Pseudo-Subdomains
Normal subdomains activated through most hosts act as entirely separate domains. Files are stored in different public HTML folders and so any content shared by both sites needs to be duplicated, or the HTML files of one need to refer to files on the other using full URIs. Usually the benefits of such a system outweigh the hassles, but perhaps you want a tighter subdomain setup.
If your DNS is set up to direct undefined hostnames to the basic www host, then mod_rewrite can help you mimic as many subdomains as you wish, with one amazing benefit: File fallback. Consider the following set of mod_rewrite directives:
RewriteCond %{HTTP_HOST} !^subdomain\.
RewriteRule ^.*$ - [S=2]
RewriteCond %{DOCUMENT_ROOT}/subdomain/$1 -f
RewriteRule ^(.*)$ /subdomain/$1 [L]
RewriteRule ^$ /subdomain/index.php [L]
Using this setup, whenever a client accesses your website using the domain name subdomain.example.com, the server will first check to see if the file being requested exists in the /subdomain directory. Now here is the cool bit: If the file doesn't exist there, it will fall back to the one in the regular directory!
This sort of setup is great for offering, say, multiple translations or versions of your website. To create the second website, you would just fill the /subdomain directory using the same directory structure as the main site. And if a certain file hasn't yet been translated, it will silently default to the file from the main website. In the example above, "index.php" is the filename of your directory index.
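As a concrete sketch of the translation case, here are the same rules for a hypothetical French version served at fr.example.com from a /fr directory, with a comment on what each part does. The hostname and directory name are my own placeholders.

```apache
# Not the fr. host? Skip the next two rules and serve the main site as usual.
RewriteCond %{HTTP_HOST} !^fr\.
RewriteRule ^.*$ - [S=2]

# If the requested file exists under /fr, serve that copy.
# This is an internal rewrite, so the address bar does not change.
RewriteCond %{DOCUMENT_ROOT}/fr/$1 -f
RewriteRule ^(.*)$ /fr/$1 [L]

# A request for the bare hostname gets the French directory index.
RewriteRule ^$ /fr/index.php [L]

# Anything else falls through to the file in the main site.
```

With these in place, a request for http://fr.example.com/about.html would be answered from /fr/about.html if that file exists, and from /about.html otherwise.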
Regexp Magic
The mod_rewrite documentation at apache.org does a pretty good job at giving a basic outline of regular expressions. However, there are some things it skips over.
The documentation mentions subpattern matches which can be referenced in the replacement string in the format $N, where N is the number of the subpattern match, e.g. $1, $2, etc. But did you know that you can also refer to subpattern matches within the pattern itself? While it's a documented feature of Perl Compatible Regular Expressions (PCRE), it isn't mentioned in the mod_rewrite documentation.
For example, perhaps because of some other complex redirection rules, some broken browsers or faulty robots are doubling up on directories, making requests to URIs like /docs/docs/file.doc and /products/products/Widget which are filling up your logs with 404s. You can catch all of these doubled directories with a single Redirect:
RedirectMatch 301 ^/([^/]+)/\1/(.*)$ /$1/$2
The \1 means: whatever got matched by the first parenthesized subpattern. You can see here that the pattern will thus match a directory, followed by the same directory name, followed by the rest of the path. Voila! No more doubled directories!
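The same back-reference works in a RewriteRule pattern. A sketch of the mod_rewrite equivalent, assuming the .htaccess file is in the root public HTML directory (so the pattern has no leading slash):

```apache
# \1 must repeat whatever the first ([^/]+) group matched
RewriteRule ^([^/]+)/\1/(.*)$ /$1/$2 [R=301,L]
```

With this rule, /docs/docs/file.doc would be redirected to /docs/file.doc, while /docs/other/file.doc would be left alone because "other" does not repeat "docs".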
No examination of mod_rewrite and mod_alias pattern-making is complete without a review of regular expressions direct from Perl.org. Check it out!
Preferring Files Over Directories
One of the most complex problems I've encountered while developing sounded deceptively simple. In its default setup, Apache will look for a directory match first, and then look for a file match for an incoming REQUEST_URI. However, what if you wanted Apache to prefer the file over the directory?
Generally, this only applies if you have some form of extensionless system active. So you could have a file named image.html and also a directory named /image within the same directory; where then would a request like http://www.example.com/image end up? As explained above, by default you will get a directory listing for the /image directory, rather than the contents of the image.html file.
So how to prefer files over directories? The answer is complicated because it requires overriding not just one, but several default Apache behaviours. The first thing you will need to disable is the mod_dir DirectorySlash directive:
DirectorySlash Off
When DirectorySlash is On, REQUEST_URIs which are determined to match directories automatically get a slash added to the end. We definitely don't want this; we want http://www.example.com/image to stay as it is and give us the image.html file.
Now for the mod_rewrite directives. Hold on to your butts . . .
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.*?)(\.html)?/$ /$1 [R=301,L]
RewriteCond %{DOCUMENT_ROOT}/$1 -d
RewriteCond %{DOCUMENT_ROOT}/$1.html !-f
RewriteRule ^(.*?[^/])$ /$1/ [R=301,L]
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.*)$ /$1.html [L,NC]
Jeepers! What the heck is all that? Well, using the rules above we are bypassing the normal MultiViews method to instead implement a strictly mod_rewrite solution. The above code assumes that your HTML files have the .html extension. If your content files all end in .php or .xhtml, just change the extensions in the code.
The first two lines redirect URIs ending with a slash, but matching a file with an .html extension, to the extensionless version. So they would take a URI like http://www.example.com/hello/ and forward you to http://www.example.com/hello. That's step #1! An unfortunate consequence of this is that it overrides the directory index of a directory if a sibling file exists with the same name as the directory. So even if http://www.example.com/hello/index.html exists, you cannot type http://www.example.com/hello/ to get to it. Rather, the URI http://www.example.com/hello/index is required to load this file. All other files in the /hello directory are unaffected.
The next three lines take a URI which does NOT end in a slash and check whether it matches a directory (-d) and at the same time does NOT match a file of the same name with an .html extension (!-f). If all these tests pass, the rule forwards you to a URI with a slash at the end. This is pretty much a "smarter" replacement for the DirectorySlash directive which we disabled earlier. That's step #2.
The final two lines actually hook up URIs with the correct file. They check whether the currently requested URI, with an .html extension tacked onto the end, matches an actual file, and if so, we are silently forwarded to the contents. Silently, meaning that it is not a real HTTP redirect, so the client's address bar will still display the clean URI: http://www.example.com/hello
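Putting the pieces together for a site whose content files end in .php rather than .html, a commented sketch would look like this. It is the same set of rules with only the extension swapped, so treat it as an untested adaptation and not a drop-in file.

```apache
DirectorySlash Off

# Step 1: /hello/ or /hello.php/ -> /hello when hello.php exists
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
RewriteRule ^(.*?)(\.php)?/$ /$1 [R=301,L]

# Step 2: /hello -> /hello/ when it is a directory with no sibling hello.php
RewriteCond %{DOCUMENT_ROOT}/$1 -d
RewriteCond %{DOCUMENT_ROOT}/$1.php !-f
RewriteRule ^(.*?[^/])$ /$1/ [R=301,L]

# Step 3: silently serve hello.php for /hello
RewriteCond %{DOCUMENT_ROOT}/$1.php -f
RewriteRule ^(.*)$ /$1.php [L,NC]
```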
Of course, there are other ways to accomplish an extensionless system which prefers files over directories; this just happens to be the one I use. There's always more than one way to do it, and I welcome suggestions which would make the code above better.
The End?
Well, that's all I could think of for now. If you found any errors in the examples above, please let me know. Hopefully you learned a few things, and if not, you are pretty smart!