Thursday, November 24, 2011

Comparing Mobile Web (HTML5) Frameworks: Sencha Touch, jQuery Mobile, jQTouch, Titanium


It’s been an exciting year for the mobile Web. Adoption of HTML5 and CSS3, improved performance in mobile browsers, and an explosion of mobile app frameworks mean it’s more feasible than ever to create rich, interactive Web experiences for mobile devices. Using a wrapper like PhoneGap, you can distribute them via the native app stores for iPhone, iPad, and Android, targeting multiple platforms with a single codebase.


Or can you?


I needed a platform for Pints — a mobile app that answers the question, “Which beer should I order?” As someone who works in Web technologies on a daily basis, I saw HTML5 and friends as an alluring option.


Pints isn’t complicated: a home screen, a few list screens, a few forms. Its greatest complexity lies at the data level: as an iPhone app destined for San Francisco bars it can’t possibly rely on an Internet connection, so it has to keep a local copy of the beer database and sync it with the server when that’s available. HTML5 has the necessary building blocks in the form of several offline storage options; it’s just a question of writing the synchronization code.
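A sketch of what that synchronization code might look like (hypothetical names — this isn't Pints' actual code), with a plain object standing in for HTML5 localStorage so the sketch runs anywhere:

```javascript
const storage = {}; // stand-in for HTML5 localStorage, so the sketch runs anywhere

function loadDb() {
  return JSON.parse(storage.beers || '{"lastSync":0,"items":{}}');
}

function saveDb(db) {
  storage.beers = JSON.stringify(db);
}

// Merge server-side changes into the local copy. fetchChangesSince is a stub
// for the network call; it returns records modified after the given timestamp.
function syncBeers(fetchChangesSince) {
  const db = loadDb();
  for (const beer of fetchChangesSince(db.lastSync)) {
    db.items[beer.id] = beer;                        // server copy wins
    db.lastSync = Math.max(db.lastSync, beer.updatedAt);
  }
  saveDb(db);
  return db;
}
```

When the network is unavailable, the app simply reads from the local copy; the next successful sync pulls only records changed since `lastSync` instead of the whole database.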


Mobile Web developers have a plethora of frameworks to do the heavy lifting for them: animated transitions, toolbars, buttons, list views, even offline storage. Most of these are new and the landscape is shifting rapidly. I started Pints in jQTouch, then migrated to jQuery Mobile, and finally rewrote the whole app (now in private beta) in Sencha Touch. Along the way I also investigated Appcelerator’s Titanium Mobile. Here’s what I found:


jQTouch


jQTouch is easy to use and relatively well-documented. It’s featured in the excellent Building iPhone Apps with HTML, CSS, and JavaScript. jQTouch takes a progressive-enhancement approach, building an iPhone-like experience on top of your appropriately-constructed HTML. It’s simple, providing a basic set of widgets and animations and just enough programmatic control to permit more dynamic behavior.


But even in my simple test app there were performance issues. Page transitions can be jumpy or missing, and there are periodic delays in responding to tap events. And while the project is technically active, the original author has moved on and development seems to have slowed.


jQTouch is available under the permissive MIT License, one of my favorite open source licenses.


jQuery Mobile


jQuery Mobile is the new kid on the block. Announced in August 2010, it’s quickly progressed to a very functional Alpha 2. It takes a similar – but more standards-compliant – approach to jQTouch and feels very much like that framework’s successor, with a broader array of UI controls and styles.


jQuery Mobile’s performance is variable (though better than that of jQTouch), particularly in responding to tap events and rendering animations. It also lacks a small number of key programmatic hooks that would permit easy creation of more dynamic apps. For instance, there’s an event that triggers when a page is about to load (i.e. slide into view), but no way to tell the associated handler code which UI element triggered the page switch, or to pass additional information to that handler. I was able to create workarounds, but I hope future versions will take a cue from jQTouch and build out this functionality a little more.


jQuery Mobile’s documentation is a little scattered but improving; I’m hopeful that it will become as robust as that of the core jQuery library. (Note that jQuery Mobile is really a mobile counterpart for jQuery UI, not for jQuery itself, on which it builds.)


jQuery Mobile is available under either the MIT or the GPL2 license.


Sencha Touch


Sencha Touch is the mobile counterpart to the Ext JS framework. Its approach differs significantly from jQTouch and jQuery Mobile: instead of enhancing preexisting HTML, it generates its own DOM based on objects created in JavaScript. As such, working with Sencha feels a little less “webby” and a little more like building apps in other technologies like Java or Flex. (It’s also a bit more like YUI than like jQuery.) I personally prefer the progressive enhancement approach, but it really is a matter of preference.


Sencha is far more extensive than its competitors, with a vast array of UI components, explicit iPad support, storage and data binding facilities using JSON and HTML5 offline storage, and more. (It’s very cool to manipulate app data in one of Sencha’s data structures and watch the corresponding list update in real time.) It’s also the only Web framework I’ve seen with built-in support for objects that stay put (like a toolbar) while others scroll (like a list).


For all that apparent extra weight, Sencha performed noticeably better and more reliably than either jQTouch or jQuery Mobile in my tests, with the exception of initial load time.


When working with a library or framework, it’s usually counterproductive to “fight the framework” and do things your own way. Given how extensive Sencha Touch is, that means your app will probably end up doing just about everything the Sencha way. I’d originally used WebKit’s built-in SQLite database for offline storage but ultimately eliminated both complexity and bugs by moving that functionality into Sencha’s data stores.
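To make the data-store idea concrete, here is a toy observable store in plain JavaScript. This illustrates only the binding pattern, not Sencha's actual API: listeners are notified whenever a record is added, which is how a bound list view can redraw itself "in real time."

```javascript
// Toy illustration of the data-store/binding idea (NOT Sencha's API).
function createStore() {
  const records = [];
  const listeners = [];
  return {
    on(fn) { listeners.push(fn); },          // a list view would subscribe here
    add(record) {
      records.push(record);
      listeners.forEach(fn => fn(records));  // notify bound views of the change
    },
    count() { return records.length; }
  };
}

// Usage: a "view" that re-renders whenever the store changes.
const beers = createStore();
beers.on(all => { /* redraw the list with `all` */ });
beers.add({ name: 'Racer 5' });
```

Sencha's stores layer persistence (including HTML5 offline storage) under the same idea, which is why moving Pints' storage into them eliminated hand-rolled glue code.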


The documentation, while extensive, has odd holes. Between those and the sheer size of the framework, I spent a lot of time fighting bugs that were difficult to trace and to understand. Responses to my questions in the developer forums were more frequent and helpful than with the other frameworks, but still ultimately insufficient. Sencha provides paid support starting at $300/year; I strongly considered purchasing it but, oddly, their response to my sales support inquiries was incredibly underwhelming given my interest in sending them money.


Sencha Touch is available under the GPL3; under a somewhat confusing set of exceptions to the GPL that seem similar to the LGPL; or under a free commercial license.


Titanium Mobile


Much like Sencha Touch, Appcelerator’s Titanium Mobile allows you to write apps using a JavaScript API. But unlike Sencha, it compiles most of your code into a native iPhone or Android app. That means it isn’t really a Web framework, but a compatibility layer or compiler. (Note that its cousin Titanium Desktop is Web-based, allowing you to write HTML/JS applications that run inside a native wrapper on the desktop.)


So Titanium allows Web developers to produce high-performance, easily skinnable native apps using JavaScript and a little XML, i.e. without learning Objective-C or Cocoa Touch. My simple test app blew away the true Web frameworks in terms of performance, and wasn’t much harder to put together.


But that advantage is also its greatest disadvantage: you can only target the platforms Titanium supports, and you’re tied to their developer tools. As if to prove this point, my test app quickly got into a state where it wouldn’t launch on the iPhone, even though it ran fine in the simulator. Titanium doesn’t include much of a debugger, and Titanium projects can’t be run and debugged in Xcode, leaving me with no real way to attack the problem.


Analysis


Rebuilding my app on three of these four frameworks was tedious but educational. I like jQTouch but have trouble believing it will evolve much from here. I’m rooting for jQuery Mobile for its simplicity and its very Web-centric approach to development…but it lacks a few key features and doesn’t perform as well as Sencha Touch.


It’s unfair to compare an Alpha 2 product with a 1.0 one, except in one respect: I need something now. Which brings me to Sencha Touch. I was initially impressed with its performance and breadth, but put off by its development style. As I’ve dug in, the holes in its documentation have been frustrating but the breadth has continued to impress me, and I’ve gotten more used to the coding style. The option for paid support is tempting, and I’d probably buy it if they’d answer my emails. But for now, Pints is a Sencha-based app.


Conclusion


I haven’t answered the big question: can a Web-based app really hold its own alongside native apps? And if so, are the challenges of getting it there worth the benefit of a single codebase?


Two weeks ago I was leaning toward no. Pints was in performance and bug hell, hanging for 10-15 seconds at a time; scrolling was choppy; and other animations were inconsistent.


But I’m hopeful again. In my next post I’ll discuss why, what I’ve learned, and my perspective on mobile Web apps today. I’ll also cover PhoneGap and other methods of distributing a Web app in a native wrapper. Stay tuned.

Monday, November 7, 2011

5 Tips to Cache Websites and Boost Speed


Often when we think about speeding up and scaling, we focus on the application layer itself.  We look at the webserver and database tiers, and optimize the most resource-intensive pages.

There's much more we can do to speed things up, if we only turn over the right stones.  Whether you're using WordPress or not, many of these principles can be applied.  However, we'll use WordPress as our test case.

Test Your Website Speed

There are web-based speed testing tools that will help with this step.  Take a look at Webpagetest, Pingdom tools and Google's Chrome plugin Pagetest, which integrates right into your browser.  If you're using Firefox, take a look at YSlow.

If you've already got the WordPress plugin W3 Total Cache installed, it also integrates with Google's speed test through an API key, right in your WordPress dashboard.

1. Reduce objects

The first thing you can do to improve page performance and response is to make the page simpler.  Fewer images, fewer posts, fewer plugins, fewer widgets and so forth all contribute to a speedier page.  Obviously you don't want to forfeit functionality, but if you are loading 20 posts, you might want to reduce them to five or ten.   If you are calling out to third-party APIs to load badges, Twitter counts or Skype buttons, consider each and its effect on webpage load times.

If you have access to the underlying HTML, your application should try to reduce the number of DOM objects that are created.  These are created as the HTML is parsed.  Simpler structure here means faster page load times.

2. Compress objects

Your page is full of objects.  Many of them, such as images, can be compressed.  This saves server storage space and makes for smaller objects to copy across the network to the end user's browser.  The web page test will show you a list of objects that are good candidates for compression.
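A rough illustration of how much a typical text asset shrinks under gzip (the style.css here is a generated stand-in, not a real file from your site):

```shell
# Generate a repetitive text asset, then compress a copy of it.
yes 'body { color: #333; font-family: sans-serif; }' | head -200 > style.css
gzip -9 -c style.css > style.css.gz   # compress to a copy, keep the original
wc -c style.css style.css.gz          # repetitive text compresses dramatically
```

Real CSS, JavaScript and HTML are similarly repetitive, which is why enabling compression is one of the cheapest wins available.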

3. Employ page & object caching

An object cache stores name/value pairs in memory for your application.  Essentially, the webserver tier caches data it reads from the database to avoid additional round trips and network overhead.   Most of us will want to get memcache installed in our webserver tier.
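The pattern is the same whatever the cache product: check the cache first, fall back to the database only on a miss, and store the result for next time. A sketch in JavaScript, with a Map standing in for memcache and queryDatabase as a hypothetical (expensive) database call:

```javascript
const cache = new Map();   // stand-in for memcache

function cachedQuery(sql, queryDatabase) {
  if (cache.has(sql)) return cache.get(sql);  // hit: no database round trip
  const rows = queryDatabase(sql);            // miss: pay the full cost once
  cache.set(sql, rows);                       // a real cache would also set a TTL
  return rows;
}
```

The second request for the same query never touches the database, which is exactly the round trip the object cache eliminates.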

For a page cache, take a look at Varnish.  This can also be installed by package manager.  A page cache is like a tiny, hyper-efficient webserver.  It can sit on the webserver itself or on its own server if you have many webservers in your environment.
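For a sense of what that looks like, here is a minimal sketch of a Varnish backend definition (an assumption for illustration: your webserver has been moved to port 8080 on the same machine, freeing port 80 for Varnish; details vary by Varnish version):

```vcl
backend default {
    .host = "127.0.0.1";    # the webserver Varnish sits in front of
    .port = "8080";         # Apache moved off port 80 to make room
}
```

Varnish then answers on port 80 and serves cached pages without touching Apache at all.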

Lastly if you have not already, grab the W3 Total Cache plugin for WordPress.  This plugin integrates directly with these two types of caches.  Simply click the Performance tab, select General Settings, and scroll to the object cache and varnish sections.

4. Browser caching

Browser caching is a tricky one.  You might think this is totally in the hands of the end user.  As it turns out, however, the objects your webserver sends back to clients specify caching information in their HTTP headers.

Since we're encouraging you to test things yourself, fire up your command line terminal, and use "curl" to take a look at the headers of a file on your webserver.  Here's an example:

$ curl -I http://www.iheavy.com/files/pros.gif
HTTP/1.1 200 OK
Date: Tue, 01 Nov 2011 04:58:55 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 24 Aug 2011 01:06:09 GMT
ETag: "1649b1fb-5562-4ab35eadfe240"
Accept-Ranges: bytes
Content-Length: 21858
Content-Type: image/gif

 

Now we'll go ahead and enable W3 Total Cache in WordPress, then rerun the same curl command:

 

$ curl -I http://www.iheavy.com/files/pros.gif
HTTP/1.1 200 OK
Date: Tue, 01 Nov 2011 05:01:27 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Wed, 24 Aug 2011 01:06:09 GMT
ETag: "5562-4ab35eadfe240"
Accept-Ranges: bytes
Content-Length: 21858
Cache-Control: max-age=31536000, public, must-revalidate, proxy-revalidate
Expires: Wed, 31 Oct 2012 05:01:27 GMT
Vary: User-Agent
Pragma: public
X-Powered-By: W3 Total Cache/0.9.2.3
Content-Type: image/gif

The main line we're interested in here is the Cache-Control line.  Notice that the object has a max-age of 31536000, which turns out to be the number of seconds in one year (matching the Expires header a year out).  This tells the browser to keep these objects in its own cache and not refetch them from the webserver each time.  That's a tremendous speedup.
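A quick sanity check of that max-age value:

```shell
echo $((60 * 60 * 24 * 365))   # seconds in a (non-leap) year: 31536000
```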

5. Employ a CDN

If you haven't heard of CDN before, it stands for content delivery network.  Dropbox, the file-sharing service, is essentially a CDN service with a handsome user interface built on top of it.  Akamai is another famous CDN solution, one of the most mature and respected available.  If you're on Amazon, you might look at CloudFront, their CDN solution.

What is a CDN exactly?  Consider your pressing need for some basic groceries.  Maybe you need juice, a sandwich, beer or cigarettes.   If you're walking around the city, you'll find a deli on every corner.  That's the quickest way to get those items, a short walk and a short line.  When you need bigger items, you might make a weekly trek to Whole Foods.  From the perspective of food and beverage manufacturers you certainly want to get your products into the delis.  They are your grocery distribution networks, if you will.

W3 Total Cache supports Amazon CloudFront: simply enter your access and secret keys, and save.  Then upload all of your content to S3 and you'll be benefiting from a CDN in no time.

Test again

Now that you've had a chance to look at all these caching opportunities, and hopefully implemented many of them in WordPress, rerun your Google Speed Test.


Google Page Test - Score of 91

We were able to raise our score from the mid-70's to 91 after tweaking various pieces of the site.  We haven't even implemented image sprites yet.

Hopefully this time around you'll see some objects loaded via CDN, more objects compressed, and proper expirations on most if not all content on your site.


CDN & caching are closer than you think

Most internet sites can make great gains in response time by looking in the right places.  Application and server rearchitecting give you a lot, but usually involve more complex and ongoing effort by developers.  However, the basic caching improvements we described above will make a big difference as well.  CDN and caching technology is a lot more within reach than you think it is.


Clutter


“The problem with this is there’s too much clutter.”


That’s what the legal secretary told me when we were studying her firm’s intranet home page. In fact, the page was pretty sparse in layout. The text was nicely laid out in a readable font, with different weights given to headings and body text. Overall, it was organized and readable. Cluttered just didn’t seem like the right word.


Yet, the legal secretary was quite firm on this. She wasn’t the only one. Half of the firm’s employees we interviewed used the word “clutter” to describe the page that looked anything but cluttered to us.


It might be tempting to rework this home page with more whitespace, more organization, more emphasis on the visual design. However, that wouldn’t have produced any better results.


Over the years, we’ve learned that users mean something different by “clutter” than designers do. It’s not the visual design the users are reacting to. It’s the actual content.


The law firm employees were telling us that the page didn’t have links and resources they needed. The page was full of stuff — mostly things the firm’s marketing group wanted everyone to know — but very little of what was on the page helped the employees do their jobs. Everything they needed was on the intranet, and they knew it, but the home page didn’t lead them to it.


The page was cluttered.


Clutter is what happens when we fill a page with things the user doesn’t care about. Replace the useless stuff with links, copy, and content the users really want, and the page suddenly becomes uncluttered.



Dictionary.com’s definition of Clutter is found on a page, ironically, filled with clutter.


That’s exactly what we did at the law firm. Our design team uncovered those resources the users needed and organized the page to have exactly what the users needed to do their jobs well.


Those users loved the new page. In our evaluations, nobody used the word clutter. They used words like useful, helpful, and awesome.


Here’s the best part: We put the old and new pages side-by-side. The new page definitely had more text, less whitespace, and denser information design. Yet, when we asked the users to tell us which one was more cluttered, they were unanimous: the old design was the cluttered design.


Are your users complaining about clutter? Maybe you should look at what they’re actually seeing.

Back to Basics: Daylight Savings Time bugs strike again with SetLastModified


CC BY-NC 2.0 Creative Commons Clock Photo via Flickr ©Thomas Hawk

No matter how well you know a topic, or a codebase, it's never too late (or early) to get nailed by a bug over half a decade old.

DasBlog, the ASP.NET 2 blog engine that powers this blog, is done. It's not dead, but it's done. It's very stable. We had some commits last year, and I committed a bug fix in February, but it's really well understood and very baked. My blog hasn't been down for traffic spike reasons in literally years as DasBlog scales nicely on a single machine.

It was 10:51pm PDT (that's Pacific Daylight Time) and I was writing a blog post about the clocks in my house, given that PST (that's Pacific Standard Time) was switching over soon. I wrote it up in Windows Live Writer, posted it to my blog, then hit Hanselman.com to check it out.

Bam. 404.

What? 404? Nonsense. Refresh.

404.

*heart in chest* Have I been hacked? What's going on? OK, to the logs!

l2    time    2011-11-06T05:36:31    code    1    message    Error:System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Parameter name: utcDate
at System.Web.HttpCachePolicy.UtcSetLastModified(DateTime utcDate)
at System.Web.HttpCachePolicy.SetLastModified(DateTime date)
at newtelligence.DasBlog.Web.Core.SiteUtilities.GetStatusNotModified(DateTime latest) in C:\dev\DasBlog\source\newtelligence.DasBlog.Web.Core\SiteUtilities.cs:line 1253
at newtelligence.DasBlog.Web.Core.SharedBasePage.NotModified(EntryCollection entryCollection) in C:\dev\DasBlog\source\newtelligence.DasBlog.Web.Core\SharedBasePage.cs:line 1182
at newtelligence.DasBlog.Web.Core.SharedBasePage.Page_Load(Object sender, EventArgs e) in C:\dev\DasBlog\source\newtelligence.DasBlog.Web.Core\SharedBasePage.cs:line 1213
at System.EventHandler.Invoke(Object sender, EventArgs e)
at System.Web.UI.Control.OnLoad(EventArgs e)
at System.Web.UI.Control.LoadRecursive()
at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) while processing http://www.hanselman.com/blog/default.aspx.


What's going on? Out of range? What's out of range? OK, my site is down for the first time in years. I must have messed it up with my clock post. I'll delete that. OK, delete. Whew.



Refresh.



404.



WHAT?!?



Logs, same error, and now the file is a meg and growing, as this message is happening hundreds of times a minute. OK, to the code!



UtcSetLastModified is used for setting cache-specific HTTP headers and for controlling the ASP.NET page output cache. It lets me tell HTTP that something hasn't been modified since a certain time. I've got a utility that figures out which post was the last modified or most recently had comments modified, then I tell the home page, then the browser, so everyone can decide if there is fresh content or not.



public DateTime GetLatestModifedEntryDateTime(IBlogDataService dataService, EntryCollection entries)
{
    // figure out whether to send a 304 Not Modified or not...
    return latest; // the last time anything interesting happened
}


In the BasePage we ask ourselves, can we avoid work and give a 304?



//Can we get away with an "if-not-modified" header?
if (SiteUtilities.GetStatusNotModified(SiteUtilities.GetLatestModifedEntryDateTime(dataService, entryCollection)))
{
//snip
}


However, note that I have to call SetLastModified. It seems that UtcSetLastModified is private. (Why?) When I call SetLastModified, it does this:



public void SetLastModified(DateTime date)
{
DateTime utcDate = DateTimeUtil.ConvertToUniversalTime(date);
this.UtcSetLastModified(utcDate);
}


Um, OK. Lame. So that means I have to work in local time. I retrieve dates and convert them with ToLocalTime().



At this point, you might say, "Oh, I get it, he's called ToLocalTime() too many times and double-converted his times." That's what I thought. However, after .NET 2 that isn't possible:




The value returned by the conversion is a DateTime whose Kind property always returns Local. Consequently, a valid result is returned even if ToLocalTime is applied repeatedly to the same DateTime.




But. We originally wrote DasBlog in .NET 1.1 and MOVED it to .NET 2 some years later. I suspect that I'm actually counting on some incorrect behavior deep in our own (Clemens Vasters' and mine) TimeZone and Data Access code, code that worked with that latent incorrect behavior (overconverting DateTimes to local time), and now that's not happening. And it hasn't been happening for four years.



Hopefully you can see where this is going.



It seems a comment came in around 5:36am GMT, or 10:36pm PDT, which is 1:36am EST. That became the new Last Modified Date. At some point an hour was added in conversion, as PDT wasn't PST yet but EDT was already EST.



Your brain exploded yet? Hate Daylight Saving Time? Ya, me too.



Anyway, that DateTime became 2:36am EST rather than 1:36am. Problem is, 2:36am EST is/was the future as 6:46 GMT hadn't happened yet.



A sloppy five-year-old bug that happens for one hour each year, that was likely always there but counted on ten-year-old framework code that was fixed seven years ago. Got unit tests for DST? I don't.



My server is in the future, but actually not as far in the future as it usually is. My server is on the East Coast and it was 1:51am. However, the reason my posts sometimes look like they're from the future is that I store everything in the neutral UTC/GMT zone, so it was 5:51am the next day on my file system.



Moral of the story?



I need to confirm that my server is on GMT time and that none of my storage code is affected by Daylight Saving Time.



Phrased differently, don't use DateTime.Now for ANY date calculations or to store anything. Use DateTime.UtcNow, and be aware that some methods will freak out if you send them future dates, as they should. Avoid doing ANYTHING in local time until the last second, when you show the DateTime to the user.
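The same rule sketched in JavaScript (DasBlog itself is C#; the names here are illustrative): keep stored values and comparisons in UTC, and go local only at display time.

```javascript
// Store a timezone-free instant (ISO 8601 in UTC), never a local string.
const storedUtc = new Date('2011-11-06T05:36:31Z').toISOString();

// Compare instants numerically, still in UTC.
const isFuture = Date.parse(storedUtc) > Date.now();

// Convert to local time only at the last second, for display.
const forDisplay = new Date(storedUtc).toString();
```

Because the stored value carries no timezone, it can never gain or lose an hour in a DST transition; only the display string varies by locale.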



In my case, in the nine minutes it took to debug this, it resolved itself. The future became the present, and the future last-modified DateTime became valid. Is there a bug? There sure is; at least, there is for an hour, once a year. Now the real question is: do I fix it and possibly break something that works the other 8,759 hours in a year? Hm, that IS still four 9's of uptime. (Ya, I know I need to fix it.)




"My code has no bugs, it runs exactly as it was written." - Some famous programmer




Until next year. ;)



© 2011 Scott Hanselman. All rights reserved.





Stuff The Internet Says On Scalability For November 4, 2011


You're in good hands with HighScalability



  • Netflix - Cassandra, AWS, 288 instances, 3.3 million writes per second.

  • Quotable quotes:

    • @bretlowery : "A #DBA walks into a #NoSQL bar, but turns and leaves because he couldn't find a table."

    • @AdanVali : HP to Deploy Memristor Powered SSD Replacement Within 18 Months

    • @eden : Ori Lahav: "When planning scalability, think x100, design x5 and deploy x1.5 of current traffic"

    • @jkalucki : If you are IO bound, start with your checkbook!



  • Everything I Ever Learned About JVM Performance Tuning @Twitter. Learn how to tune your Hotspot and other Javasutra secrets.

  • By moving off the cloud, Mixpanel may have lost their angel status. Why would they do such a thing? Read Why We Moved Off The Cloud for the details. The reason for the fall:  highly variable performance. Highly variable performance is incredibly hard to code or design around (think of a server that normally does 300 queries per second with low I/O wait suddenly dropping to 50 queries per second at 100% disk utilization for literally hours). It’s solvable, certainly, but with lots of time and money, and it’s hard to justify the cost when there’s a better alternative available. On reddit. On Hacker News. Is that a bell I hear?



Want more of what the Internet has to say on scalability? Then click below before it's too late...

Friday, November 4, 2011

Finding the Right Data Solution for Your Application in the Data Storage Haystack




The InfoQ article Finding the Right Data Solution for Your Application in the Data Storage Haystack makes a series of concrete recommendations for a user who wants to find the right storage solution for his application.  


A few years back, SQL RDBMSs were the solution for almost all storage needs, but we all know how scaling came along and shattered that perfect dream. Then NoSQL happened, and now we have ended up with a haystack of solutions. For example, local memory, relational databases, files, distributed caches, column-family storage, document storage, name/value pairs, graph DBs, service registries, queues, and tuple spaces are some classes of such solutions.


We discuss how to find the right storage solution, and we make these choices often when we design. But when it comes to describing how to select the right one, we often end up giving only very high-level guidelines. The article argues that the way to make more concrete recommendations is to drill down into a bit more detail and consider the options case by case.


To that end, the article takes four parameters describing an application or use case (scale, consistency, type of data, and queries needed), walks through some 40+ cases arising from different combinations of those parameters' values, and makes one or more concrete recommendations on the right storage solution for each case.


What follows are the four parameters, the potential values they can take, and the recommendations for structured, semi-structured, and unstructured data: 

Introduction To URL Rewriting


Many Web companies spend hours and hours agonizing over the best domain names for their clients. They try to find a domain name that is relevant and appropriate, sounds professional yet is distinctive, is easy to spell and remember and read over the phone, looks good on business cards and is available as a dot-com.


Or else they spend thousands of dollars to purchase the one they really want, which just happened to be registered by a forward-thinking and hard-to-find squatter in 1998.


They go through all that trouble with the domain name but neglect the rest of the URL, the part after the domain name. It, too, should be relevant, appropriate, professional, memorable, easy to spell and readable, and for the same reasons: to attract customers and improve search rankings.


Fortunately, there is a technique called URL rewriting that can turn unsightly URLs into nice ones — with a lot less agony and expense than picking a good domain name. It enables you to fill out your URLs with friendly, readable keywords without affecting the underlying structure of your pages.


This article covers the following:



  1. What is URL rewriting?

  2. How can URL rewriting help your search rankings?

  3. Examples of URL rewriting, including regular expressions, flags and conditionals;

  4. URL rewriting in the wild, such as on Wikipedia, WordPress and shopping websites;

  5. Creating friendly URLs;

  6. Changing page names and URLs;

  7. Checklist and troubleshooting.


What Is URL Rewriting?


If you were writing a letter to your bank, you would probably open your word processor and create a file named something like lettertobank.doc. The file might sit in your Documents directory, with a full path like C:\Windows\users\julie\Documents\lettertobank.doc. One file path = one document.


Similarly, if you were creating a banking website, you might create a page named page1.html, upload it, and then point your browser to http://www.mybanksite.com/page1.html. One URL = one resource. In this case, the resource is a physical Web page, but it could be a page or product drawn from a CMS.


URL rewriting changes all that. It allows you to completely separate the URL from the resource. With URL rewriting, you could have http://www.mybanksite.com/aboutus.html taking the user to …/page1.html or to …/about-us/ or to …/about-this-website-and-me/ or to …/youll-never-find-out-about-me-hahaha-Xy2834/. Or to all of these. It’s a bit like shortcuts or symbolic links on your hard drive. One URL = one way to find a resource.


With URL rewriting, the URL and the resource that it leads to can be completely independent of each other. In practice, they’re usually not wholly independent: the URL usually contains some code or number or name that enables the CMS to look up the resource. But in theory, this is what URL rewriting provides: a complete separation.


How Does URL Rewriting Help?


Can you guess what this Web page sells?


http://www.diy.com/diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676


B&Q went to all the trouble and expense of acquiring diy.com and implementing a stock-controlled e-commerce website, but left its URLs indecipherable. If you guessed “brown guttering,” you might want to consider playing the lottery.


Even when you search directly for this “miniflow gutter brown” on Google UK, B&Q’s page comes up only seventh in the organic search results, below much smaller companies, such as a building supplier with a single outlet in Stirlingshire. B&Q has 300+ branches and so is probably much bigger in budget, size and exposure, so why is it not doing as well for this search term? Perhaps because the other search results have URLs like http://www.prof…co.uk/products/brown-miniflo-gutter-148/; that is, the URL itself contains the words in the search term.




Almost all of these results on Google have the search term in their URLs (highlighted in green). The one at the bottom does not.


Looking at the URL from B&Q, you would (probably correctly) assume that a file named nav.jsp within the directory /diy/jsp/bq/ is used to display products when given their ID number, 11577676 in this case. That is the resource intimately tied to this URL.


So, how would B&Q go about turning this into something more recognizable, like http://www.diy.com/products/miniflow-gutter-brown/11577676, without restructuring its whole website? The answer is URL rewriting.


Another way to look at URL rewriting is as a thin layer that sits on top of a website, translating human- and search-engine-friendly URLs into the actual ones. Implementing it is easy because it requires hardly any changes to the website’s underlying structure — no moving files around or renaming things.


URL rewriting basically tells the Web server that

/products/miniflow-gutter-brown/11577676 should show the Web page at: /diy/jsp/bq/nav.jsp?action=detail&fh_secondid=11577676,

without the customer or search engine knowing about it.


Many factors (or “signals”), of course, determine the search ranking for a particular term, over 200 of them according to Google. But friendly and readable URLs are consistently ranked as one of the most important of those factors. They also help humans to quickly figure out what a page is about.


The next section describes how this is done.


How To Rewrite URLs


Whether you can implement URL rewriting on a website depends on the Web server. Apache usually comes with the URL rewriting module, mod_rewrite, already installed. The set-up is very common and is the basis for all of the examples in this article. ISAPI Rewrite is a similar module for Windows IIS but requires payment (about $100 US) and installation.


The Simplest Case


The simplest case of URL rewriting is to rename a single static Web page, and this is far easier than the B&Q example above. To use Apache’s URL rewriting function, you will need to create or edit the .htaccess file in your website’s document root (or, less commonly, in a subdirectory).


For instance, if you have a Web page about horses named Xu8JuefAtua.htm, you could add these lines to .htaccess:


RewriteEngine On
RewriteRule   horses.htm   Xu8JuefAtua.htm

Now, if you visit http://www.mywebsite.com/horses.htm, you’ll actually be shown the Web page Xu8JuefAtua.htm. Furthermore, your browser will remain at horses.htm, so visitors and search engines will never know that you originally gave the page such a cryptic name.


Introducing Regular Expressions


In URL rewriting, you need only match the path of the URL, not including the domain name or the first slash. The rule above essentially tells Apache that if the path contains horses.htm, then show the Web page Xu8JuefAtua.htm. This is slightly problematic, because you could also visit http://www.mywebsite.com/reallyfasthorses.html, and it would still work. So, what we really need is this:


RewriteEngine On
RewriteRule   ^horses.htm$   Xu8JuefAtua.htm

The ^horses.htm$ is not just a search string, but a regular expression, in which special characters — such as ^ . + * ? ( ) [ ] { } and $ — have extra significance. The ^ matches the beginning of the URL’s path, and the $ matches the end. This says that the path must begin and end with horses.htm. So, only horses.htm will work, and not reallyfasthorses.htm or horses.html. This is important for search engines like Google, which can penalize what they view as duplicate content — identical pages that can be reached via multiple URLs.
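Strictly speaking, the dot is itself one of those special characters: it matches any single character, so a path like horsesXhtm would also be matched. A slightly stricter version of the rule escapes the dot with a backslash so that it matches only a literal dot:


RewriteEngine On
RewriteRule   ^horses\.htm$   Xu8JuefAtua.htm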


Without File Endings


You can make this even better by ditching the file ending altogether, so that you can visit either http://www.mywebsite.com/horses or http://www.mywebsite.com/horses/:


RewriteEngine On
RewriteRule   ^horses/?$   Xu8JuefAtua.htm  [NC]

The ? indicates that the preceding character is optional. So, in this case, the URL would work with or without the slash at the end. These would not be considered duplicate URLs by a search engine, but would help prevent confusion if people (or link checkers) accidentally added a slash. The stuff in brackets at the end of the rule gives Apache some further pointers. [NC] is a flag that means that the rule is case insensitive, so http://www.mywebsite.com/HoRsEs would also work.


Wikipedia Example


We can now look at a real-world example. Wikipedia appears to use URL rewriting, passing the title of the page to a PHP file. For instance…


http://en.wikipedia.org/wiki/Barack_obama



… is rewritten to:


http://en.wikipedia.org/w/index.php?title=Barack_obama


This could well be implemented with an .htaccess file, like so:


RewriteEngine On
#Look for the word "wiki" followed by a slash, and then the article title
RewriteRule   ^wiki/(.+)$   w/index.php?title=$1   [L]

The previous rule had /?, which meant zero or one slash. If it had said /+, it would have meant one or more slashes, so even http://www.mywebsite.com/horses//// would have worked. In this rule, the dot (.) matches any character, so .+ matches one or more of any character — that is, essentially anything. And the parentheses — ( ) — ask Apache to remember what the .+ matched. What was remembered can then be used in the rewritten URL as $1. So, when the rewriting is finished, wiki/Barack_obama becomes w/index.php?title=Barack_obama.


Thus, the page w/index.php is called, passing Barack_obama as a parameter. The w/index.php is probably a PHP page that runs a database lookup — like SELECT * FROM articles WHERE title='Barack obama' — and then outputs the HTML.


screenshot


You can also view Wikipedia entries directly, without the URL rewriting.


Comments and Flags


The example above also introduced comments. Anything after a # is ignored by Apache, so it’s a good idea to explain your rewriting rules so that future generations can understand them. The [L] flag means that if this rule matches, Apache can stop now. Otherwise, Apache would continue applying subsequent rules, which is a powerful feature but unnecessary for all but the most complex rule sets.


Implementing the B&Q Example


The recommendation for B&Q above could be implemented with an .htaccess file, like so:


RewriteEngine On
#Look for the word "products" followed by slash, product title, slash, id number
RewriteRule  ^products/.*/([0-9]+)$   diy/jsp/bq/nav.jsp?action=detail&fh_secondid=$1 [NC,L]

Here, the .* matches zero or more of any character, so nothing or anything. And the [0-9] matches a single numerical digit, so [0-9]+ matches one or more numbers.


The next section covers a couple of more complex conditional examples. You can also read the Apache rewriting guide for much more information on all that URL rewriting has to offer.


Conditional Rewriting


URL rewriting can also include conditions and make use of environment variables. These two features make for an easy way to redirect requests from one domain alias to another. This is especially useful if a website changes its domain, from mywebsite.co.uk to mywebsite.com for example.


Domain Forwarding


Most domain registrars allow for domain forwarding, which redirects all requests from one domain to another domain, but which might send requests for www.mywebsite.co.uk/horses to the home page at www.mywebsite.com and not to www.mywebsite.com/horses. You can achieve this with URL rewriting instead:


RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.mywebsite\.com$ [NC]
RewriteRule   (.*)         http://www.mywebsite.com/$1  [L,R=301]

The second line in this example is a RewriteCond, rather than a RewriteRule. It is used to compare an Apache environment variable on the left (such as the host name in this case) with a regular expression on the right. Only if this condition is true will the rule on the next line be considered.


In this case, %{HTTP_HOST} represents www.mywebsite.co.uk, the host (i.e. domain) that the browser is trying to visit. The ! means “not.” This tells Apache, if the host does not begin and end with www.mywebsite.com, then remember and rewrite zero or more of any character to www.mywebsite.com/$1. This converts www.mywebsite.co.uk/anything-at-all to www.mywebsite.com/anything-at-all. And it will work for all other aliases as well, like www.mywebsite.biz/anything-at-all and mywebsite.com/anything-at-all.


The flag [R=301] is very important. It tells Apache to do a 301 (i.e. permanent) redirect. Apache will send the new URL back to the browser or search engine, and the browser or search engine will have to request it again. Unlike all of the examples above, the new URL will now appear in the browser’s location bar. And search engines will take note of the new URL and update their databases. [R] by itself is the same as [R=302] and signifies a temporary redirect.
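A common variation of the same idea forces the www. prefix for a single known domain. This sketch escapes the dots so that they match literally:


RewriteEngine On
#Permanently redirect bare mywebsite.com to www.mywebsite.com, keeping the path
RewriteCond %{HTTP_HOST} ^mywebsite\.com$ [NC]
RewriteRule   (.*)   http://www.mywebsite.com/$1   [L,R=301]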


File Existence and WordPress


Smashing Magazine runs on the popular blogging software WordPress. WordPress enables the author to choose their own URL, called a “slug.” Then, it automatically prepends the date, such as http://coding.smashingmagazine.com/2011/09/05/getting-started-with-the-paypal-api/. In your pre-URL rewriting days, you might have assumed that Smashing Magazine’s Web server was actually serving up a file located at …/2011/09/05/getting-started-with-the-paypal-api/index.html. In fact, WordPress uses URL rewriting extensively.


screenshot


WordPress enables the author to choose their own URL for an article.


WordPress’ .htaccess file looks like this:


RewriteEngine On
RewriteBase /  
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

The -f means “this is a file” and -d means “this is a directory.” This tells Apache, if the requested file name is not a file, and the requested file name is not a directory, then rewrite everything (i.e. any path containing any character) to the page index.php. If you are requesting an existing image or the log-in page wp-login.php, then the rule is not triggered. But if you request anything else, like /2011/09/05/getting-started-with-the-paypal-api/, then the file index.php jumps into action.


Internally, index.php (probably) looks at the PHP server variable $_SERVER['REQUEST_URI'] and extracts the information that it needs to find out what it is looking for. This gives it even more flexibility than Apache’s rewrite rules and enables WordPress to mimic some very sophisticated URL rewriting rules. In fact, when administering a WordPress blog, you can go to Settings → Permalinks on the left side and choose the type of URL rewriting that you would like to mimic.


screenshot


WordPress’ permalink settings, letting you choose the type of URL rewriting that you would like to mimic.


Rewriting Query Strings


If you are hired to recreate an existing website from scratch, you might use URL rewriting to redirect the 20 most popular URLs on the old website to the locations on the new website. This could involve redirecting things like prod.php?id=20 to products/great-product/2342, which itself gets redirected to the actual product page.


Apache’s RewriteRule applies only to the path in the URL, not to parameters like id=20. To do this type of rewriting, you will need to refer to the Apache environment variable %{QUERY_STRING}. This can be accomplished like so:


RewriteEngine On
RewriteCond %{QUERY_STRING} ^id=20$
RewriteRule   ^prod\.php$   /products/great-product/2342?   [L,R=301]
RewriteRule   ^products/(.*)/([0-9]+)$   productview.php?id=$2   [L]


In this example, the first RewriteRule triggers a permanent redirect from the old website’s URL to the new website’s URL. The trailing ? stops Apache from appending the original id=20 query string to the redirected URL. The second rule rewrites the new URL to the actual PHP page that displays the product; because the ID number is captured by the second pair of parentheses, it is referenced as $2.
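If many old product IDs need redirecting, writing one RewriteCond per ID quickly becomes unmanageable. A more general sketch (the /products/item/ URL scheme here is hypothetical) captures the ID inside the RewriteCond’s parentheses, which can then be referenced in the rule as %1:


RewriteEngine On
#%1 refers back to the parentheses in the preceding RewriteCond
RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
RewriteRule   ^prod\.php$   /products/item/%1?   [L,R=301]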


Examples Of URL Rewriting On Shopping Websites


For complex content-managed websites, there is still the issue of how to map friendly URLs to underlying resources. The simple examples above did that mapping by hand, manually associating a URL like horses.htm with the file or resource Xu8JuefAtua.htm. Wikipedia looks up the resource based on the title, and WordPress applies some complex internal rule sets. But what if your data is more complex, with thousands of products in hundreds of categories? This section shows the approach that Amazon and many other shopping websites take.


If you’ve ever come across a URL like this on Amazon, http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B00008AJL3, you might have assumed that Amazon’s website has a subdirectory named /High-Voltage-AC-DC/dp/ that contains a file named B00008AJL3.


This is very unlikely. You could try changing the name of the top-level “directory” and you would still arrive on the same page, http://www.amazon.co.uk/Test-Voltage-AC-DC/dp/B00008AJL3.


The bit at the end is what really matters. Looking down the page, you’ll see that B00008AJL3 is this AC/DC album’s ASIN (Amazon Standard Identification Number). If you change that, you’ll get a “Page not found” or an entirely different product: http://www.amazon.co.uk/High-Voltage-AC-DC/dp/B003BEZ7HI.


The /dp/ also matters. Changing this leads to a “Page not found.” So, the B00008AJL3 probably tells Amazon what to display, and the dp tells the website how to display it. This is URL rewriting in action, with the original URL possibly ending up getting rewritten to something like:

http://www.amazon.co.uk/displayproduct.php?asin=B00008AJL3.


Features of an Amazon URL


This introduces some important features of Amazon’s URLs that can be applied to any website with a complex set of resources. It shows that the URL can be automatically generated and can include up to three parts:



  1. The words. In this case, the words are based on the album and artist, and all non-alphanumeric characters are replaced. So, the slash in AC/DC becomes a hyphen. This is the bit that helps humans and search engines.

  2. An ID number. Or something that tells the website what to look up, such as B00008AJL3.

  3. An identifier. Or something that tells the website where to look for it and how to display it. If dp tells Amazon to look for a product, then somewhere along the line, it probably triggers a database statement such as SELECT * FROM products WHERE id='B00008AJL3'.


Other Shopping Examples


Many other shopping websites have URLs like this. In the list below, the ID number and (suspected) identifier are in bold:



  • http://www.ebay.co.uk/itm/Ian-Rankin-Set-Darkness-Rebus-Novel-/140604842997

  • http://www.kelkoo.com/c-138201-lighting/brand/caravan

  • http://www.ciao.co.uk/Fridge_Freezers_5266430_3

  • http://www.gumtree.com/p/for-sale/boys-bmx-bronx-blaze/97669042

  • http://www.comet.co.uk/c/Televisions/LCD-Plasma-LED-TVs/1844


A significant benefit of this type of URL is that the actual words can be changed, as shown below. As long as the ID number stays the same, the URL will still work. So products can be renamed without breaking old links. More sophisticated websites (like Ciao above) will redirect the changed URL back to the real one and thus avoid creating the appearance of duplicate content (see below for more on this topic).


screenshot


Websites that use URL rewriting are more flexible with their URLs — the words can change but the page will still be found.


Friendly URLs


Now you know how to map nice friendly URLs to their underlying Web pages, but how should you create those friendly URLs in the first place?


If we followed the current advice, we would separate words with hyphens rather than underscores and capitalize consistently. Lowercase might be preferable because most people search in lowercase. Punctuation such as dots and commas should also be turned into hyphens, otherwise they would get turned into things like %2C, which look ugly and might break the URL when copied and pasted. You might want to remove apostrophes and parentheses entirely for the same reason.


Whether to replace accented characters is debatable. URLs with accents (or any non-Roman characters) might look bad or break when rendered in a different character set. But replacing them with their non-accented equivalents might make the URLs harder for search engines to find (and even harder if they are replaced with hyphens). If your website is for a predominantly French audience, then perhaps leave the French accents in. But substitute them if the French words are few and far between on a mainly English website.


This PHP function succinctly handles all of the above suggestions:


function GenerateUrl ($s) {
//Convert accented characters, and remove parentheses and apostrophes
$from = explode (',', "ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u,(,),[,],'");
$to = explode (',', 'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u,,,,,,');
//Do the replacements, and convert all other non-alphanumeric characters to spaces
$s = preg_replace ('~[^\w\d]+~', '-', str_replace ($from, $to, trim ($s)));
//Remove a - at the beginning or end and make lowercase
return strtolower (preg_replace ('/^-/', '', preg_replace ('/-$/', '', $s)));
}

This would generate URLs like this:


echo GenerateUrl ("Pâtisserie (Always FRESH!)"); //returns "patisserie-always-fresh"

Or, if you wanted a link to a $product variable to be pulled from a database:


$product = array ('title'=>'Great product', 'id'=>100);
echo '<a href="' . GenerateUrl ($product['title']) . '/' . $product['id'] . '">';
echo $product['title'] . '</a>';

Changing Page Names


Search engines generally ignore duplicate content (i.e. multiple pages with the same information). But if they think they are being manipulated, search engines will actively penalize the website, so avoid this where possible. Google recommends using 301 redirects to send users from old pages to new ones.
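The same principle can be applied directly in .htaccess for simple cases. If you prefer a single canonical form of the horses URL from earlier, a sketch like this collapses the trailing-slash variant (a separate rule would still map /horses to the real page):


RewriteEngine On
#Permanently redirect /horses/ to the canonical /horses
RewriteRule   ^horses/$   /horses   [L,R=301]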


When a URL-rewritten page is renamed, the old URL and new URL should both still work. Furthermore, to avoid any risk of duplication, the old URL should automatically redirect to the new one, as WordPress does.


Doing this in PHP is relatively easy. The following function looks at the current URL, and if it’s not the same as the desired URL, it redirects the user:


function CheckUrl ($s) {
// Get the current URL without the query string, with the initial slash
$myurl = preg_replace ('/\?.*$/', '', $_SERVER['REQUEST_URI']);
//If it is not the same as the desired URL, then redirect
if ($myurl != "/$s") {header ("Location: /$s", true, 301); exit;}
}

This would be used like so:


$producturl = GenerateUrl ($product['title']) . '/' . $product['id'];
CheckUrl ($producturl); //redirects the user if they are at the wrong place

If you would like to use this function, be sure to test it in your environment first and with your rewrite rules, to make sure that it does not cause any infinite redirects. This is what that would look like:


screenshot


This is what happens when Google Chrome visits a page that redirects to itself.


Checklist And Troubleshooting


Use the following checklist to implement URL rewriting.


1. Check That It’s Supported


Not all Web servers support URL rewriting. If you put up your .htaccess file on one that doesn’t, it will be ignored or will throw up a “500 Internal Server Error.”
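A quick, informal check on an Apache server: upload an .htaccess file containing only the line below. If the website still loads normally, mod_rewrite is at least enabled; if you get a 500 error, it probably is not.


RewriteEngine On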


2. Plan Your Approach


Figure out what will get mapped to what, and how the correct information will still get found. Perhaps you want to introduce new URLs, like my-great-product/p/123, to replace your current product URLs, like product.php?id=123, and to substitute new-category/c/12 for category.php?id=12.


3. Create Your Rewrite Rules


Create an .htaccess file for your new rules. You can initially do this in a /testing/ subdirectory, using the [R] flag so that you can see where things go:


RewriteEngine On
RewriteRule ^.+/p/([0-9]+) product.php?id=$1 [NC,L,R]
RewriteRule ^.+/c/([0-9]+) category.php?id=$1 [NC,L,R]

Now, if you visit www.mywebsite.com/testing/my-great-product/p/123, you should be sent to www.mywebsite.com/testing/product.php?id=123. You’ll get a “Page not found” because product.php is not in your /testing/ subdirectory, but at least you’ll know that your rules work. Once you’re satisfied, move the .htaccess file to your document root and remove the [R] flag. Now www.mywebsite.com/my-great-product/p/123 should work.


4. Check Your Pages


Test that your new URLs bring in all the correct images, CSS and JavaScript files. For example, the Web browser now believes that your Web page is named 123 in a directory named my-great-product/p/. If the HTML refers to a file named images/logo.jpg, then the Web browser would request the image from www.mywebsite.com/my-great-product/p/images/logo.jpg and would come up with a “File not found.”


You would also need to rewrite the image locations or make the references absolute (like <img src="/images/logo.jpg"/>) or put a base href at the top of the <head> of the page (<base href="/product.php"/>). But if you do that, you would need to fully specify any internal links that begin with # or ? because they would now go to something like product.php#details.
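Adopting WordPress-style guard conditions is also a sensible safety net: they stop the rewrite rules from hijacking requests for files that really do exist on disk. A sketch combining that with the product rule from step 3:


RewriteEngine On
#Leave requests for existing files (images, CSS, JavaScript) untouched
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule   ^.+/p/([0-9]+)   product.php?id=$1   [NC,L]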


5. Change Your URLs


Now find all references to your old URLs, and replace them with your new URLs, using a function such as GenerateUrl to consistently create the new URLs. This is the only step that might require looking deep into the underlying code of your website.


6. Automatically Redirect Your Old URLs


Now that the URL rewriting is in place, you probably want Google to forget about your old URLs and start using the new ones. That is, when a search result brings up product.php?id=123, you’d want the user to be visibly redirected to my-great-product/p/123, which would then be internally redirected back to product.php?id=123.


This is the reverse of what your URL rewriting already does. In fact, you could add another rule to .htaccess to achieve this, but if you get the rules in the wrong order, then the browser would go into a redirect loop.
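One way to get both rules into .htaccess without a loop is to test the %{THE_REQUEST} variable, which holds the original HTTP request line and is not changed by internal rewriting. This is only a sketch, assuming the URL scheme from step 2; because .htaccess cannot look up the product’s words, the redirect uses a placeholder slug (the rule from step 3 ignores the words anyway):


RewriteEngine On
#Redirect only genuine external requests for the old URL; after an internal
#rewrite, THE_REQUEST still shows the original request, with no id= parameter
RewriteCond %{THE_REQUEST} \?id=([0-9]+)
RewriteRule   ^product\.php$   /my-great-product/p/%1?   [L,R=301]
#Internally map the friendly URL back to the real script
RewriteRule   ^.+/p/([0-9]+)   product.php?id=$1   [NC,L]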


Another approach is to do the first redirect in PHP, using something like the CheckUrl function above. This has the added advantage that if you rename the product, the old URL will immediately become invalid and redirect to the newest one.


7. Update and Resubmit Your Site Map


Make sure to carry through your new URLs to your site map, your product feeds and everywhere else they appear.


Conclusion


URL rewriting is a relatively quick and easy way to improve your website’s appeal to customers and search engines. We’ve walked through some real examples of URL rewriting and provided the technical details for implementing it on your own website. Please leave any comments or suggestions below.






© Paul Tero for Smashing Magazine, 2011.