Boundary: What is Facebook’s secret sauce for managing what’s got to be the biggest Big Data project, if you will, on the Web?
Hoff: From several presentations we’ve learned what Facebook insiders like Aditya Agarwal and Robert Johnson, both former Directors of Engineering, consider their secret sauce:
- Scaling Takes Iteration. Solutions often work in the beginning, but you’ll have to modify them as you go. PHP, for example, is simple to use at first, but is not a good choice when you have tens of thousands of web servers.
- Scaling Takes Iteration. You can say that again.
- Don’t Over-Design. Just use what you need as you scale your system out. Figure out where you need to iterate on a solution, optimize something, or completely build a part of the stack yourself.
- Choose the Right Tool for the Job. Realize that any choice comes with overhead. If you really need to use Python, then go ahead, and we’ll try to help you succeed. Yet with that choice there is overhead, usually across deployment, monitoring, ops, and so on.
- Get the Culture Right. Build an environment internally which promotes building the right thing first and fixing as needed. Stop worrying about innovating and about breaking things; think big, and think about the next thing you need to build after building the first thing. Isolate the part of the culture that you value and want to preserve. It doesn’t happen automatically.
- Move Fast. Get to market first. It’s OK if you break things. For example, Facebook runs their entire Web tier on HipHop, which was developed by three people. This is a risky strategy. It brings the site down regularly (out of memory, infinite loops), but there’s a big potential payoff as they figure out how to make it work.
- Empower Small Teams. Small teams can do great things. Facebook Search, photos, chat and HipHop were all the result of small teams. Get the right set of people, empower them and let them work.
- People Matter Most. It’s people who build and run systems. The best tools for scaling are engineering and operations teams that can handle anything.
- Scale Horizontally. Handling exponentially growing traffic requires spreading load arbitrarily across many machines.
- Measure Everything. Production is where the really useful data comes from. Measure both system and application level statistics to know what’s happening.
- Give Teams Control and Responsibility. Responsibility requires control. If a team is responsible for something, they must also control it.
All these principles work together to make a self-reinforcing virtuous circle. You can’t move fast unless you have small teams who have control and responsibility. You can’t know how your changes are working unless you get those changes into production and measure results. You can’t move code into production unless people feel responsible for moving out working code. You can’t handle the scale unless you figure out how to scale horizontally, move fast and measure everything, and that all comes down to good people.
But the above is not the whole of the story. Not so obvious is the role of opportunity. A pattern we often see is that companies on the leading edge see problems before everyone else, so they solve those problems before everyone else. We see a blast wave of innovation coming from technological hotspots like Google, Netflix, Twitter and Facebook.
Boundary: What other major websites do you think are doing a great job of scaling with demand, keeping users happy and response times low?
Hoff: We have a great industry. People are constantly willing to share their experiences, share their code and talk about what works. My wife is a tax accountant, and they definitely don’t have the same vibe, which is a little sad. There are a lot of unbelievably smart and passionate people in this field, and overall quality only rises the more people talk about how to build great stuff.
It’s also pretty obvious to me that having a quality site and willingness to share are linked. There are many companies I could list that fall into this category, but these stand out: Twitter, Etsy, Facebook, Google, Netflix, Amazon and StackExchange. Some other important contributors include: Airbnb, Tumblr, Instagram, TripAdvisor, Heroku, Prismatic, 37signals, Pinterest and Yahoo.
There are literally hundreds of others that could be mentioned, but these companies have continually and enthusiastically contributed to advancing the state of the art in Web performance. I feel bad already, however, because I know I’m missing some.