Critical Lessons Learned at Facebook on Scalability and Reliability
According to an article published half a year ago, Facebook.com is now the biggest web site, surpassing even Google in web traffic. Given that scale, the lessons Facebook has learned are valuable to anyone building scalable IT infrastructure. This post is based on my notes from the talk by Robert Johnson and Sanjeev Kumar at the LISA 2010 conference. Any mistakes are mine alone.
According to the speakers, the architecture of Facebook.com is relatively simple: web servers in the front, databases at the back, and in the middle a caching layer made up of many memcached servers. As I mentioned in a previous post, they use PHP extensively.
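The web-server/memcached/database split described above follows the classic read-through caching pattern: check the cache first, and only go to the database on a miss. Below is a minimal sketch in Python; the `Cache` class and `query_database` function are stand-ins I invented for illustration, not Facebook's actual code, and a real deployment would talk to memcached through a client library rather than an in-process dict.

```python
# Read-through caching sketch: cache in front of the database tier.
# All names here are illustrative, not Facebook's real code.

class Cache:
    """In-process stand-in for a memcached client."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

def query_database(user_id):
    # Stand-in for a real SQL query against the database tier.
    return {"id": user_id, "name": f"user-{user_id}"}

cache = Cache()

def get_user(user_id):
    """Check the cache first; on a miss, read the DB and populate the cache."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = query_database(user_id)  # cache miss: hit the database
        cache.set(key, user)
    return user
```

On the second request for the same user, `get_user` is served entirely from the cache, which is what keeps the database tier from being overwhelmed by read traffic.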
Unlike sites such as email services, whose users can be cleanly mapped and isolated onto different servers, a social site like Facebook faces a unique challenge: its users are all linked together. An error in one part of the system can easily cascade and bring down the whole site.
Here are several important lessons Facebook learned while building software and operating the site:
- Only change one thing at a time.
A small change can cause a big problem. If you change only one thing at a time, a problem that follows can be traced directly back to the change that caused it. To make frequent small changes practical, you need supporting infrastructure such as a configuration management system.
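As a toy illustration of that supporting infrastructure, here is a sketch of a configuration store that records every change individually, so a regression can be traced to the exact change that introduced it. All names (`ConfigStore`, `cache_ttl`) are hypothetical, not a real configuration management system.

```python
# Sketch: record one configuration change at a time so problems
# can be traced back to the change that caused them.
import time

class ConfigStore:
    def __init__(self, initial):
        self.current = dict(initial)
        self.history = []  # list of (timestamp, key, old_value, new_value)

    def change(self, key, value):
        """Apply a single change and log it."""
        old = self.current.get(key)
        self.history.append((time.time(), key, old, value))
        self.current[key] = value

    def last_change(self):
        """The first thing to inspect when something breaks."""
        return self.history[-1] if self.history else None

cfg = ConfigStore({"cache_ttl": 60})
cfg.change("cache_ttl", 300)  # one change, one log entry
```

When the site misbehaves after a deploy, `last_change()` points at a single suspect instead of a batch of tangled changes.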
- No single points of failure
  - Redundant hardware at every level, from load balancers to network links
  - Software can be a single point of failure as well
- Don’t make small problems big
  - Don’t push problems upstream
  - Be wary of “smart” failover logic, which can be dangerous
  - Shed load when in trouble
  - Reduce system startup time
- Measure everything
  - Data is key: activity, performance stats, failure rates
  - Analysis reveals problems you didn’t know you had
- Always do a post-mortem
  - Weekly reviews
  - Focus on issues, not blame
  - Identify technical and organizational issues and follow-up actions
- Culture is important
  - Release often
  - Control and responsibility go together
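The "no single point of failure" lesson above can be sketched as a client that fails over across redundant backends, raising an error only when every replica is down. Everything here (function and exception names, the backends themselves) is an invented illustration, not Facebook's implementation.

```python
# Sketch: fail over across redundant backends so no single server
# is a point of failure. All names are illustrative.

class BackendDown(Exception):
    pass

def fetch_with_failover(backends, key):
    """Try each redundant backend in turn; raise only if all fail."""
    last_err = None
    for backend in backends:
        try:
            return backend(key)
        except BackendDown as err:
            last_err = err  # this replica is down; try the next one
    raise last_err

def healthy(key):
    return f"value-for-{key}"

def broken(key):
    raise BackendDown("replica unreachable")

# The first replica is down, but the request still succeeds.
result = fetch_with_failover([broken, healthy], "profile:42")
```

Note how this connects to the "smart failover can be dangerous" warning: if `broken` were merely slow rather than failing fast, aggressive retries could multiply the load on the remaining replicas.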
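The load-shedding advice above ("shed load when in trouble," rather than pushing problems upstream) might look like the following sketch: once the request queue passes a threshold, the server rejects new work immediately instead of letting latency grow unboundedly. The threshold and class names are illustrative assumptions.

```python
# Sketch of load shedding: reject new requests once the queue is
# full, failing fast instead of pushing the problem upstream.
# The threshold value is illustrative.

MAX_QUEUE_DEPTH = 100

class Server:
    def __init__(self):
        self.queue = []

    def accept(self, request):
        if len(self.queue) >= MAX_QUEUE_DEPTH:
            return "rejected"  # shed load: fail fast while in trouble
        self.queue.append(request)
        return "queued"

srv = Server()
results = [srv.accept(i) for i in range(150)]
```

A fast "rejected" response lets upstream callers retry elsewhere or degrade gracefully, which is far better than every caller timing out behind an ever-growing queue.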
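The "measure everything" lesson above can be approximated with a simple in-process metrics collector that tracks counters and latency samples per operation. A real system would ship these to a dedicated stats pipeline; all names here are invented for illustration.

```python
# Sketch: collect counters and latency samples for every operation,
# so later analysis can surface problems you didn't know you had.

from collections import defaultdict

class Metrics:
    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)

    def count(self, name):
        self.counters[name] += 1

    def record_latency(self, name, ms):
        self.latencies[name].append(ms)

    def avg_latency(self, name):
        samples = self.latencies[name]
        return sum(samples) / len(samples) if samples else 0.0

metrics = Metrics()
metrics.count("requests")
metrics.count("errors")
metrics.record_latency("db_query", 12.0)
metrics.record_latency("db_query", 18.0)
```

Even this crude collector supports the talk's point: trends in failure rates and latency distributions reveal problems long before users report them.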