Critical Lessons Learned at Facebook on Scalability and Reliability

Facebook is no doubt the biggest Web site, having surpassed Google in Web traffic according to an article published half a year ago. Given its scale, the lessons it has learned should be very helpful for others building scalable IT infrastructures. This post is based on the notes I took at the talk by Robert Johnson and Sanjeev Kumar at the LISA 2010 conference. Should there be any mistakes, they are all mine.

According to the speakers, the architecture of Facebook is relatively simple: Web servers in the front, databases at the back, and in the middle a caching layer made up of a lot of memcached servers. If you recall my previous post, they use PHP extensively.
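
To make the three tiers concrete, here is a minimal Python sketch of a cache-aside read, one common way to put a cache between Web servers and databases. It is only an illustration: the talk did not describe how Facebook's code reads through its cache, the class name `CacheAsideStore` is hypothetical, and a plain dictionary stands in for both the memcached pool and the database.

```python
from typing import Callable, Dict, Optional


class CacheAsideStore:
    """Check the cache first and fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._cache: Dict[str, str] = {}    # stand-in for a memcached pool
        self._load_from_db = load_from_db   # stand-in for a database query
        self.db_reads = 0                   # counts how often the database is hit

    def get(self, key: str) -> Optional[str]:
        # Cache hit: the database is never touched.
        if key in self._cache:
            return self._cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        # Call this after a write so the next read fetches fresh data.
        self._cache.pop(key, None)


if __name__ == "__main__":
    database = {"user:1": "Alice"}          # hypothetical data
    store = CacheAsideStore(database.get)
    store.get("user:1")                     # miss, goes to the database
    store.get("user:1")                     # hit, served from the cache
    print("database reads:", store.db_reads)
```

The point of the middle layer is that repeated reads of the same key reach the database only once, until the key is invalidated.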


Unlike other sites, such as email sites, whose users map cleanly onto separate servers and stay isolated from one another, social Web sites like Facebook face a unique challenge: their users are linked together. An error in one part of the system can easily cascade and bring down the whole site.

Here are several important lessons Facebook learned while building software and operating the site:

  1. Only change one thing at a time.
    A small change can cause a big problem. If you change only one thing at a time, you can trace any problem that follows back to the change that caused it. To make such small changes practical, you need supporting infrastructure such as a configuration management system.
  2. No single points of failure
    1. Redundant hardware at every level from load balancers to network links.
    2. Software can be a SPOF as well
  3. Don’t make small problems big
    1. Don’t push problems upstream
    2. Beware of smart failovers, which can be dangerous
    3. Shed load when in trouble
    4. Reduce system startup time
  4. Measure everything
    1. Data is key: activities, performance stats, failure rates
    2. Discover problems that you didn’t know you had through analysis
  5. Always do a post-mortem
    1. Weekly review
    2. Focus on issues, not blame
    3. Identify technical and organizational issues and follow-up actions
  6. Culture is important
    1. Release often
    2. Control and responsibility
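
The advice under lesson 3 about dropping load when the system is in trouble can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions, not what the speakers showed: the names `LoadShedder` and `handle` and the fixed in-flight limit are hypothetical. The idea is to reject extra requests quickly rather than queue them until every request becomes slow.

```python
import threading
from typing import Callable


class LoadShedder:
    """Reject new requests once too many are already in flight."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight  # hypothetical capacity limit
        self.in_flight = 0
        self.rejected = 0                   # worth tracking: "measure everything"
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                self.rejected += 1          # fail fast instead of queueing
                return False
            self.in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            if self.in_flight > 0:
                self.in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], str]) -> str:
    """Run work only if there is capacity; otherwise answer immediately."""
    if not shedder.try_acquire():
        return "503 overloaded"
    try:
        return work()
    finally:
        shedder.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(handle(shedder, lambda: "200 ok"))
```

A rejected request costs almost nothing, so the requests that are accepted keep their normal latency and the problem stays small instead of spreading upstream.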



    Me: Steve Jin, VMware vExpert, who authored VMware VI and vSphere SDK (Prentice Hall) and created the de facto open source vSphere Java API while working at VMware engineering. Companies like Cisco, EMC, NetApp, HP, Dell, and VMware are among the users of the API and other tools I developed, for their products, internal IT orchestration, and test automation.