How Twitter Operates Its IT infrastructure: From Process to Tools
This post is based on my notes taken at the talk by John Adams at LISA 2010 conference. Any mistakes, if any, are all mine. Should you be interested in other sites, check out Google, Facebook, LinkedIn.
As one of the leading social Web site with 165M users, Twitter demands a huge infrastructure support its operation. There are 700M searches and 1,000 tweets per second and can go up to almost 4,000 at peak. The number of tweets is not that impressive, but these tweets need to be distributed to numerous followers which could be several millions after one account.
Lost VMs or Containers? Too Many Consoles? Too Slow GUI? Time to learn how to "Google" and manage your VMware and clouds in a fast and secure HTML5 App.
These days Twitter gets 75% traffic from API and 25% from the Web. The new twitter.com Web interface heavily uses AJAX and acts as API client to its backend.
As John put it, “nothing works the first time.” His recommendation is to use the best available technology for scaling. You will need to plan and build for more than one time to get it right.
As other social Web sites, Twitter uses free OSes. They found Unix friends fails at scale with cron, syslog, etc. The syslog has truncation, data loss, and aggregation issues during high traffic.
The mantra to manage large scale web site is to find the weakest point, and then take corrective actions. When it’s done, move on to next weakest point. The key to maintain MTTD (Mean Time to Detect) and MTTR(Mean Time to Recovery).
Operating a large Web site involves many different activities from daily operations to long term strategic planning. Whenever possible, instrument all these activities. It will pay off the efforts.
- Monitoring: Visualize critical metrics using APIs into a single glass pane
- Profiling: find bottlenecks (latency, net usage, memory leaks) using network service tools like tcpdump, tGoogel PerfTool)
- Forecasting: predict the future capacity requirement using tools like fityk, curveFit.
- Configuration management: it should start EARLY in your company; don’t wait. Twitter uses Puppet + SVN to manage hundreds of modules.
- Runs constantly
- Post-commit idiot decks
- No one logs into machine
- Centralized change
Software Tools and Frameworks
Besides the tools mentioned in last part, Twitter also uses these tools for building and managing its infrastructure. You may find some of them useful to your projects.
- Loony: it’s an asset management tool that track machines with mySQL and LDAP. It allows you to list, filter, and search machines. Not sure it’s an internal or open source tool, but I cannot find much information from Web.
- Murder: it’s a bit-torrent based replication software for deployments. In 30 to 60 seconds, it can deploy 1,000 machines at Twitter. It comes with python library.
- ReviewBoard: this project was created by my VMware colleagues Christian Hammond and David Trowbridge. It “takes the pain out of code review” by displaying changes from source control system in Web interfaces for reviewers to comment. I use it at VMware and like it very much.
- Scribe: Twitter used syslog and found it doesn’t work at high traffics due to data loss or truncation. So they switched to Scribe, developed and open sourced by Facebook, with their patches on compression. The log is saved locally and then sent to an aggregation node.
- Hadoop: it’s used to analyze huge amount of data like log data and turn it into information shown on dashboard.
- FlockDB: it’s a home grown distributed DB for storing graph information (think about the following relationship). This is not a system administration tool per se, but nice to know.
- Kestrel: it’s Twitter’s “new message queue, written in 1,500 lines in Scala.”
- Thrift: it’s a framework for building distributed software. It’s very similar to SOAP except that Thrift is much efficient in transfer same amount of data.
- RAILS: it’s a Web framework famous for “Convention over Configuration.” It’s used as front end at Twitter. A while back, I heard Rails does not scale well and some users went back to PHP. According to John, it’s not the case. If Twitter is fine with its scalability, I bet any other enterprises or Web sites should be just fine.
System Admin 2.0
SystemAdmin 2.0 has another name called devops which I blogged a while back. According to John, with new technologies system administrators need to pick up programming skills. The system management is becoming programming tasks using tools like Cfengine, Chef, Puppet, etc. There’s no more solos.
Want to read more? I just find John’s presentation went online here.