Last week I attended a great talk by Google Fellow Jeffrey Dean at Stanford University. Jeff talked about his first hand experience on building software systems at Google since 1999 and lessons learned. The following summary is solely based on my notes, therefore may contain my misunderstandings.
A Brief History
Bothered by SLOW Web UI to manage vSphere? Want to manage ALL your VMware vCenters, AWS, Azure, Openstack, container behind a SINGLE pane of glass? Want to search, analyze, report, visualize VMs, hosts, networks, datastores, events as easily as Google the Web? Find out more about vSearch 3.0: the search engine for all your private and public clouds.
During the past 10 years or so, the scale of the Google infrastructure has grown exponentially: # docs 1,000X; #query, 1,000X; per doc index, 3X; update rate from months to seconds, 50,000X; query latency, 5X; computer and computing powers, 1,000X. The underlying infrastructure has experienced 7 major revisions in the last 11 years.
At the concept level, the search infrastructure is simple. It has web servers upfront taking search queries. The queries are then passed on to two different types of servers: index servers and doc servers. For the index server, the input is the query string and the output is an array of doc-id and score pairs. For the doc servers, the input is the doc-id and query pair and the output is the title and snippet of the doc. Note that the snippet of the doc is query dependent so that you can find your keywords highlighted in the result pages. How to quickly and accurately calculate the output based on input involves a lot of advanced algorithms, and is not in the scope of Jeff’s talk.
To improve performance, you can easily think of caching. Yes, Google search does have caching servers for both index and doc servers. According to Jeff, the hit rate is about 30% to 60%. It’s very nice of course. But the system may experience latency spike and capacity drop when index got updated or cache flushed.
Jeff recalled during the early days when the indexed pages grew from 50M to 1000M, and the search requests grew 20% on monthly basis. After they cut a deal with Yahoo on search, the workload doubled overnight. In addition to pulling in more machines, software improvement also contributed 10% to 30%.
When the index system scales up, it’s divided into many shards, each of which holds many replicas for sharing workload. As you can imagine, before each shard there will be a load balancer for dispatching requests.
Some of the search queries can cause huge IO. One example Jeff gave is “circle of life” as one phase enclosed in double quotation marks. It could incur 30GB I/O before. As I just searched on Google, the phrase is now a song name. I bet it’s now in Google’s cache server.
When there are so many servers to manage, something unique happened. Jeff mentioned an interesting phenomenon called “query of death.” If a query can causes a server to crash, then it can crash all other servers because the software stack is the same. To avoid large scale of crashes, they used canary request which is first sent to one machine. If it’s good, then send it to the rest of machines; otherwise reject the request after failing several times. It of course adds a little delay but far better than large scale system crash. Of course, you need to log down the query and found out why it crashed software as a process of continuous improvement. Jeff didn’t mention this, but I bet Google did that.
In 2004, there is another big upgrade. A new hierarchy was introduced as root server, parent server, and leaf servers. It’s designed for high performance with assumption that data is in memory. Of course, you cannot use memory the same way as hard disk, therefore they have new format and encoding for efficiency.
In 2007, Google introduced Universal Search in which different format of contents like Web, images, news, blogs, videos, books, are all searched at the same time. Then a question came up: how to rate these from different format and put them into single result page? Again, not really a technical question and there leave out here.
System Software Stack
Google uses commodity hardware they assemble by themselves, and Linux as the operating system. For the large deployment in a scale of Google, the system availability is a big challenge especially they are using commodity hardware.
As Jeff pointed out, “Reliability and Availability MUST come from software.” To understand it, you cannot easily guarantee 100% availability of your hardware at reasonable prices. Even you can guarantee it, you cannot guarantee there is no bug in your operating system or your application especially your software is evolving all the time at fast pace.
Google’s solution is their cluster software. A cluster has 5K to 20K machines. It’s still a myth on how many machines/cluster/datacenters Google has these days. Google treats it as a secret. One thing is for sure it has many datacenters in different locations across the world.
To handle large amount of data, Google invented Google File System (GFS) and BigTable for distributed access.
For large scale processing, Google invented MapReduce in 2003. A big problem can be divided into smaller ones which can be distributed to run on many machines. The results are then aggregated. With availability built in software, the system can be very fault tolerant. As Jeff mentioned one sample in which 1,600 to 1,800 machines were lost but the overall computing was done just fine. What happened was operation team just unplugged the cable of those machines as part of maintenance without notifying the development team.
Jeff laid out several lessons they have learned along building one of the world’s largest computing infrastructures:
- Many internal services. These services should have few dependencies, are easy to test and deploy new versions. The development cycles are largely decoupled. Google.com for example leverages 200+ internal services.
- Design efficient system. A good designer should know the performance of a designed system before it’s built. How? She/he should know the basic numbers of processing costs and calculate the overall performance on the back of an envelope.
- Not try to be all things to all people. His rule of the thumb is to do 6 of them. Having 7 of them requires real thoughts. Having 8 of them results in a worse system.
- Don’t build infrastructure just for its own sake. It means you must identify common needs and don’t image them. The best way to do this is to use your own infrastructure so that you will have rapid feedback, and shorter cycle. This is another way of “Eat your own dog food.”
- Design for growth but don’t design to scale infinitely. You can shoot for 5 to 10 time growth. If a system grows 1,000 times, it requires rethinking and rewriting.
- Single master with 1,000+ workers clustering. You can use hot standby master. The master is for management and scheduling. Always let your client talk to the worker machine directly so that the master can avoid being a bottleneck. For each worker machine, have multiple smaller computing units or tasks, and have faster recovery in place. For scheduling purpose, these tasks have requirements in terms of CPU, Memory, network, and disk space.
It’s interesting to review the history, even so to find out what’s now and next. As a Google Fellow, Jeff oversees the infrastructure. What he does now is probably what’s next for Google infrastructure. So what is he working on now?
It’s a project called Spanner. Its purpose is indirection of data, or should we call it data virtualization? Today the data has its locality, meaning you have to know exactly where the data is. If for whatever reason it moves to another location, your application could break. The goal is to make 10~18 bytes of data transparent across 100 to 1,000 locations, and accessible by 10~5 t0 10~7 machines.