How does LinkedIn.com do the search?
Search has been a hot topic since Google successfully monetized it with an advertising business model. Besides general Web content search of the kind Google does, other Internet companies need many other types of search.
LinkedIn.com, where most of us have created profiles, offers search based on criteria like keywords, names, location, industry, current and past companies, school, etc. Because what LinkedIn.com has is well-structured data, you and I expect it to do a better job than Web searches. In fact, it does.
Out of curiosity, I attended an SDForum SIG seminar this evening by John Wang (try a LinkedIn search to see if you can find him), who is the Search Architect at LinkedIn.com. He explained how they use open source Apache Lucene for their search project. Along the way, they created a real-time search and indexing system based on Lucene called Zoie, and an extension to Lucene for faceted search called Bobo-browse. Both projects are open source and hosted on Google Code.
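To make "faceted search" a bit more concrete: alongside the matching profiles, a faceted engine returns a count per value of a structured field (industry, location, and so on), so you can keep narrowing the results. The following is only a toy Python sketch of that idea. It is not Bobo-browse's API, and the profile fields and the `search` and `facet_counts` helpers are made up for illustration.

```python
from collections import Counter

# Hypothetical toy profiles; the field names are illustrative only.
PROFILES = [
    {"name": "Ann", "industry": "Software", "location": "SF Bay Area",
     "keywords": {"search", "lucene"}},
    {"name": "Bob", "industry": "Software", "location": "New York",
     "keywords": {"search", "java"}},
    {"name": "Cy", "industry": "Finance", "location": "New York",
     "keywords": {"trading", "java"}},
    {"name": "Di", "industry": "Software", "location": "SF Bay Area",
     "keywords": {"cloud"}},
]


def search(profiles, keyword):
    """Return every profile that carries the given keyword."""
    return [p for p in profiles if keyword in p["keywords"]]


def facet_counts(hits, field):
    """Count how many hits fall under each value of a structured field."""
    return Counter(p[field] for p in hits)


hits = search(PROFILES, "java")
print(len(hits), "hits")
print("by industry:", dict(facet_counts(hits, "industry")))
print("by location:", dict(facet_counts(hits, "location")))
```

A real engine computes these counts inside the index rather than by scanning a list, but the output has the same shape: the hits, plus counts you can click on to refine the query.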
Several interesting points from the seminar are:
- LinkedIn search results are “no cheating”: the hit counts are exact. If you have tried Google or other search engines, you may have noticed that the number of hits is an estimate at best. When you click through 30 pages or so, you will see a significantly smaller number of hits than claimed on the first page. That doesn’t mean the engine inflates the numbers; most likely it just doesn’t expect you to go that far.
- Their search system is distributed, with 5 million profiles per partition. A broker dispatches the queries and aggregates the results. Each node has two partitions, which are replicated to other servers in the cluster. Each node has 8 GB of memory, but they don’t use nearly that much; it is there for peace of mind in case network traffic surges. This sounds like a good reason to get virtualized for dynamic resource allocation.
- Their system gets about 15 million queries per day. Do you know who makes the most queries? The paid recruiters. Some of them use queries with 100 conditions to nail down the right candidates. The peak-to-trough ratio is 5. The busiest time of the week is Monday morning, Pacific time. Again, that is when most recruiters start looking for new candidates to work on for the week.
- A special search called reference search lets you find people in your network who overlapped with a given person at the same company. As an employer, you can use it to find references for a candidate. As a candidate, you can use it to find connections to the hiring manager. That is the good side. The bad side is that it’s not free: to use it, you have to upgrade your account. Technically, there are quite a few challenges in implementing reference search, because it has to work out the time overlaps.
- The RPC communication between their distributed nodes is slower than the searches themselves, which was a bit of a surprise to me. So they are going to move to Google’s Protocol Buffers (protobuf), which is, as claimed on the project home page, used by Google “for almost all of its internal RPC protocols and file formats.” At face value, it’s very much like the Thrift project, which was initially created by Facebook and is now part of Apache.
- Last but not least, a point that has something to do with cloud computing. They ran a small trial with Amazon, but their data is too sensitive to pass back and forth in the public cloud. I think they are not alone in having strong concerns about the security of the public cloud. That is where a private cloud can come in to help.
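The broker-and-partitions arrangement from the list above, together with the exact hit counts, can be sketched in a few lines of Python. This is a guess at the general shape, not LinkedIn's implementation: the `Partition` and `Broker` classes, the term-count scoring, and the thread pool standing in for remote calls are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """One index partition holding a slice of the profiles (hypothetical)."""

    def __init__(self, docs):
        self.docs = docs  # list of (profile_id, text) pairs

    def search(self, term, k):
        # Toy scoring: how many times the term occurs in the profile text.
        scored = [(text.lower().split().count(term), pid)
                  for pid, text in self.docs]
        matches = [(score, pid) for score, pid in scored if score > 0]
        # Return the exact local hit count plus the local top k.
        return len(matches), heapq.nlargest(k, matches)


class Broker:
    """Sends a query to every partition and merges the partial results."""

    def __init__(self, partitions):
        self.partitions = partitions

    def search(self, term, k=10):
        term = term.lower()
        # Threads stand in for the RPC calls a real broker would make.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            results = list(pool.map(lambda p: p.search(term, k),
                                    self.partitions))
        # Summing exact per-partition counts gives an exact total.
        total = sum(count for count, _ in results)
        merged = heapq.nlargest(
            k, (hit for _, hits in results for hit in hits))
        return total, [pid for _, pid in merged]


broker = Broker([
    Partition([(1, "lucene search engineer"), (2, "java developer")]),
    Partition([(3, "search search architect"), (4, "recruiter")]),
])
total, top = broker.search("search", k=1)
print(total, top)
```

Each partition only needs to return its own top k, because the global top k can never contain more than k results from a single partition. The total stays exact because every partition counts all of its matches, not only the ones it returns.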
More information on their projects can be found at http://sna-projects.com/sna/. They are hiring, BTW.
Note: This post is based solely on my notes and memory.