Hadoop Summit 2012: A Quick Summary
After the Churchill event on Hadoop for enterprises, I attended the Hadoop Summit at the San Jose Convention Center. It's one of the benefits of living in Silicon Valley that I can attend various tech events without flying away from my family for days.
According to the organizer, the Summit attracted 2,200 attendees this year, more than 10 times the 200 attendees at the first conference in 2008. It reminds me of the stellar success of VMworld, which enjoyed a similar growth rate before 2008. Technically, Hadoop seems to be at a similar stage to virtualization 10 years ago: the technology is ready for wider adoption in enterprises. That explains why so many people attended the conference.
The conference offered seven tracks: Analytics and BI, Application and Data Science, Deployment and Operations, Enterprise Data Architecture, Future of Apache Hadoop, Hadoop in Action, and Reference Architecture.
Although I am on the software development side and enjoy writing code, I was actually more interested in enterprise use cases, reference architectures, and the operational aspects of Hadoop. I figured I could learn coding easily by myself, but not the lessons that only come from experience under certain conditions, such as deployments at scale. Also, real-world use cases are always good guidance on what to learn and focus on when coding.
Here are a few sessions I attended. I wish I could have attended all of them, but with seven parallel tracks I only caught about one seventh of the sessions. Hopefully I can get the slides later on and browse through them all.
- Apache Hadoop MapReduce: What’s Next? by Arun Murthy of HortonWorks, a spin-off from Yahoo. The presenter talked about the new architecture called YARN, introduced in the Hadoop 2 alpha released a few weeks ago. YARN does not assume the MapReduce processing model; rather, it treats MapReduce as just one kind of distributed processing it supports, which opens up many possibilities for other processing frameworks. Performance-wise, YARN yields a 2x+ gain over the old architecture. With that in mind, I attended one of the last sessions, “Writing New Application Frameworks on Apache Hadoop YARN” by Hitesh Shah. The key concepts in the new architecture are the ResourceManager, NodeManager, and ApplicationMaster. We will dig into the framework a bit more and may build something with it later on.
- Unified Big Data Architecture: Integrating Hadoop within an Enterprise Analytical Ecosystem by Priyank Patel from Teradata Aster. Aster is a recent acquisition by Teradata for its Hadoop expertise. Priyank laid out a good case by first categorizing the data types: structured data with a schema, like ERP and CRM data; semi-structured data, like logs; and unstructured data with a format but no schema, like web pages, video, and images. He then explained how to apply traditional data warehouses and Hadoop in a complementary way.
- HMS: Scalable Configuration Management System for Hadoop by Kan Zhang and Eric Yang from IBM. HMS addresses the management issue: how to deploy the whole Hadoop stack and manage its lifecycle on large-scale clusters. The system is based on ZooKeeper, which is “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” HMS has since been forked into another project called Ambari.
- Unleash Insights On All Data with Microsoft Big Data by Tim Mallalieu from, of course, Microsoft. The interesting part of the presentation was the live demos showing how they use Microsoft front ends like Excel to drive Hadoop and link it with SharePoint. I was a little sleepy, so I may have mistaken or missed something, even though Microsoft demos have always been great.
- Infrastructure around Hadoop – Backups, failover, configuration, and monitoring by Terran Melconian and Ed MacKenty from TripAdvisor. The talk introduced their accompanying tools: an HDFS and Hive DDL backup system, Ganglia- and Nagios-based monitoring, DRBD for master failover, and Puppet for configuration management. I guess pulling these tools together around a Hadoop cluster is not rocket science, but it does take effort and thoughtful integration. I am not an administrator, so most of these, except Puppet, are new to me.
- Deployment and Management of Hadoop Cluster with Ambari. As pointed out earlier, Ambari is forked from HMS, so the two share some commonalities. It aims to integrate with existing open source packaging and deployment technologies like Puppet/Chef, Yum, Apt, and Zypper (new to me). As a side note, this project will likely be a key piece for HortonWorks to generate revenue in the future.
- Optimizing MapReduce Job Performance by Todd Lipcon from Cloudera. A key point he made was that you have to understand how something works internally in order to optimize it well, which seems true for almost every discipline except maybe gambling. A few settings were discussed, mainly the sort buffer size, the ratio of mappers to reducers, etc. I have forgotten the detailed configuration parameters but will read them again when needed. Some of the settings only apply to the 1.x architecture, so I may never need to look back.
- Network reference architecture for Hadoop – validated and tested approach to define a reference network design for Hadoop by Nimish Desai from Cisco. You may be a little surprised to find a Cisco talk here. So was I. But according to the earlier talk on performance, network I/O is the top performance bottleneck for a Hadoop cluster, and for that reason the room was pretty packed. The key message was that a 1G network is good enough in most cases, while 10G is the better future-proof choice.
- Hadoop Cluster Management by Dheeraj Kapur from Yahoo, which runs the biggest Hadoop cluster in the world; whatever experience is gained there is helpful to others. The talk covered tasks like OS upgrades, bug patch installation, monitoring, user management, etc. Their workflow-based system “includes JMX based libraries designed for cluster management.” When asked whether they would open source it, the presenter said no.
- Writing New Application Frameworks on Apache Hadoop YARN. I’ve covered this a bit under the first session. I definitely need to dig deeply into it and think hard about what solutions can leverage it. Again, I am not going to start a project just for the sake of learning something; there has to be a real-world case that needs the technology, not the other way around.
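For reference on the tuning session: the kinds of knobs discussed live in mapred-site.xml in the Hadoop 1.x line. The values below are illustrative only, chosen by me and not the speaker's recommendations; the right numbers depend on the workload and hardware.

```xml
<!-- mapred-site.xml (Hadoop 1.x); values are illustrative, not recommendations -->
<configuration>
  <property>
    <name>io.sort.mb</name>             <!-- map-side sort buffer size, in MB -->
    <value>256</value>
  </property>
  <property>
    <name>io.sort.spill.percent</name>  <!-- buffer fill ratio that triggers a spill to disk -->
    <value>0.90</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>    <!-- number of reducers for the job -->
    <value>20</value>
  </property>
</configuration>
```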
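To make the YARN concepts from the first session more concrete, here is a toy Python sketch of my own, not YARN's actual API: the ResourceManager is a cluster-wide scheduler that knows nothing about MapReduce, NodeManagers report per-machine capacity, and a per-job ApplicationMaster negotiates containers. All class and method names here are my invention for illustration.

```python
# Toy model of YARN's three core roles (a simplification, not the real API).

class NodeManager:
    """Per-machine daemon: tracks local capacity and runs containers."""
    def __init__(self, host, capacity_mb):
        self.host = host
        self.free_mb = capacity_mb

class ResourceManager:
    """Cluster-wide scheduler: grants containers, knows nothing of MapReduce."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mem_mb):
        # First-fit: grant a container on any node with enough free memory.
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                return {"host": node.host, "mem_mb": mem_mb}
        return None  # cluster full; a real RM would queue the request

class ApplicationMaster:
    """Per-job coordinator: asks the RM for containers, runs tasks in them."""
    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def request_containers(self, count, mem_mb):
        for _ in range(count):
            container = self.rm.allocate(mem_mb)
            if container:
                self.containers.append(container)
        return self.containers

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
am = ApplicationMaster(rm)
granted = am.request_containers(3, 2048)
print([c["host"] for c in granted])  # ['node1', 'node1', 'node2']
```

The point of the separation is visible even in the toy: because the ApplicationMaster, not the platform, defines what runs inside a container, any distributed framework, not just MapReduce, can sit on top.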
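On the HMS session: to illustrate why a service like ZooKeeper fits configuration management, here is a toy Python sketch of the core idea, a central tree of nodes with watches, so agents on each machine react when the configuration changes. This is my own simplification for illustration, not the actual ZooKeeper client API, and the paths and names are made up.

```python
# Toy sketch of the ZooKeeper idea: a central tree of "znodes" with watches.
# (My own simplification, not the real ZooKeeper API.)

class ConfigService:
    def __init__(self):
        self.znodes = {}    # path -> value
        self.watchers = {}  # path -> list of callbacks

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def set(self, path, value):
        self.znodes[path] = value
        # Notify every agent watching this path, as ZooKeeper watches would.
        for callback in self.watchers.get(path, []):
            callback(path, value)

applied = []  # records what each node's agent actually applied

def make_agent(node):
    def on_change(path, value):
        # A real agent would rewrite local config files and restart daemons.
        applied.append((node, path, value))
    return on_change

zk = ConfigService()
for node in ["node1", "node2"]:
    zk.watch("/hadoop/conf/mapred.reduce.tasks", make_agent(node))

zk.set("/hadoop/conf/mapred.reduce.tasks", "20")
print(applied)  # both agents observed and applied the change
```

One central write fans out to every watching node, which is exactly the property a cluster-wide configuration manager like HMS needs.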
Again, I could only attend this many sessions, but I will study more once I get the slides of the other sessions.
Exhibition and Ecosystem
Compared with other conferences I’ve attended, Hadoop Summit did not have many exhibitors. I think it’s mainly because the technology has not yet grown a strong ecosystem. The good news is that this made it easier for me to write about them.
As I mentioned before, we IT professionals like the term “stack,” which is essentially a set of layers. Let’s apply the same stack concept to the Hadoop ecosystem. The vendors mentioned below are only those that showed up at the exhibition; the actual ecosystem is bigger.
- Hardware Vendors. Any systems vendor can be a Hadoop hardware vendor. NetApp, Cisco, SuperMicro, Dell, Intel, Arista, Mellanox (InfiniBand), etc., cover compute, storage, networking, or whole systems.
The general perception in the community is that commodity hardware is the norm for Hadoop clusters, because it was one of the design assumptions. That makes sense given the huge clusters at Web companies, but as the scale comes down, the economic advantage may not be as obvious. Another important observation, from Cisco’s lab, is that any node failure can disproportionately increase the overall job processing time. Premium hardware has better uptime and therefore reduces overall job processing time. I think premium hardware, even though more expensive, may have its place in Hadoop infrastructure.
- Hadoop Platform and Distribution. These vendors include HortonWorks, Cloudera, and MapR. Other companies, such as EMC Greenplum, distribute Hadoop as well.
- Tooling and Management. Vendors include Talend, Datameer, StackIQ, and VMware. I liked the Datameer demo, which hides Hadoop behind an Excel-like Web interface for business users. I think it’s headed in the right direction.
VMware tries to solve the cluster management issue with the flexibility of virtual machines. For vSphere to succeed here, VMware has to address two concerns: performance and license cost. With the current vRAM licensing model, no company would use vSphere in a large-scale Hadoop deployment unless it has already over-bought capacity or holds an unlimited enterprise license, which I am not sure VMware offers. Having said that, it’s worth checking out the Serengeti project that VMware just open sourced under the Apache license.
- Business Intelligence Applications. Companies like IBM, Teradata Aster, Tableau, and Pentaho were there. We definitely need more applications, especially ones that are ready to use out of the box.
I think Hadoop technology has grown to a tipping point from which wider adoption is ready to take place. It has attracted an enthusiastic community that is very excited about its potential and working hard toward that goal.
As I see it, the community needs to look beyond the platform and grow the tooling and application ecosystem so that the barriers to using Hadoop in enterprises are lowered or removed. This is not an easy task at all, but neither is it mission impossible.
I will write more on how Hadoop community can learn from the success of virtualization in a separate article.