I went to LinkedIn last Wednesday for a tech talk by UC Berkeley professor Joseph Hellerstein on Programming for Distributed Consistency: CALM and Bloom. It is a highly specialized topic, so I won't go into the details here. Should you be interested in the new programming language Bloom, you can check the website (http://bloom-lang.org).
What I will discuss here is the data processing tool the speaker introduced at the end of his talk. The tool, called Wrangler, was developed by the Stanford University VIS group. According to the project page, “Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data.”
The speaker quickly demoed how easily Wrangler can convert a text file into a tabular data set. When I tried it later at home, it didn't seem quite so easy to use. Lack of practice, I guess.
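To give a flavor of the kind of transformation Wrangler automates interactively, here is a minimal hand-rolled sketch in Python (this is not Wrangler's code or API; the sample data and the two-or-more-spaces delimiter rule are made up for illustration):

```python
import re

def text_to_table(raw_text):
    """Turn loosely formatted text into rows: split each non-empty
    line on runs of two or more spaces, so values containing a
    single space (e.g. 'New York') stay in one column."""
    return [re.split(r"\s{2,}", line.strip())
            for line in raw_text.splitlines() if line.strip()]

# Hypothetical sample input, in the spirit of the demo.
raw = """\
2011-06-20  New York   42
2011-06-21  Boston     17
2011-06-22  Chicago     9
"""

for row in text_to_table(raw):
    print(row)
```

The point of a tool like Wrangler is that you never write this splitting logic by hand: you click on the data, and the tool infers and suggests the transformation for you.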
This tool reminds me of the problem Hadoop faces today. Although Hadoop is a data processing tool, it remains a game for developers. Making it easy for business users such as data scientists, who may not know Java programming or Pig, will surely accelerate the adoption of the open source project. More importantly, it will bring in larger revenues by serving people who are more willing to pay than developers are.
As I mentioned in my Hadoop Summit summary, Datameer has done a decent job of using an Excel-like Web front end to hide the complexity. That significantly reduces the learning curve. At the same time, it is limited by Excel's processing model and constraints. In other words, it may not be flexible enough to handle real-world cases.
In my understanding, the Wrangler tool has the potential to be further developed into a more generic Hadoop front end for business users. It won't be as flexible as Java, but it is probably sophisticated enough for most use cases.