Big Data or Big Junk?
Two weeks ago I ran into a problem with my blog site. Somehow I could not post there to announce the GA of the open source ViJava API for vSphere 5.1. After some searching and researching, I found that the wp_commentmeta table was filled with so much extra data that it exceeded the 100MB per-database limit imposed by my service provider. While I was enjoying the Thanksgiving holiday, spammers and their robots had worked diligently, posting thousands of spam comments on my site.
After I deleted most of the records in the table, the database didn't shrink at all, even when I tried to run the optimize command (in fact, I was denied permission to run it). I then called technical support and got it fixed the next day. I guess the service provider may run the optimize command on every database as a batch job when server load is low during the day, in which case the space would eventually have been reclaimed even without calling tech support. Anyway, I had no interest in digging further, as the gentleman on the support line knew little about the database behind WordPress.
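This behavior is easy to reproduce on a small scale. The sketch below uses SQLite rather than the MySQL database behind WordPress, but the principle is the same: deleting rows marks their pages as free without shrinking the file, and a separate maintenance command (VACUUM in SQLite, OPTIMIZE TABLE in MySQL) is needed to actually return the space.

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file and fill a table with junk rows,
# loosely mimicking spam piling up in wp_commentmeta.
path = os.path.join(tempfile.mkdtemp(), "blog.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE commentmeta (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO commentmeta (value) VALUES (?)",
    [("spam " * 200,) for _ in range(5000)],
)
conn.commit()
size_full = os.path.getsize(path)

# Deleting every row does NOT shrink the file; the freed pages
# stay inside it, ready for reuse.
conn.execute("DELETE FROM commentmeta")
conn.commit()
size_after_delete = os.path.getsize(path)

# VACUUM rebuilds the file and releases the free pages,
# just as an optimize pass finally did for my blog database.
conn.execute("VACUUM")
size_after_vacuum = os.path.getsize(path)
conn.close()

print(size_full, size_after_delete, size_after_vacuum)
```

Running it shows the file staying at full size after the delete and only collapsing after the VACUUM, which is why my database stayed over the limit until the provider stepped in.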
Although the problem was resolved and my blog came back to normal, it made me think more about data: is it really the bigger the better? The spam comments consumed most of my blog database, yet they hurt my blog to the point that I could not get my work done. Despite its size, I wouldn't call that data big data, but rather big junk. To me, the spam comments are not only useless but also harmful.
To be fair, the spam was saved for a reason. The Akismet plugin for WordPress stores suspected spam in the database, and I assume copies are sent to a central server for analyzing spam patterns, so the comments are valuable to Akismet to some extent. If a comment turns out not to be spam, it can be returned to the approval queue, which I guess is why the content is kept. Still, given that almost all the comments sent for validation really are spam, it makes little sense to save their full content in the blog's database; saving just an index and retrieving the content elsewhere when needed would suffice. Anyway, let's not go too far on the Akismet plugin here.
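The index-instead-of-content idea can be sketched in a few lines. This is purely hypothetical and not how Akismet actually works: the local database would keep only a short digest that keys into content held on the central server, rather than the full spam body.

```python
import hashlib

def comment_digest(body: str) -> str:
    """Return a compact, deterministic key for a comment body
    (a hypothetical stand-in for a server-side lookup index)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]

# A typical chunk of comment spam: about a kilobyte of text
# that would otherwise sit in full inside wp_commentmeta.
spam_body = "Buy cheap watches now!!! " * 40

key = comment_digest(spam_body)
print(len(spam_body), "bytes of spam reduced to a", len(key), "char key")
```

Because the digest is deterministic, the same spam body always maps to the same key, so the blog could look the content up remotely if a comment ever needed to go back to the approval queue.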
As of today, machines generate more data than humans do. We all know machines can produce data so fast that it easily fills storage and gets duplicated on servers across networks.
While we talk about big data and get excited about its size, I think it's more important to consider data quality than sheer volume. Without attention to quality, big data buys you nothing and may even hurt you in some way. At minimum you have to store it, and storage for big data can be a big cost. At that point, the big data is really big junk.
Boiling it down to the details, it's all about what data to collect, how to collect it, how to store it, how to transform it, and how to analyze and report on it. Big size should not be an end in itself but a means to what you want to achieve.