Big Data or Big Junk?

December 8th, 2012

Two weeks ago I had a problem with my blog site. Somehow I could not publish a post announcing the GA of the open source ViJava API for vSphere 5.1. After some searching and research, I found that the wp_commentmeta table had filled with so much data that the database exceeded the 100MB per-database limit imposed by my service provider. While I was enjoying the Thanksgiving holiday, some spammers and their robots had worked diligently, posting thousands of spam comments on my site.

After I deleted most of the records in the table, the database didn’t shrink at all. I tried to run the optimize command but was denied permission. I then called technical support and got it fixed the next day. I guess the service provider may have a batch job that runs the optimize command on every database when the server load is low during the day, so it might have been fixed even without my calling tech support. Anyway, I had no interest in finding out more, as the gentleman on the support line knew little about the database behind WordPress.
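
The "delete but no shrink" behavior is easy to reproduce. Below is a minimal sketch that uses SQLite from the Python standard library as a stand-in, not the MySQL database behind WordPress; the `commentmeta` table and the `measure_shrink` function are made up for illustration, and SQLite's `VACUUM` plays the role that the optimize command plays on MySQL.

```python
import os
import sqlite3
import tempfile


def measure_shrink(rows=5000):
    """Return file sizes (bytes) after filling, after DELETE, and after VACUUM."""
    path = os.path.join(tempfile.mkdtemp(), "blog.db")
    conn = sqlite3.connect(path)
    # Keep freed pages inside the file instead of truncating automatically.
    conn.execute("PRAGMA auto_vacuum = NONE")
    # Hypothetical table standing in for wp_commentmeta.
    conn.execute("CREATE TABLE commentmeta (id INTEGER PRIMARY KEY, meta TEXT)")
    conn.executemany(
        "INSERT INTO commentmeta (meta) VALUES (?)",
        (("spam " * 50,) for _ in range(rows)),
    )
    conn.commit()
    filled = os.path.getsize(path)

    # Deleting rows only marks their pages as free; the file keeps its size.
    conn.execute("DELETE FROM commentmeta")
    conn.commit()
    after_delete = os.path.getsize(path)

    # VACUUM rebuilds the file and hands the free pages back to the filesystem.
    conn.execute("VACUUM")
    conn.close()
    after_vacuum = os.path.getsize(path)
    return filled, after_delete, after_vacuum


if __name__ == "__main__":
    print(measure_shrink())
```

On MySQL the corresponding statement would be `OPTIMIZE TABLE wp_commentmeta;`, provided the hosting account is allowed to run it, which mine was not.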

Although the problem was resolved and my blog came back to normal, it made me think more about the data: is bigger really better? The spam comments took up most of my blog database and hurt my blog to the extent that I could not get my work done. Despite its size, I wouldn’t call it big data, but rather big junk. For me, the spam comments are not only useless but also harmful.

To be fair, the spam was saved for a reason. The Akismet plugin for WordPress saved it in the database. I assume the comments were also sent to the central server for analysis of spam patterns, so for Akismet they are valuable to some extent. If a comment turns out not to be spam, it may return to the approval queue, which I guess is why it is saved. Still, given that almost all the comments sent for validation are truly spam, it does not make sense to save their content in the blog; just save an index and retrieve the content only when it is needed. Anyway, let’s not go too far into the Akismet plugin here.
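
To make the "save an index only" idea concrete, here is a rough sketch. It is not how Akismet works and does not use its API; `RemoteSpamStore` and `LocalSpamIndex` are hypothetical classes, and a plain dictionary stands in for the central server. The blog side keeps one fixed-size key per quarantined comment instead of the full text.

```python
import hashlib


class RemoteSpamStore:
    """Hypothetical stand-in for a central spam service (not Akismet's API)."""

    def __init__(self):
        self._bodies = {}

    def put(self, body):
        # Use the SHA-256 hex digest of the text as its lookup key.
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._bodies[key] = body
        return key

    def get(self, key):
        return self._bodies[key]


class LocalSpamIndex:
    """What the blog database would keep: comment id -> remote key only."""

    def __init__(self, remote):
        self._remote = remote
        self._index = {}

    def quarantine(self, comment_id, body):
        # Ship the full text away and remember only the 64-character key.
        self._index[comment_id] = self._remote.put(body)

    def restore(self, comment_id):
        # Fetch the text back when a comment turns out not to be spam.
        return self._remote.get(self._index.pop(comment_id))

    def local_bytes(self):
        return sum(len(key) for key in self._index.values())
```

With this layout the local cost per spam comment is constant no matter how long the comment is, and the full text is fetched only in the rare case that a comment has to go back to the approval queue.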

As of today, there is more machine-generated data than human-generated data. We all know machines can generate data so fast that it can easily fill storage and be duplicated on different servers across networks.

While we are talking about big data and getting excited about its size, I think it’s important to think more about data quality than size. Without attention to quality, big data buys you nothing, or may even hurt you in some way. At minimum, you need to store it, and that could be a big cost for big data. By then, the big data is really big junk.

Boiling it down to the details, it’s all about what data to collect, how to collect it, how to save it, how to transform it, how to analyze and report on it, and so on. Big size should not be an end but a means to what you want to achieve.

Categories: Big Data
  9. January 29th, 2013 at 21:32 | #9

    Spammy comments are always a problem on any WordPress site. Luckily, you can install the Akismet plugin on any WordPress blog (like yours!). I would recommend that you set all comments to be moderated so that you prevent spammy comments from appearing. While a healthy amount of comments is necessary to develop a healthy community, it is important to keep the trash away. Do you agree?

  10. January 30th, 2013 at 01:22 | #10

    Thanks Damian! I had Akismet installed – in fact it’s installed by default. The problem was that it saves a copy of every spam message and only cleans it up after a certain number of days. I will try what you suggested.

  11. Soldier
    February 10th, 2013 at 21:58 | #11

    Interesting perspective. I often hear how we are now creating more new data per day than was created per year ten years ago.

    I hear that and think, yeah but is that of any value?

    There is the spam you mentioned, the empty texts and emails of “wuz up?” and junk like that.
    Not exactly the next Einstein or Shakespeare.

    Heck even my own post here is mostly just bellyaching observation………….. :)

  12. December 17th, 2013 at 13:20 | #12

    I think this is some of the most important information for me.
    And I’m glad to be reading your article. But I want to remark on a few general things: the website style is great,
    and the articles are really great :D. Good job, cheers
