Big Data or Big Junk?

Two weeks ago I ran into a problem with my blog site: I could not publish a post announcing the GA of the open source ViJava API for vSphere 5.1. After some searching, I found that the wp_commentmeta table had grown so much that the database exceeded the 100MB per-database limit imposed by my service provider. While I was enjoying the Thanksgiving holiday, some spammers and their robots had worked diligently, posting thousands of spam comments on my site.

After I deleted most of the records in the table, the database didn’t shrink at all. I tried to run the optimize command, but I was denied permission. I then called technical support and got it fixed the next day. I guess the service provider may have a batch job that runs the optimize command on every database when the server load is low, so the problem might have gone away even without the call to tech support. Anyway, I had no interest in finding out more, as the gentleman on the support line knew little about the database behind WordPress.
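The behavior described above, where deleting rows does not give the space back until the table is optimized, is common to many database engines. WordPress runs on MySQL, where the statement in question is OPTIMIZE TABLE. The sketch below is only an illustration and uses SQLite from the Python standard library as a stand-in, where VACUUM plays the analogous role. The table name spam_meta, the row count, and the row size are all made up for the demo.

```python
import os
import sqlite3
import tempfile


def sizes_after_delete_and_vacuum(row_count=2000):
    """Return (full, after_delete, after_vacuum) database file sizes in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blog.db")
        # isolation_level=None means autocommit; VACUUM cannot run
        # inside an open transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE spam_meta (id INTEGER PRIMARY KEY, body TEXT)")

        # Fill the table with junk rows in a single transaction.
        conn.execute("BEGIN")
        conn.executemany(
            "INSERT INTO spam_meta (body) VALUES (?)",
            (("x" * 500,) for _ in range(row_count)),
        )
        conn.execute("COMMIT")
        full = os.path.getsize(path)

        # Delete most of the rows: the pages are freed inside the file,
        # but the file itself keeps its size.
        conn.execute("DELETE FROM spam_meta WHERE id > 10")
        after_delete = os.path.getsize(path)

        # Rebuild the file so the free pages are actually released.
        conn.execute("VACUUM")
        after_vacuum = os.path.getsize(path)

        conn.close()
        return full, after_delete, after_vacuum


if __name__ == "__main__":
    print(sizes_after_delete_and_vacuum())
```

On a hosted MySQL database, the equivalent step needs a privilege that a shared hosting account may not have, which would match the denial I ran into.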

Although the problem was resolved and my blog came back to normal, it made me think more about data: is bigger really better? The spam comments took up most of my blog database and hurt my blog to the extent that I could not get my work done. Despite its size, I wouldn’t call it big data, but rather big junk. For me, the spam comments are not only useless but also harmful.

To be fair, the spam was saved for a reason. The Akismet plugin for WordPress saved the comments in the database. I assume they had been sent to a central server for analyzing spam patterns, so for Akismet they’re valuable to some extent. If a comment turns out not to be spam, it may return to the approval queue, and I guess that is why it is saved. Still, given that almost all the comments sent for validation are truly spam, it does not make sense to save their content in the blog database. It would be enough to save an index and retrieve the content only when it is needed. Anyway, let’s not go too far into the Akismet plugin here.
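The idea of keeping only an index locally can be sketched as follows. This is not how Akismet works, just a minimal illustration of the suggestion: the classes CentralStore and LocalSpamIndex are hypothetical, and a plain dictionary stands in for the remote server.

```python
import hashlib


class CentralStore:
    """Hypothetical stand-in for a remote spam server; here just a dict."""

    def __init__(self):
        self._bodies = {}

    def put(self, body: str) -> str:
        # Use the SHA-256 digest of the content as its lookup key.
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._bodies[key] = body
        return key

    def get(self, key: str) -> str:
        return self._bodies[key]


class LocalSpamIndex:
    """Keeps only comment_id -> key on the blog side, never the body."""

    def __init__(self, store: CentralStore):
        self._store = store
        self._keys = {}

    def flag(self, comment_id: int, body: str) -> None:
        self._keys[comment_id] = self._store.put(body)

    def restore(self, comment_id: int) -> str:
        """Fetch the body back, e.g. when a comment turns out not to be spam."""
        return self._store.get(self._keys.pop(comment_id))

    def local_bytes(self) -> int:
        # Characters held locally: one 64-character hex key per comment.
        return sum(len(key) for key in self._keys.values())
```

With this layout the blog database holds a fixed-size key per flagged comment no matter how long the spam is, and the full text is fetched only in the rare case that a comment has to go back to the approval queue.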

As of today, there is more machine-generated data than human-generated data. We all know machines can generate data so fast that it easily fills storage and gets duplicated on different servers across networks.

While we talk about big data and get excited about its size, I think it’s more important to think about data quality than size. Without attention to quality, big data buys you nothing, or may even hurt you in some way. At a minimum, you need to store it, and that can be a big cost for big data. At that point, the big data is really big junk.

Boiled down to the details, it’s all about what data to collect, how to collect it, how to save it, how to transform it, how to analyze and report on it, and so on. Big size should not be an end but a means to what you want to achieve.


  1. Posted December 8, 2012 at 6:46 pm | Permalink

    Big Data or Big Junk? via @sjin2008

  9. Posted January 29, 2013 at 9:32 pm | Permalink

    Spammy comments are always a problem on any WordPress site. Luckily, you can install the Akismet plugin on any WordPress blog (like yours!). I would recommend that you set all comments to be moderated so that you prevent spammy comments from appearing. While a healthy amount of comments is necessary to develop a healthy community, it is important to keep the trash away. Do you agree?

  10. Posted January 30, 2013 at 1:22 am | Permalink

    Thanks Damian! I had Akismet installed – in fact it’s installed by default. The problem was that it saves a copy of every spam message and only cleans them up after a certain number of days. I will try what you suggested.

  11. Soldier
    Posted February 10, 2013 at 9:58 pm | Permalink

    Interesting perspective. I often hear how we are now creating more new data per day than was created per year ten years ago.

    I hear that and think, yeah but is that of any value?

    There is the spam you mentioned, the empty texts and emails of “wuz up?” and junk like that.
    Not exactly the next Einstein or Shakespeare.

    Heck even my own post here is mostly just bellyaching observation………….. :)

  12. Posted December 17, 2013 at 1:20 pm | Permalink

    I think this is one of the most important pieces of information for me.
    And I’m glad to be reading your article. But I want to remark on a few general things: the website style is great,
    and the articles are really great :D. Good job, cheers

    Me: Steve Jin, VMware vExpert who authored the VMware VI and vSphere SDK by Prentice Hall, and created the de facto open source vSphere Java API while working at VMware engineering. Companies like Cisco, EMC, NetApp, HP, Dell, and VMware are among the users of the API and other tools I developed for their products, internal IT orchestration, and test automation.