Harmful Big Data

In one of my recent projects, I ran into a “big data” issue. One of the open source components emitted so many log entries that it quickly filled the hard disk. After isolating the problem, I found that a “find” command had written a huge number of entries into a single log file larger than 50 GB, too much data for most systems to handle.

The following is an example log entry in the log file:

find: File system loop detected;
is part of the same file system loop as

After detailed analysis, I traced the problem to a script that builds an index. Because of a permission issue, a “cd” command failed, and the “find” command that followed searched for files starting from the file system root. Normally that’s not too bad, but the find command has the “-L” option, which tells it to follow every symbolic link it sees. Because symbolic links are common everywhere below the file system root, this is a reliable way to start a fire on most systems. To try it out, you can type the following command and look at the output.

# find -L / -name doublecloud.txt
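If you would rather not scan a real root file system, the same effect can be reproduced in a scratch directory. The following is only a minimal sketch: the directory layout and file names are made up for illustration, and the exact wording of the warning assumes GNU find (other implementations may report the loop differently).

```shell
#!/bin/sh
# Build a small scratch tree with a symbolic link that points back at its
# own parent directory, then compare find with and without -L.
# All names below are hypothetical and exist only for this demo.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data/sub"
touch "$demo_dir/data/sub/a.wsp"
ln -s .. "$demo_dir/data/sub/loop"   # sub/loop -> data, which contains sub

# Without -L the link is not followed, so find never enters the loop.
plain_err=$(find "$demo_dir" -name '*.wsp' 2>&1 >/dev/null) || true

# With -L the link is followed; find complains on stderr about the loop.
loop_err=$(find -L "$demo_dir" -name '*.wsp' 2>&1 >/dev/null) || true

printf 'without -L: %s\n' "${plain_err:-<no errors>}"
printf 'with -L:    %s\n' "$loop_err"

rm -rf "$demo_dir"
```

One link produces one warning here. On a real system with many links, and an index job that runs repeatedly, those warnings are what pile up in the log.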

There are different solutions to this problem. The obvious one is to fix the permission problem: if the cd command succeeds, find only works in a directory that we know won’t produce this flood of log data. But that cannot be guaranteed, and in reality anything can happen. When it does, the software may fail, but it should not generate a huge log that could bring down the whole system. In other words, we have to make the software robust.

For that, we could remove the -L option from the find command in the script. This would certainly solve the problem, but it could affect functionality: if the target directory for indexing contains symbolic links, the files behind them would be ignored. So this solution is not ideal.

Another solution is to make the whole script fail, because searching from the root is meaningless anyway. To do that, we can add a “set -e” line, which makes the shell exit as soon as a command such as the cd fails, so the find is never reached:

set -e
echo "[`date`]  building index..."
find -L . -name '*.wsp' | perl -pe 's!^[^/]+/(.+)\.wsp$!$1!; s!/!.!g' > $TMP_INDEX
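An alternative to a script-wide “set -e” is to check the cd explicitly, so the reason for stopping is spelled out right next to the command that depends on it. The sketch below is not taken from Graphite; the function name build_index and the scratch paths are hypothetical, and the perl rewriting step is left out to keep it short.

```shell
#!/bin/sh
# Hypothetical helper: run the index-style find only if the cd succeeded.
# The body runs in a subshell ( ... ) so the caller's directory is untouched.
build_index() (
    target_dir=$1
    out_file=$2
    # Stop here instead of letting find run from an unintended directory.
    cd "$target_dir" 2>/dev/null || {
        echo "cannot enter $target_dir, skipping index build" >&2
        exit 1
    }
    find -L . -name '*.wsp' > "$out_file"
)

# Demo in a scratch directory (made-up names).
work=$(mktemp -d)
mkdir "$work/whisper"
touch "$work/whisper/cpu.wsp"

ok_status=0
build_index "$work/whisper" "$work/index" || ok_status=$?

fail_status=0
build_index "$work/missing" "$work/index2" || fail_status=$?

index_content=$(cat "$work/index")
if [ -e "$work/index2" ]; then stale_index=yes; else stale_index=no; fi

echo "good dir: status=$ok_status index=$index_content"
echo "bad dir:  status=$fail_status index written=$stale_index"

rm -rf "$work"
```

When the directory cannot be entered, the function returns a non-zero status, prints a single line to stderr, and writes no index file at all.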

By this point, you may be curious which open source project this is. It’s Graphite, a data visualization tool. It has a sub-component called whisper for storing data. The script above is from /opt/graphite/bin/build-index.sh, which builds an index from “/var/graphite/storage/whisper”. When it fails, it fills /var/log/uwsgi/app/graphite.log.

Me: Steve Jin, VMware vExpert who authored the VMware VI and vSphere SDK, published by Prentice Hall, and created the de facto open source vSphere Java API while working at VMware engineering. Companies like Cisco, EMC, NetApp, HP, Dell, and VMware are among the users of the API and other tools I developed for their products, internal IT orchestration, and test automation.