Harmful Big Data
In one of my recent projects, I ran into a “big data” issue. One of the open source components emitted so many logs that it quickly filled the hard disk. After isolating the problem, I found a huge number of log entries from the “find” command in a single log file whose size exceeded 50 GB, which is too much data for most systems to handle.
The following is an example log entry in the log file:
find: File system loop detected; `./sys/devices/platform/reg-dummy/subsystem/devices/serial8250/tty/ttyS2/subsystem/console/subsystem' is part of the same file system loop as `./sys/devices/platform/reg-dummy/subsystem/devices/serial8250/tty/ttyS2/subsystem'.
After detailed analysis, I traced the problem down to the script that builds an index. Because of a permission issue, a “cd” command failed, and the “find” command that followed it searched from the current directory instead, which was the file system root. Normally that’s not too bad, but the find command was run with the “-L” option, which tells it to follow every symbolic link it encounters. Because symbolic links are very common under the file system root, and some of them form loops, this is a sure way to start a fire on most systems. To try it out, you can type the following command and look at the output.
# find -L / -name doublecloud.txt
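If you would rather not scan a real root file system, the same effect can be reproduced in a throwaway directory. The following is only a minimal sketch: the sandbox layout and the variable names are made up for illustration, and the loop message on stderr is what GNU find prints; other find implementations may report the loop differently or stay silent.

```shell
#!/bin/sh
# Build a tiny tree whose symbolic link points back at an ancestor directory,
# then compare find with and without -L. All names here are hypothetical.
sandbox=$(mktemp -d)
mkdir "$sandbox/data"
touch "$sandbox/data/doublecloud.txt"
ln -s .. "$sandbox/data/loop"    # data/loop -> the sandbox itself, i.e. a loop

# Without -L the link is not followed, so the file is found exactly once.
plain_count=$(find "$sandbox" -name doublecloud.txt 2>/dev/null | wc -l | tr -d ' ')

# With -L the link is followed; count the complaints written to stderr.
loop_msgs=$(find -L "$sandbox" -name doublecloud.txt 2>&1 >/dev/null | wc -l | tr -d ' ')

echo "matches without -L: $plain_count"
echo "stderr lines with -L: $loop_msgs"

rm -rf "$sandbox"
```

On a real root file system, every loop under /sys and similar trees produces one such stderr line, which is how a single run can write gigabytes of log.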
There are several possible solutions to this problem. The obvious one is to fix the permission problem: if the cd command succeeds, find works only in a directory that we know won’t produce this huge amount of log data. But that cannot be guaranteed, and in reality anything can happen. When it does, the software may fail, but it should not generate a huge log that can potentially bring down the whole system. In other words, we have to make the software robust.
For that, we can remove the -L option from the find command in the script. This would certainly solve the problem, but it could affect functionality: if there are symbolic links in the directory being indexed, the files behind them will be ignored. So this solution is not ideal.
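To make that trade-off concrete, here is a small sketch of what dropping -L would cost. The directory layout (a “whisper” directory containing a link to an “elsewhere” directory) is purely hypothetical and is not taken from a real Graphite installation.

```shell
#!/bin/sh
# Hypothetical layout: one .wsp file stored directly in the indexed directory,
# and one that is reachable only through a symbolic link.
root=$(mktemp -d)
mkdir "$root/whisper" "$root/elsewhere"
touch "$root/whisper/a.wsp" "$root/elsewhere/b.wsp"
ln -s "$root/elsewhere" "$root/whisper/linked"

cd "$root/whisper" || exit 1
no_follow=$(find . -name '*.wsp' | wc -l | tr -d ' ')    # link is not followed
follow=$(find -L . -name '*.wsp' | wc -l | tr -d ' ')    # link is followed

echo "without -L: $no_follow file(s), with -L: $follow file(s)"

cd / && rm -rf "$root"
```

This prints “without -L: 1 file(s), with -L: 2 file(s)”: the file behind the link silently drops out of the index once -L is removed.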
Another solution is to make the whole script fail, because searching from the root is meaningless anyway. To do that, we can add a one-line “set -e” command as follows:
set -e
cd $WHISPER_DIR
touch $INDEX_FILE
echo "[`date`] building index..."
find -L . -name '*.wsp' | perl -pe 's!^[^/]+/(.+)\.wsp$!$1!; s!/!.!g' > $TMP_INDEX
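The effect of “set -e” can be checked without touching the real script. The sketch below runs two one-line scripts in child shells and compares them; /nonexistent-whisper-dir is a made-up path that is assumed not to exist, and the echo stands in for the find command.

```shell
#!/bin/sh
# Each child shell tries to cd into a missing directory and then reaches
# (or does not reach) the line where find would run.
# /nonexistent-whisper-dir is a hypothetical path assumed not to exist.

# Without set -e, the failed cd is ignored and the next command still runs.
without=$(sh -c 'cd /nonexistent-whisper-dir 2>/dev/null; echo "find would run in $PWD"')

# With set -e, the failed cd aborts the child shell, so nothing is printed.
with=$(sh -c 'set -e; cd /nonexistent-whisper-dir 2>/dev/null; echo "find would run in $PWD"') || true

echo "without set -e: ${without:-<nothing, script aborted>}"
echo "with set -e:    ${with:-<nothing, script aborted>}"
```

A narrower alternative, if aborting on every error is more than you want, would be to guard only the cd itself, for example with cd "$WHISPER_DIR" || exit 1. That is a suggestion for illustration, not something the original script does.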
Having read this far, you may be curious what the open source project is. It’s Graphite, a data visualization tool. It has a sub-component called whisper for storing data. The above script is from /opt/graphite/bin/build-index.sh, which builds an index from “/var/graphite/storage/whisper”. When it fails, it fills /var/log/uwsgi/app/graphite.log.