Finding the files filling up a Linux server’s filesystems

Make some room

Running out of disk space? On a laptop, that’s an inconvenience. But on a server, you’re looking at outages, shutdowns, and bizarre behavior. In fact, having only a few percent free space can slow input/output (I/O) operations and severely degrade the performance of every running thread and task. So what’s a dev to do? Think big: track down the largest items on the filesystem. That’s the job of largest_item.sh, the small script presented below.

When is it time for a filesystem cleaning?

Tight disk space is a given on most servers. But when I see a report from the command-line df program like this, I know for sure it’s time for a filesystem cleaning:

Filesystem  1K-blocks        Used   Available   Use%   Mounted on
/dev/xvdp   262016000   212496832    49519168    82%   /
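That 82% figure is the one that catches my eye. A few lines of shell can watch it for you; the sketch below relies on GNU df’s --output option, and the 80% threshold and / mount point are illustrative choices, not anything this article prescribes:

```shell
#!/bin/sh
# Sketch: warn when a filesystem's usage crosses a threshold.
# THRESHOLD and the mount point are illustrative values.
THRESHOLD=80
USED=$(df --output=pcent / | tail -1 | tr -d ' %')
if [ "$USED" -ge "$THRESHOLD" ]
then
    echo "/ is ${USED}% full - time for a filesystem cleaning"
fi
```

Dropped into cron, a check like this turns the df habit into an automatic early warning.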

Starting with a script for finding files

You’ll first need to know what is on the virtual or dedicated server’s filesystem. This makes deciding which items to keep, and which to discard, quick work.

I typically set thresholds at 10% and 20% of the total space available on the partition in question. One way to look for files worth deleting is to run largest_item.sh, a small script I wrote; its source appears later in this article.

When I run it as:

largest_item.sh ~claird/work

I see this, for instance, at the command line on one Linux server:

/home/claird/work/clients fills 9103300 kilobytes.
/home/claird/work/clients/Client5 fills 2742752 kilobytes.
/home/claird/work/clients/Client5/Project3 fills 1544156 kilobytes.
/home/claird/work/clients/Client5/Project3/AnnualReport fills 524188 kilobytes.
/home/claird/work/clients/Client5/Project3/AnnualReport/Final2012.1 fills 468032 kilobytes.

This tells me that, of the more than nine gigabytes devoted to /clients/ folders, nearly 30% are taken up by one of them, specifically /Client5/. If I have any hope of clearing meaningful space on this particular filesystem, it will have to involve Client5, and probably Client5/Project3. While a 9 GB folder isn’t considered large by server standards, it frequently happens that a single file in some kind of runaway process grows to hundreds of gigabytes and needs human attention. largest_item.sh is a quick way to locate such surprises.

Avoiding mistakes with largest_item.sh

As a starting point, nothing I’ve seen beats the approach of having the filesystem report on its largest subfolder, the largest subfolder within that subfolder, and so on. Note that the sizes reported here are the contents of subfolders, including all their nested subfolders. Server admins too often make the mistake of declaring that a particular folder is nearly empty, when it actually holds many gigabytes of files — all because the largest files are out of view in subfolders a few levels deeper in the hierarchy. largest_item.sh helps ensure you don’t make that mistake.
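The report-on-the-largest idea is easy to try by hand before committing to the full script. This one-liner, with ~/work standing in for whatever directory you suspect, lists the five largest entries, subfolders and files alike, directly under that directory:

```shell
# Five largest entries (subfolders or files) directly under a
# directory, sized in kilobytes; ~/work is a stand-in for your path.
du -sk ~/work/* 2>/dev/null | sort -rn | head -5
```

largest_item.sh essentially repeats this survey one level down at a time, always following the winner.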

You can run largest_item.sh for yourself by saving the following source into a shell script, and, of course, setting its execution bit:

    #!/bin/bash
    #
    # Source of largest_item.sh
    #
    # Execution of largest_item.sh with no command-line arguments
    # reports on the contents of the current directory.  If you
    # invoke "largest_item.sh $SOME_FOLDER", you'll receive a
    # report on $SOME_FOLDER.

    # Simplified version:  not optimized for deep hierarchies,
    # doesn't report on permission errors, and so on.

    if [ "$1" = "" ]
    then
        CWD=$(pwd)
    else
        CWD=$1
    fi

    while [ -d "$CWD" ]
    do
        cd "$CWD" || break
        RESULT=$(du -sk -- * 2>/dev/null | sort -rn | head -1)
        if [ "$RESULT" = "" ]
        then
            # Empty or unreadable directory:  nothing to descend
            # into, so stop rather than loop forever.
            break
        fi
        SIZE=$(echo "$RESULT" | awk '{print $1}')
        # du separates size and name with a tab; cut keeps names
        # containing spaces intact, where awk would truncate them.
        FOLDER=$(echo "$RESULT" | cut -f2-)
        CWD="$CWD/$FOLDER"
        echo "$CWD fills $SIZE kilobytes."
    done

You can run largest_item.sh on any Unix-like operating system. If some of your folders hold thousands, or tens of thousands, of files, largest_item.sh (and any other tool you’re likely to use) might take up to a couple of minutes to return.
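Keep in mind that largest_item.sh follows a single trail of largest folders. When you suspect a runaway file hiding somewhere off that trail, the standard find command complements it; the 100M cutoff below is just an illustrative threshold:

```shell
# Every individual file over 100 megabytes anywhere under the
# starting directory; 100M is an illustrative cutoff.
find ~/work -type f -size +100M 2>/dev/null
```

Where largest_item.sh answers “which branch is heaviest?”, this answers “which single files are oversized?” — the two questions together cover most cleanups.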

Now, largest_item.sh doesn’t dictate your use of it. In the example above, Final2012.1 is the largest item in the largest folder in the project. That doesn’t mean I’ll necessarily delete it today. One of the benefits of largest_item.sh is that I can focus my attention a level or two up the filesystem hierarchy, when that is appropriate. With a clear picture of which folders are largest, I might decide to open the /home/claird/work/clients/Client5/Project3 directory and examine it in more detail.

Practice makes perfect

In a perfect world, your servers would all be running so smoothly that you’d never have to intervene “manually.” Everything would be in its place, and automatic processes would ensure that you periodically make necessary space, archive off old files, and keep your entire host within safe operating limits. But no human achieves all that without practice. The next time your server exceeds the capacity threshold you set for it, use largest_item.sh to make your way quickly to the files whose removal will clear substantial space for you.

Got tips on the topic of disk space? Share your experience or query the community with your questions in the comments below.

Image by: nathangibbs via Compfight cc

Cameron Laird
A full-time programmer and project leader, Cameron Laird also has regularly reported on technical topics since he began reading farm commodity prices for early-morning commercial radio decades ago. From his base on the Texas Gulf Coast, he studies logfiles and support tickets in at least half a dozen time zones.