There’s no nicer way to say it… I’m running out of disk space. I have three options: buy a larger hard drive, delete some files to free up space, or compress some of the data. Buying a larger hard drive is the best option in the long term but “in the long term, we’re all dead”
and deleting files is painful for me… I’m a serial pack rat. So I decided to explore compression as a way out of my disk space headaches. First, I had to find the most efficient compression algorithm, a task I soon found out is not easy. I read several blogs and websites and everybody had something good to say about their favorite algorithm. But one thing was clear, the GZIP, BZIP2 and LZMA compression algorithms were leading the pack. To satisfy my own curiosity and determine for myself which was the most efficient, I decided to run some benchmarks. To be honest, I’ve been hearing some good things about the LZMA compression algorithm so I was hoping it would live up to the hype.
These benchmarks were conducted on a 2.53 GHz processor with 2GB RAM and a 5400 RPM Seagate Barracuda IDE hard disk. I also throttled the algorithms for maximum compression.
Version information:
gzip 1.3.12
bzip2 1.0.5
LZMA 4.32.0beta3
LZMA SDK 4.43
For starters, I threw an empty 1GiB file with nothing in it but binary zeros at them.
$ dd if=/dev/zero of=test.zero -bs=1024M -count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 187.978 s, 5.7 MB/s
Now the fun starts.
GZIP
$ /usr/bin/time -f “%U seconds CPU %P” gzip -c9 test.zero > test.gz
12.36 seconds CPU 99%
BZIP2
$ /usr/bin/time -f “%U seconds CPU %P” bzip2 -c9 test.zero > test.bz2
32.07 seconds CPU 98%
LZMA
$ /usr/bin/time -f “%U seconds CPU %P” lzma -c9 test.zero > test.lzma
873.79 seconds CPU 96%
So what kind of compression ratios are we talking about here?
$ ls -lh test.zero*
-rw-r–r– 1 kafui kafui 1.0G 2009-03-25 12:01 test.zero
-rw-r–r– 1 kafui kafui 1018K 2009-03-25 12:51 test.gz
-rw-r–r– 1 kafui kafui 148K 2009-03-25 13:10 test.lzma
-rw-r–r– 1 kafui kafui 785 2009-03-25 12:52 test.bz2
GZIP squeezed 1 gigabyte into about 1 megabyte in about 12 seconds… nice. LZMA’s compression ratio was very impressive; it squeezed 1 gigabyte into 148 kilobytes BUT in 873.79 seconds… that’s almost 15 minutes. BZIP2 was absolutely cool… 1Gib down to 785 bytes in 32 seconds! The clear winner here however is BZIP2. It has the highest compression ratio with acceptable time requirements. Now on to tests with real data.
For the next test, I decided to compress the contents of my /opt folder. To simplify things, I created a tar archive of the folder first.
$ sudo tar -cf opt.tar /opt
[sudo] password for kafui:
tar: Removing leading `/’ from member names
tar: Removing leading `/’ from hard link targets
$ ls -lh opt.tar
-rw-r–r– 1 root root 120M 2009-03-25 15:48 opt.tar
So we’re working with 120MB of data. On to the tests:
GZIP
$ /usr/bin/time -f “%U seconds CPU %P” gzip -c9 opt.tar > opt.tar.gz
19.42 seconds CPU 89%
BZIP2
$ /usr/bin/time -f “%U seconds CPU %P” bzip2 -c9 opt.tar > opt.tar.bz2
30.76 seconds CPU 93%
LZMA
/usr/bin/time -f “%U seconds CPU %P” lzma -c9 opt.tar > opt.tar.lzma
132.21 seconds CPU 92%
$ ls -lh opt.tar*
-rw-r–r– 1 kafui kafui 120M 2009-03-25 15:48 opt.tar
-rw-r–r– 1 kafui kafui 39M 2009-03-25 15:56 opt.tar.gz
-rw-r–r– 1 kafui kafui 36M 2009-03-25 16:09 opt.tar.bz2
-rw-r–r– 1 kafui kafui 25M 2009-03-25 16:16 opt.tar.lzma
Once again, GZIP was the fastest and got 120MB down to 39MB in 19.42 seconds. BZIP2 reduced 120MB to 36MB but took 11.34 seconds longer than GZIP. LZMA delivered the best compression with 25MB but took 132.21 seconds. It appears there are trade-offs with each compression method. GZIP is fast but its compression ratio is the lowest of the three. LZMA (depending on the data) delivers the most efficient compression ratio but takes too much time to do so. BZIP2 strikes a balance between efficient compression and speed… it’s way faster than LZMA and can actually deliver better compression. LZMA just does not live up to the hype.
Unfortunately, these benchmarks were of no use to me because about 140GiB of my data is made up of AVIs, PNGs and JPEGs. These formats are already compressed so there isn’t much room for further compression. But for what it’s worth, I gave the algorithms a spin anyway.
$ ls -lh The.Big.Bang.Theory.S01E10.avi
-rwxrwxrwx 1 kafui kafui 175M 2008-04-18 20:14 The.Big.Bang.Theory.S01E10.avi
GZIP
$ /usr/bin/time -f “%U seconds CPU %P” gzip -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.gz
10.94 seconds CPU 78%
BZIP2
$ /usr/bin/time -f “%U seconds CPU %P” bzip2 -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.bz2
55.15 seconds CPU 94%
LZMA
$ /usr/bin/time -f “%U seconds CPU %P” lzma -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.lzma
138.74 seconds CPU 93%
$ ls -lh The.Big.Bang.Theory.S01E10.avi*
-rwxr-xr-x 1 kafui kafui 175M 2009-03-25 16:34 The.Big.Bang.Theory.S01E10.avi
-rw-r–r– 1 kafui kafui 173M 2009-03-25 16:35 The.Big.Bang.Theory.S01E10.avi.gz
-rw-r–r– 1 kafui kafui 173M 2009-03-25 16:39 The.Big.Bang.Theory.S01E10.avi.bz2
-rw-r–r– 1 kafui kafui 174M 2009-03-25 16:43 The.Big.Bang.Theory.S01E10.avi.lzma
GZIP and BZIP both got the 175MB episode of The Big Bang Theory down to 173MB; BZIP2 of course took 44.12 seconds longer. And LZMA got it down by only 1MB but in 138.74 seconds. As you can see, it doesn’t make much sense for me to compress my videos and pictures… not with those compression ratios. So it seems I’ll just have to cough up the cedis for a new hard drive.