There’s no nicer way to say it… I’m running out of disk space. I have three options: buy a larger hard drive, delete some files to free up space, or compress some of the data. Buying a larger hard drive is the best option in the long term, but “in the long term, we’re all dead”, and deleting files is painful for me… I’m a serial pack rat. So I decided to explore compression as a way out of my disk space headaches.
First, I had to find the most efficient compression algorithm, a task I soon found out is not easy. I read several blogs and websites, and everybody had something good to say about their favorite algorithm. But one thing was clear: GZIP, BZIP2 and LZMA were leading the pack. To satisfy my own curiosity and determine for myself which was the most efficient, I decided to run some benchmarks. To be honest, I’d been hearing good things about LZMA, so I was hoping it would live up to the hype.
These benchmarks were conducted on a 2.53 GHz processor with 2 GB of RAM and a 5400 RPM Seagate Barracuda IDE hard disk. I ran every tool at its maximum compression level (-9).
Version information:
gzip 1.3.12
bzip2 1.0.5
LZMA 4.32.0beta3
LZMA SDK 4.43
For starters, I threw a 1 GiB file containing nothing but binary zeros at them.
$ dd if=/dev/zero of=test.zero bs=1024M count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 187.978 s, 5.7 MB/s
Now the fun starts.
GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 test.zero > test.gz
12.36 seconds CPU 99%
BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 test.zero > test.bz2
32.07 seconds CPU 98%
LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 test.zero > test.lzma
873.79 seconds CPU 96%
So what kind of compression ratios are we talking about here?
$ ls -lh test.zero*
-rw-r--r-- 1 kafui kafui 1.0G 2009-03-25 12:01 test.zero
-rw-r--r-- 1 kafui kafui 1018K 2009-03-25 12:51 test.gz
-rw-r--r-- 1 kafui kafui 148K 2009-03-25 13:10 test.lzma
-rw-r--r-- 1 kafui kafui 785 2009-03-25 12:52 test.bz2
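The sizes in the listing translate into compression ratios with a little arithmetic. The sketch below is not part of the original benchmark; `ratio` is a hypothetical helper name, and because `ls -lh` rounds the K figures, the results are only approximate (roughly 1,000:1 for gzip, 7,000:1 for lzma and over a million to one for bzip2).

```shell
#!/bin/sh
# ratio ORIGINAL_BYTES COMPRESSED_BYTES
# Prints how many times smaller the compressed file is (original / compressed).
# "ratio" is a hypothetical helper, not a standard command.
ratio() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.1f\n", o / c }'
}

# Sizes taken from the ls listing; the K values are rounded by ls -h,
# so treat the printed ratios as approximate.
orig=1073741824                                   # test.zero, 1 GiB
echo "gzip  $(ratio "$orig" $((1018 * 1024))):1"  # test.gz,   1018K
echo "lzma  $(ratio "$orig" $((148 * 1024))):1"   # test.lzma, 148K
echo "bzip2 $(ratio "$orig" 785):1"               # test.bz2,  785 bytes
```

For exact figures, the byte counts from `ls -l` (without `-h`) can be passed to `ratio` instead.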
GZIP squeezed 1 gigabyte into about 1 megabyte in about 12 seconds of CPU time… nice. LZMA’s compression ratio was very impressive: it squeezed 1 gigabyte into 148 kilobytes, BUT it needed 873.79 seconds… that’s almost 15 minutes. BZIP2 was absolutely cool… 1 GiB down to 785 bytes in 32 seconds! The clear winner here is BZIP2: it has the highest compression ratio with acceptable time requirements. Now on to tests with real data.
For the next test, I decided to compress the contents of my /opt folder. To simplify things, I created a tar archive of the folder first.
$ sudo tar -cf opt.tar /opt
[sudo] password for kafui:
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
$ ls -lh opt.tar
-rw-r--r-- 1 root root 120M 2009-03-25 15:48 opt.tar
So we’re working with 120 MB of data. On to the tests:
GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 opt.tar > opt.tar.gz
19.42 seconds CPU 89%
BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 opt.tar > opt.tar.bz2
30.76 seconds CPU 93%
LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 opt.tar > opt.tar.lzma
132.21 seconds CPU 92%
$ ls -lh opt.tar*
-rw-r--r-- 1 kafui kafui 120M 2009-03-25 15:48 opt.tar
-rw-r--r-- 1 kafui kafui 39M 2009-03-25 15:56 opt.tar.gz
-rw-r--r-- 1 kafui kafui 36M 2009-03-25 16:09 opt.tar.bz2
-rw-r--r-- 1 kafui kafui 25M 2009-03-25 16:16 opt.tar.lzma
Once again, GZIP was the fastest and got 120 MB down to 39 MB in 19.42 seconds. BZIP2 reduced 120 MB to 36 MB but took 11.34 seconds longer than GZIP. LZMA delivered the best compression at 25 MB but took 132.21 seconds. It appears there are trade-offs with each compression method. GZIP is fast, but its compression ratio is the lowest of the three. LZMA (depending on the data) delivers the best compression ratio but takes too much time to do so. BZIP2 strikes a balance between compression and speed… it’s way faster than LZMA and, as the zero-filled file showed, can sometimes deliver better compression. LZMA just does not live up to the hype.
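To repeat the three runs on other data without retyping them, they can be wrapped in a small function. This is a minimal sketch, not the script used for the numbers above: `bench` is a hypothetical name, it assumes GNU time is installed at /usr/bin/time (it prints "n/a" for the time otherwise), and it assumes each tool accepts `-c9` the way gzip, bzip2 and lzma do.

```shell
#!/bin/sh
# bench FILE TOOL...
# Compresses FILE with each TOOL at level 9 and prints one line per tool:
#   tool  user-CPU-seconds  compressed-bytes
# "bench" is a hypothetical helper. The compressed output goes to a
# temporary file that is deleted once its size has been measured.
bench() {
    file=$1
    shift
    for tool in "$@"; do
        # Report tools that are not installed instead of failing.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not-installed"
            continue
        fi
        out=$(mktemp)
        # Probe for GNU time: other time implementations reject -f.
        if /usr/bin/time -f "%U" true >/dev/null 2>&1; then
            # GNU time reports on stderr, which is captured into $secs;
            # the compressor's stdout is redirected to $out.
            secs=$(/usr/bin/time -f "%U" "$tool" -c9 "$file" 2>&1 >"$out")
        else
            "$tool" -c9 "$file" >"$out"
            secs="n/a"
        fi
        bytes=$(wc -c <"$out")
        rm -f "$out"
        echo "$tool $secs $bytes"
    done
}
```

A call such as `bench opt.tar gzip bzip2 lzma` would then print one line per tool. Like the runs above, it reports user CPU time (`%U`), not wall-clock time.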
Unfortunately, these benchmarks were of no use to me because about 140GiB of my data is made up of AVIs, PNGs and JPEGs. These formats are already compressed so there isn’t much room for further compression. But for what it’s worth, I gave the algorithms a spin anyway.
$ ls -lh The.Big.Bang.Theory.S01E10.avi
-rwxrwxrwx 1 kafui kafui 175M 2008-04-18 20:14 The.Big.Bang.Theory.S01E10.avi
GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.gz
10.94 seconds CPU 78%
BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.bz2
55.15 seconds CPU 94%
LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.lzma
138.74 seconds CPU 93%
$ ls -lh The.Big.Bang.Theory.S01E10.avi*
-rwxr-xr-x 1 kafui kafui 175M 2009-03-25 16:34 The.Big.Bang.Theory.S01E10.avi
-rw-r--r-- 1 kafui kafui 173M 2009-03-25 16:35 The.Big.Bang.Theory.S01E10.avi.gz
-rw-r--r-- 1 kafui kafui 173M 2009-03-25 16:39 The.Big.Bang.Theory.S01E10.avi.bz2
-rw-r--r-- 1 kafui kafui 174M 2009-03-25 16:43 The.Big.Bang.Theory.S01E10.avi.lzma
GZIP and BZIP2 both got the 175 MB episode of The Big Bang Theory down to 173 MB; BZIP2 of course took 44.21 seconds longer. And LZMA got it down by only 1 MB, and took 138.74 seconds to do it. As you can see, it doesn’t make much sense for me to compress my videos and pictures… not with those compression ratios. So it seems I’ll just have to cough up the cedis for a new hard drive.
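A cheap way to spot files like these before spending minutes on them is to compress only a small sample and see how much it shrinks. The sketch below is an illustration, not something used for the numbers above: `sample_saving` is a hypothetical helper, and the 1 MiB sample size is an arbitrary assumption (the start of a media file may not be representative of the whole).

```shell
#!/bin/sh
# sample_saving FILE
# Compresses only the first 1 MiB of FILE with gzip -9 and prints the
# percentage of space saved on that sample. A value near 0 (or slightly
# negative) means the data is probably already compressed.
# "sample_saving" is a hypothetical helper, not a standard command.
sample_saving() {
    # Size of the sample actually read (smaller than 1 MiB for short files).
    sample=$(dd if="$1" bs=1048576 count=1 2>/dev/null | wc -c)
    # Size of the same sample after gzip -9.
    packed=$(dd if="$1" bs=1048576 count=1 2>/dev/null | gzip -c9 | wc -c)
    awk -v s="$sample" -v p="$packed" 'BEGIN {
        if (s == 0) { print "0.0"; exit }   # empty file: nothing to save
        printf "%.1f\n", (1 - p / s) * 100
    }'
}
```

Running `sample_saving The.Big.Bang.Theory.S01E10.avi` before the full benchmark would give a quick hint about whether the 10 to 140 seconds of compression are worth it.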


Pirate, shame on you!
(anyway it is a funny show).
Nice benchmarks. Why not try rar too? I know it isn’t free, but it would be nice to see how it stands against the others.
By: jjss on March 25, 2009
at 11:56 pm
Nice text
And, as in the post above, it would be nice to see 7z too
regards
By: mudrii on March 26, 2009
at 2:02 am
Sorry to say, but it was pointless from the very beginning. The lossy compression algorithms currently used on media files are already very efficient at compressing data. Otherwise they wouldn’t be used at all. At least, I hope you had a good time testing the compression algorithms.
By: Tosuja on March 26, 2009
at 1:00 pm
7z is more a container than a special algorithm; in fact, it uses LZMA.
Another important point of view is missing here: memory consumption for unpacking. LZMA is very good in that respect and thus very suitable for embedded environments.
By: Sleep_Walker on March 26, 2009
at 1:33 pm
@jjss LOL
I’ll try out rar over the weekend.
@mudrii I think Sleep_Walker’s response hits the nail on the head. The 7z format is basically an implementation of the LZMA algorithm.
@Tosuja Yeah, I guess I was bored and these benchmarks allowed me to kill an hour or so.
@Sleep_Walker I probably will have to update the post with memory benchmarks.
By: Odzangba on March 26, 2009
at 7:57 pm
LZMA also has a very low decompression time compared to its compression time, so it’s very well suited for distribution, where you compress once and decompress many times.
By: Thomas on May 13, 2009
at 11:02 pm
Your test at -c9 isn’t really all that informative, as each algorithm has a different tradeoff of compression ratio vs. time to compress. LZMA provides better compression ratios; can you change the parameters so that it takes less time than bzip2 -c9, AND still provides better compression ratios?
Also as Thomas mentioned above, LZMA is a much faster decompressor than bzip2. Read-only data only needs to be compressed once, but may be decompressed many times.
By: Max on July 23, 2009
at 8:52 pm
@Tosuja: Of course, in the real world, one wouldn’t use gzip or bzip2 to compress media files which are already compressed, but they serve as one good source of data for testing these compression algorithms. More complete testing would also include data similar to the intended real-world application. For example, compressing ASCII text in httpd server log files is different from compressing binary executables.
@Odzangba : Another consideration is the time it takes to UNcompress the data. I am working on a small embedded system that uses a PPC-based processor at 400 MHz that barely hits 500 MIPS (similar to an original Pentium at 200 MHz from about 1995). This system runs Ubuntu Linux off of a 4 GB flash drive. We keep a backup copy of the root filesystem in a separate partition on the flash drive. This backup image allows us to recover the system if the root filesystem becomes corrupt. We can make the recovery image about 40% smaller by compressing it, thus saving almost 600 MB, which is a big chunk out of our 4 GB total.
Since the system has such a small flash drive for storage, you might guess that we would be most interested in the best compression of the recovery image. In reality, bzip2 does not save much more than gzip. The bzip2-compressed copy of the recovery image was barely 4% smaller than the gzip-compressed copy. It turns out that the speed to uncompress the recovery image is a more important factor to us. Remember, our processor is slow! And if push came to shove, adding a bigger flash drive is far cheaper than getting a faster processor (which would also consume more power and add more heat).
While comparing the time it took to uncompress the recovery image, I found that gunzip is over 2 times faster than bunzip2. It takes almost an hour to restore the system using bunzip2, whereas it takes less than 25 minutes to restore the system using gunzip. Granted, our recovery feature may be used very rarely (hopefully never!), but which version do you think we are going to ship to our customer?
By: Noah Spurrier on July 31, 2009
at 8:43 am