Posted by: Odzangba | March 25, 2009

GZIP vs. BZIP2 vs. LZMA

There’s no nicer way to say it… I’m running out of disk space. I have three options: buy a larger hard drive, delete some files to free up space, or compress some of the data. Buying a larger hard drive is the best option in the long term but “in the long term, we’re all dead” 😀 and deleting files is painful for me… I’m a serial pack rat. So I decided to explore compression as a way out of my disk space headaches. First, I had to find the most efficient compression algorithm, a task I soon found out is not easy. I read several blogs and websites, and everybody had something good to say about their favorite algorithm. But one thing was clear: the GZIP, BZIP2 and LZMA compression algorithms were leading the pack. To satisfy my own curiosity and determine for myself which was the most efficient, I decided to run some benchmarks. To be honest, I had been hearing good things about the LZMA compression algorithm, so I was hoping it would live up to the hype.

These benchmarks were conducted on a 2.53 GHz dual-core processor with 2GB RAM and a 5400 RPM Seagate Barracuda IDE hard disk. I also set each algorithm to its maximum compression level (-9).

Version information:
gzip 1.3.12
bzip2 1.0.5
LZMA 4.32.0beta3
LZMA SDK 4.43

For starters, I threw a 1GiB file containing nothing but binary zeros at them.

$ dd if=/dev/zero of=test.zero bs=1024M count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 187.978 s, 5.7 MB/s

Now the fun starts.

GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 test.zero > test.gz
12.36 seconds CPU 99%

BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 test.zero > test.bz2
32.07 seconds CPU 98%

LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 test.zero > test.lzma
873.79 seconds CPU 96%

So what kind of compression ratios are we talking about here?

$ ls -lh test*
-rw-r--r-- 1 kafui kafui  1.0G 2009-03-25 12:01 test.zero
-rw-r--r-- 1 kafui kafui 1018K 2009-03-25 12:51 test.gz
-rw-r--r-- 1 kafui kafui  148K 2009-03-25 13:10 test.lzma
-rw-r--r-- 1 kafui kafui   785 2009-03-25 12:52 test.bz2

GZIP squeezed 1 gigabyte into about 1 megabyte in about 12 seconds… nice. LZMA’s compression ratio was very impressive; it squeezed 1 gigabyte into 148 kilobytes, BUT it took 873.79 seconds… that’s almost 15 minutes. BZIP2 was absolutely cool… 1 GiB down to 785 bytes in 32 seconds! The clear winner here is BZIP2: it has the highest compression ratio with acceptable time requirements. Now on to tests with real data.
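
For anyone who wants to reproduce these runs, the whole benchmark boils down to a one-line loop; a minimal sketch, assuming gzip, bzip2 and lzma are all on the PATH (the output extensions are non-standard but keep the loop simple):

# run all three compressors at maximum compression over the same input
$ for tool in gzip bzip2 lzma; do /usr/bin/time -f "$tool: %U seconds CPU %P" $tool -c9 test.zero > test.zero.$tool; done
$ ls -lh test.zero.*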

For the next test, I decided to compress the contents of my /opt folder. To simplify things, I created a tar archive of the folder first.

$ sudo tar -cf opt.tar /opt
[sudo] password for kafui:
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets

$ ls -lh opt.tar
-rw-r--r-- 1 root root 120M 2009-03-25 15:48 opt.tar

So we’re working with 120MB of data. On to the tests:

GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 opt.tar > opt.tar.gz
19.42 seconds CPU 89%

BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 opt.tar > opt.tar.bz2
30.76 seconds CPU 93%

LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 opt.tar > opt.tar.lzma
132.21 seconds CPU 92%

$ ls -lh opt.tar*
-rw-r--r-- 1 kafui kafui 120M 2009-03-25 15:48 opt.tar
-rw-r--r-- 1 kafui kafui  39M 2009-03-25 15:56 opt.tar.gz
-rw-r--r-- 1 kafui kafui  36M 2009-03-25 16:09 opt.tar.bz2
-rw-r--r-- 1 kafui kafui  25M 2009-03-25 16:16 opt.tar.lzma

Once again, GZIP was the fastest and got 120MB down to 39MB in 19.42 seconds. BZIP2 reduced 120MB to 36MB but took 11.34 seconds longer than GZIP. LZMA delivered the best compression at 25MB but took 132.21 seconds. It appears there are trade-offs with each compression method. GZIP is fast but its compression ratio is the lowest of the three. LZMA (depending on the data) delivers the most efficient compression but takes far too much time to do so. BZIP2 strikes a balance between efficient compression and speed… it’s way faster than LZMA and, on some data (like the zero-filled file above), can actually deliver better compression. LZMA just does not live up to the hype.
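
As an aside (this wasn’t part of the benchmarks above), GNU tar can invoke the compressor itself, which avoids keeping the intermediate .tar around; a minimal sketch, assuming a tar build new enough to know the --lzma option:

$ sudo tar -czf opt.tar.gz /opt          # gzip via tar's -z
$ sudo tar -cjf opt.tar.bz2 /opt         # bzip2 via tar's -j
$ sudo tar --lzma -cf opt.tar.lzma /opt  # lzma, if this tar build supports it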

Unfortunately, these benchmarks were of no use to me because about 140GiB of my data is made up of AVIs, PNGs and JPEGs. These formats are already compressed so there isn’t much room for further compression. But for what it’s worth, I gave the algorithms a spin anyway.

$ ls -lh The.Big.Bang.Theory.S01E10.avi
-rwxrwxrwx 1 kafui kafui 175M 2008-04-18 20:14 The.Big.Bang.Theory.S01E10.avi

GZIP
$ /usr/bin/time -f "%U seconds CPU %P" gzip -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.gz
10.94 seconds CPU 78%

BZIP2
$ /usr/bin/time -f "%U seconds CPU %P" bzip2 -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.bz2
55.15 seconds CPU 94%

LZMA
$ /usr/bin/time -f "%U seconds CPU %P" lzma -c9 The.Big.Bang.Theory.S01E10.avi > The.Big.Bang.Theory.S01E10.avi.lzma
138.74 seconds CPU 93%

$ ls -lh The.Big.Bang.Theory.S01E10.avi*
-rwxr-xr-x 1 kafui kafui 175M 2009-03-25 16:34 The.Big.Bang.Theory.S01E10.avi
-rw-r--r-- 1 kafui kafui 173M 2009-03-25 16:35 The.Big.Bang.Theory.S01E10.avi.gz
-rw-r--r-- 1 kafui kafui 173M 2009-03-25 16:39 The.Big.Bang.Theory.S01E10.avi.bz2
-rw-r--r-- 1 kafui kafui 174M 2009-03-25 16:43 The.Big.Bang.Theory.S01E10.avi.lzma

GZIP and BZIP2 both got the 175MB episode of The Big Bang Theory down to 173MB; BZIP2, of course, took 44.21 seconds longer. And LZMA got it down by only 1MB, but in 138.74 seconds. As you can see, it doesn’t make much sense for me to compress my videos and pictures… not with those compression ratios. So it seems I’ll just have to cough up the cedis for a new hard drive. 😦
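
(A closing aside: one cheap way to check whether a big file is worth compressing at all, before burning CPU on it, is to compress only a small sample and look at the ratio. A rough sketch; the 10 MiB cut-off is arbitrary:)

# compress just the first 10 MiB; if the sample barely shrinks, the full file probably won't either
$ head -c 10485760 The.Big.Bang.Theory.S01E10.avi | gzip -9 | wc -c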


Responses

  1. Pirate, shame on you! 😛 (anyway, it is a funny show).
    Nice benchmarks. Why not try rar too? I know it isn’t free, but it would be nice to see how it stands against the others.

  2. Nice text

    And as in the above post, it would be nice to see 7z too 😉

    regards

  3. Sorry to say, but it was pointless from the very beginning. Current lossy compression algorithms used on media files are already very efficient (as the name suggests) at data compression; otherwise they wouldn’t be used at all. At least, I hope you had a good time testing the compression algorithms 🙂

  4. 7z is more a container than a particular algorithm; in fact, it uses LZMA.
    I’m also missing another important point of view here: memory consumption for unpacking. LZMA is very good in that respect and thus very suitable for embedded environments.

  5. @jjss LOL 😀 I’ll try out rar over the weekend. 🙂

    @mudrii I think Sleep_Walker’s response hits the nail on the head. The 7z format is basically an implementation of the LZMA algorithm.

    @Tosuja Yeah, I guess I was bored and these benchmarks allowed me to kill an hour or so. 😀

    @Sleep_Walker I probably will have to update the post with memory benchmarks. 🙂

  6. LZMA also has a very low decompression time compared to its compression time, so it’s very well suited for distribution, where you compress once and decompress many times.

  7. Your test at -c9 isn’t really all that informative, as each algorithm has a different tradeoff of compression ratio vs. time to compress. LZMA provides better compression ratios; can you change the parameters so that it takes less time than bzip2 -c9 AND still provides better compression ratios?

    Also as Thomas mentioned above, LZMA is a much faster decompressor than bzip2. Read-only data only needs to be compressed once, but may be decompressed many times.

  8. @Tosuja: Of course, in the real world, one wouldn’t use gzip or bzip2 to compress media files which are already compressed, but they serve as one good source of data for testing these compression algorithms. More complete testing would also include data similar to the intended application in the real-world. For example, compressing ASCII text in httpd server log files is different than compressing binary executables.

    @Odzangba : Another consideration is the time it takes to UNcompress the data. I am working on a small embedded system that uses a PPC-based processor at 400 MHz that barely hits 500 MIPS (similar to an original Pentium at 200 MHz from about 1995). This system runs Ubuntu Linux off of a 4 GB flash drive. We keep a backup copy of the root filesystem in a separate partition on the flash drive. This backup image allows us to recover the system if the root filesystem becomes corrupt. We can make the recovery image about 40% smaller by compressing it, thus saving almost 600 MB, which is a big chunk out of our 4 GB total.

    Since the system has such a small flash drive for storage you might guess that we would be most interested in the best compression of the recovery image. In reality, bzip2 does not save much more than gzip. The bzip2 compressed copy of recovery image was barely 4% smaller than the gzip compressed copy. It turns out that the speed to uncompress the recovery image is a more important factor to us. Remember, our processor is slow! And if push came to shove, adding a bigger flash drive is far cheaper than getting a faster processor (which would also consume more power and add more heat).

    While comparing the time it took to uncompress the recovery image, I found that gunzip is over 2 times faster than bunzip2. It takes almost an hour to restore the system using bunzip2, whereas it takes less than 25 minutes using gunzip. Granted, our recovery feature may be used very rarely (hopefully never!), but which version do you think we are going to ship to our customer?
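
    (For anyone who wants to run the same comparison on their own image, the timing boils down to two commands; a minimal sketch with placeholder filenames:)

    # decompress to /dev/null so disk write speed doesn't skew the numbers
    $ /usr/bin/time -f "gunzip:  %e seconds elapsed" gunzip -c recovery.img.gz > /dev/null
    $ /usr/bin/time -f "bunzip2: %e seconds elapsed" bunzip2 -c recovery.img.bz2 > /dev/null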

  9. I recently ran some tests using gzip and bzip2 — independently and then both on the same files. The results astonished me.
    Original: 1667 MB
    gzip: 273 MB
    compress: 58 sec
    uncompress: 14 sec
    bzip2: 144 MB
    compress: 190 sec
    uncompress: 44 sec
    bzip2, then gzip: 37 MB
    compress: 190 sec + 9 sec = 199 sec
    uncompress: 2 + 44 sec = 46 sec
    gzip, then bzip2: 20 MB
    compress: 58 sec + 180 sec = 238 sec
    uncompress 7 + 14 sec = 21 sec

    I was taken completely by surprise by the extent to which either gzip or bzip2 was able to further compress the output of the other program.
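
    (The chaining itself is just a pipe; a minimal sketch with placeholder filenames, in case anyone wants to try the “gzip, then bzip2” combination:)

    # compress with gzip first, then bzip2; reverse the order to unpack
    $ gzip -c9 data.file | bzip2 -9 > data.file.gz.bz2
    $ bunzip2 -c data.file.gz.bz2 | gunzip -c > data.file.restored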

  10. Nobody in their right mind would use lzma at maximum compression; the memory requirements alone would cause swapping on anything but a computer with a generous amount of RAM, and as noted, compression takes forever. Re-compressing video is a daft example, as the only bits that are compressible are the packet headers.

    Compressing a 1GiB blank file on my computer (a previous-generation 17″ unibody MBP) with the default compression level uses 81MB of RAM and took 692s, resulting in 148K. Decompression took 9s, and compression ratio and decompression speed are far more important than compression speed for archived data. With maximum compression, compression took 1184s, consumed 309MB of RAM, and resulted in a file size of 148K. Hmm.

    Then there are things like parallel gzip and parallel bzip2, which make use of multiple cores to speed compression substantially! Bzip2 is slow to decompress compared to all others. Gzip is the universal leader in speed, especially the blindingly fast parallel versions.
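
    (The parallel versions are essentially drop-in replacements on the command line; a minimal sketch, assuming pigz and pbzip2 are installed:)

    # both tools use all available cores by default
    $ pigz -9 -c opt.tar > opt.tar.gz
    $ pbzip2 -9 -c opt.tar > opt.tar.bz2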

    • “Gzip is the universal leader in speed,”

      I beg to differ. Try lzop. It’s much faster than gzip, even gzip --fast. Naturally, lzop compresses less than gzip. But that’s the point.

      lzop is fastest, weakest compression.
      gzip is middle-ground.
      bzip2 doesn’t deserve to exist anymore (because of lzma / xz)
      lzma / xz -1 compresses better and faster than any bzip2. It’s much better compression than either gzip or lzop, but gzip and lzop are both faster.

      And finally, lzma / xz -9 achieves the best compression of all, but is ridiculously slow (as this blog post indicates).

      • Sorry dude, cp is even faster. Sure, there’s less compression (none, in fact), but it’s faster. The point, of course, is that which tool is better depends on your compression and speed requirements, i.e. which tradeoff suits you best.

      • So, dunno about your data set, but I was compressing some text files and was hoping to find something that would pack them a bit smaller than what I was using (bzip2 -9).

        Program     Size (% of original)
        gzip -9     26.5%
        7z -mx=9    19.5%
        lzma -9     19.5%
        bzip2 -9    18.1%

        bzip2 wins

  11. There’s a flaw in your technique here. Since you always opt for compression level 9, you’re not finding the best balance between speed and compression ratio.

    The best result comes from lzma (or xz) -1 instead of -9.

    Please give it a try. With xz -1, you get compression that is roughly 2x faster and 2x better than any level of bzip2.

    gzip --fast is still faster than xz -1, but xz -1 generally gets 4x better compression than gzip. So you’ve got to decide which you care about more and make your own choice.
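
    (Reproducing that comparison takes one line per tool; a minimal sketch, assuming xz is installed and reusing the opt.tar from the post:)

    $ /usr/bin/time -f "xz -1:    %U seconds" xz -1 -c opt.tar > opt.tar.xz
    $ /usr/bin/time -f "bzip2 -9: %U seconds" bzip2 -9 -c opt.tar > opt.tar.bz2
    $ ls -lh opt.tar.xz opt.tar.bz2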

  12. I would reason that LZMA is only slower because it catalogs a greater proportion of the entire file before it runs its algorithm. This would also explain why it has a lower compression ratio. It’s a natural trade-off between the two. Parallel processing would reduce this trade-off and make high-compression-ratio algorithms more practical for commercial or business needs.

  13. I meant higher compression ratio for LZMA …

  14. Is LZMA suitable for a hardware implementation using an FPGA?

    Will it work faster when implemented in hardware?

    I’m trying to implement LZMA on an FPGA for my final year project..

    Can you friends help me complete the project…?

    Thanks in advance.. 🙂

  15. He’s not necessarily a pirate! He may have used a PVR to record the show and then renamed it that… Geez.

  16. […] to offer the best compromise between compression time and compression ratio, as reviewed here: https://odzangba.wordpress.com/2009/03/25/gzip-vs-bzip2-vs-lzma/. I was looking for quick and efficient way to zip up some old files for back up. Bz2 won out for […]

  17. I realize that the article was written a while back, but I’m curious to understand why and how you could compare an archive format (e.g. gzip) to compression methods such as bzip2 and lzma.

  18. gzip is not an archive format; tar is. gzip uses a compression algorithm, just like bzip2 and 7z (LZMA) do.

    • English is a terrible language, and easy to make confusing. TAR, RAR, ZIP, and 7Z are archive formats: ways of grouping multiple files into a single file. GZip, BZip2, LZMA, LZSS+PPMd, ZLib, etc. are compression algorithms. Generally, compression algorithms are tied to a specific container format (e.g. 7z uses LZMA, RAR uses LZSS+PPMd, ZIP uses ZLib, etc.), however this is not always the case. Standalone utilities can apply almost all of these algorithms at the stream level, allowing you to compress a single file. Some of the archive formats (like TAR) are also stream formats, while others, like ZIP, allow random file access.
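
      (A tiny illustration of the distinction, with made-up filenames: a single file can be compressed directly at the stream level, but grouping several files needs an archiver first.)

      $ gzip -9 notes.txt                    # stream-level compression of one file -> notes.txt.gz
      $ tar -cf docs.tar a.txt b.txt c.txt   # tar groups files into a single archive (no compression)
      $ gzip -9 docs.tar                     # then the archive itself is compressed -> docs.tar.gz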

  19. When you said “about 140GiB of my data is made up of AVIs, PNGs and JPEGs” the first thing that came to my mind was porn 😛

  20. […] gzip, bzip2, lzma: bzip2 best on low-entropy data; gzip is the fastest; lzma is the slowest, and also not the best […]

  21. Compressed a 4GB MySQL dump to 234MB using

    bzip2 -c9 dump_on_13_aug_14.sql > dump_on_13_aug_14.sql.bz2

    time taken
    real 13m31.003s

    • On a smaller sample size, I have a BSON (binary JSON) dump of forums with ~7K users, active for about a year, using single-character attribute names. Original size: 11MB. After a dozen runs, averaged, of bzip2 -9, gzip -9, and xz -1, I get: 2.5M (4.343:1) in 1.19s, 3.2M (3.367:1) in 0.86s, and 2.7M (3.968:1) in 0.88s, respectively.

      Normalized for per-second efficiency we get:

      BZip2 -9: 3.64
      GZip -9: 3.91
      Xz -1: 4.51

      There’s a pretty clear winner: “fast” xz, and that’s not taking into account memory used during the compression and decompression phases. Xz in “fast” mode (-1) gets a compression ratio between extreme gzip and extreme bzip2 in a speed statistically indistinguishable from gzip. Running xz -4 gives a result smaller than bzip2 -9 in 3.8x the time, so yes, bzip2 does trump xz… up to a point. Xz offers a much better balance between the two extremes.

      If space is the only consideration, however, nothing listed so far will beat xz. Running xz -9 across the same data gives a ratio of 4.66:1 (~7% better, at 2.3M)… but in 5.14x the time bzip2 takes. This gives it a very low normalized efficiency of 0.76. Since the majority of compression cases are compress-once, decompress-many, you can use xz to squeeze out every last bit of compressible space in extreme mode, while still getting the most efficiently balanced compression in fast mode.
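
      (For clarity, the “per-second efficiency” figures above are simply the compression ratio divided by the compression time; for example, for the bzip2 -9 run:)

      $ echo "scale=2; 4.343 / 1.19" | bc
      3.64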

  22. […] GZIP vs. BZIP2 vs. LZMA | Odzangba Kafui Dake. […]

