Wednesday, May 4, 2016

Jzip parallel unzipper performance results

In my previous blog post I wrote about reimplementing Info-Zip in modern C++ so that we can unpack files in parallel. This sparked a lively Reddit discussion. One of the main points raised was that once this implementation gets support for everything else Info-Zip does, it will be just as large.

Testing this is simple: just implement all the missing features. Apart from encryption support (which you should not use; use gpg instead), the work is now finished and available in the jzip GitHub repo. Here are the results.

Lines of code (as counted by wc)


Info-Zip: 82 057 lines of C
jzip: 1091 lines of C++

Stripped binary size


Info-Zip: 159 kB
jzip: 51 kB

Performance

Performance was tested by zipping a full Clang source + build tree into one zip file. This includes the source, the svn directories and all build artifacts. The total size was 9.3 gigabytes. Extraction times were as follows.

Info-Zip: 5m 38s
jzip: 2m 47s

Jzip is roughly twice as fast. This is a somewhat underwhelming result given that the test machine has 8 cores. Further examination showed that jzip saturates the hard drive's write capacity, which is what limits the speedup.
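As a rough sanity check on the disk-bound explanation, the average write rate implied by the timings above can be estimated. This is only a sketch: it assumes the 9.3 gigabytes refers to the extracted data and uses decimal units (9300 MB). The function names are made up for this example and are not part of jzip.

```cpp
#include <cstdio>

// Convert a "XmYs" style timing into seconds.
double to_seconds(int minutes, int seconds) {
    return minutes * 60.0 + seconds;
}

// Average throughput in MB/s for `megabytes` of data written in `seconds`.
double throughput_mb_per_s(double megabytes, double seconds) {
    return megabytes / seconds;
}

// Print the implied average write rate for both tools.
void print_throughput() {
    // Assumption: 9.3 GB is the size of the extracted tree, decimal units.
    const double total_mb = 9300.0;
    std::printf("Info-Zip: %.1f MB/s\n",
                throughput_mb_per_s(total_mb, to_seconds(5, 38)));
    std::printf("jzip:     %.1f MB/s\n",
                throughput_mb_per_s(total_mb, to_seconds(2, 47)));
}
```

Under those assumptions, calling `print_throughput()` gives roughly 27.5 MB/s for Info-Zip and 55.7 MB/s for jzip. These are averages over the whole run, so they say nothing about peak rates or seek overhead.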

Conclusions

Using a few evenings' worth of spare time, it is possible to reimplement an established (but relatively straightforward) product with roughly two orders of magnitude less code and massively better performance.

Update: more measurements

Using a 48-core machine with fast disks:

Info-Zip: 3m 32s
jzip: 12s

On this machine jzip takes about 94% less time, which makes it roughly 17.7 times as fast.
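"Percent faster" can be read in more than one way, so it helps to compute both the speedup factor and the share of wall-clock time saved from the measured times. The helpers below are a minimal sketch with names invented for this example; the inputs are the timings quoted in this post, converted to seconds.

```cpp
#include <cstdio>

// How many times as fast the new run is: old time divided by new time.
double speedup_factor(double old_seconds, double new_seconds) {
    return old_seconds / new_seconds;
}

// Percentage of wall-clock time saved relative to the old run.
double time_saved_percent(double old_seconds, double new_seconds) {
    return (old_seconds - new_seconds) / old_seconds * 100.0;
}

// Print both figures for one machine.
void print_comparison(const char* label, double old_s, double new_s) {
    std::printf("%s: %.1fx as fast, %.1f%% less time\n", label,
                speedup_factor(old_s, new_s),
                time_saved_percent(old_s, new_s));
}
```

For example, `print_comparison("48 cores", 212.0, 12.0)` uses 3m 32s versus 12s, and `print_comparison("8 cores", 338.0, 167.0)` uses 5m 38s versus 2m 47s. The first works out to about 17.7x and 94.3% less time, the second to about 2.0x and 50.6% less time.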

On the other hand, when running on a machine with a slow disk, jzip may be up to 30% slower because of write contention overhead.

2 comments:

  1. Nice!
    In my book the speed boost is like that:
    (212s-12s)/12s*100%=1666% or 16x faster
    Jussi, are you aware of this compression competition:
    https://encode.su/threads/3421-GDC-Competition-Notices
    Running it in single-threaded mode can set Jzip on the map, it is kinda under the radar at the moment :(

    Replies
    1. That seems to be a competition for new compression tech. Jzip (which is now called Parzip) simply uses Zlib or xz.