lrzip v0.18

Long Range ZIP or Lzma RZIP

This is a compression program optimised for large files. The larger the file
and the more memory you have, the better the compression advantage this will
provide, especially once the files are larger than 100MB. The advantage can
be chosen to be either size (much smaller than bzip2) or speed (much faster
than bzip2). Decompression is always much faster than bzip2.

Lrzip uses an extended version of rzip which does a first pass long distance
redundancy reduction. The lrzip modifications make it scale according to
memory size.
The data is then either:
1. Compressed by lzma (default), which gives excellent compression
at approximately half the speed of bzip2 compression.
2. Compressed by lzo, which on most machines compresses faster than disk
writing, making it as fast as (or even faster than) simply copying a large
file.
3. Left uncompressed and rzip prepared. This form substantially improves
any compression subsequently performed on the resulting file, in both size
and speed (because rzip preparation merges similar compressible blocks of
data and creates a smaller file).
4. Compressed by bzip2, giving an rzip-like compression format.
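
Assuming a tarball called myfiles.tar, the modes above map to command lines
like these (the -l, -n and -d flags are the ones used elsewhere in this
README; run lrzip -h for the full list, including the bzip2 mode's flag):

```shell
# Default: rzip pass followed by lzma compression.
lrzip myfiles.tar          # produces myfiles.tar.lrz

# Fast mode: rzip pass followed by lzo compression.
lrzip -l myfiles.tar

# rzip preparation only, no back-end compression.
lrzip -n myfiles.tar

# Decompression is the same for every mode.
lrzip -d myfiles.tar.lrz
```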

The major disadvantages are:
1. It only works on single files. To get the best performance out of the
compression it is best to tarball all your files together.
2. It requires a lot of memory to get the best performance out of it, and
is not really usable (for compression) with less than 256MB of ram.
Decompression requires very little ram and works on small ram machines.
3. It does not work on stdin/stdout.


Example on a 1GB ram P4 3GHz: 

A tarball of a fully compiled kernel tree:
		Size		Compression 	Decompression
base file:	646963200
gzip		218071923	1:27.27		0:45.39	
bzip2		192484690	4:41.62		1:41.20
bzip2 -1	215555795	3:24.08		1:21.45
bzip2 -9	192484690	4:53.18		1:31.40
lzma		112229937	11:48.07	0:56.38
lzma -9		97704505	27:18.77	?
lrzip		88560021	10:11.28	0:57.88
lrzip -l	191415649	0:30.19		0:50.69
lrzip -M	82708048	11:45.79	1:00.75
lrzip -n	389125460	0:31.02		0:58.9

Summary: 	Ratio		Value(Ratio/Time)
gzip		2.97		2.048
bzip2		3.36		0.717
bzip2 -1	3.00		0.883
bzip2 -9	3.36		0.688
lzma		5.76		0.488
lzma -9		6.62		0.242
lrzip		7.31		0.718
lrzip -l	3.38		6.760 *
lrzip -M	7.82 *		0.666
lrzip -n	1.66		3.222


Requires:
liblzo2-dev
libbz2-dev


To build/install:
./configure
make
make install


FAQs.

Q. How do I make a static build?
A. make static

Q. I want the absolute maximum compression I can possibly get, what do I do?
A. Try the -M option. Note that it will use all available ram, so expect
serious swapping to occur. It may even fail to run if you do not have
enough swap space allocated. Why? The more ram lrzip uses, the better the
compression it can achieve.

Q. Can I use your tool for even more compression than lzma offers?
A. Yes, the rzip preparation of files makes them more compressible by every
other compression technique I have tried. Using the -n option will generate
a .lrz file smaller than the original, which should be more compressible,
and since it is smaller it will compress faster than it otherwise would
have.
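
As a concrete sketch of this (assuming a tarball called big.tar, and using
bzip2 purely as an example second-stage compressor):

```shell
# Stage 1: rzip preparation only (-n), no back-end compression.
lrzip -n big.tar           # produces big.tar.lrz, smaller than big.tar

# Stage 2: compress the prepared file with any compressor you like.
bzip2 big.tar.lrz          # produces big.tar.lrz.bz2

# To restore, reverse the steps.
bunzip2 big.tar.lrz.bz2
lrzip -d big.tar.lrz
```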

Q. How about 64bit?
A. As of v0.15 64 bit is working well.

Q. Other operating systems?
A. Patches are welcome. The configure/build system works only on linux at
the moment, but a darwin-specific Makefile without configure is included
that should work.

Q. Can it be made to work on stdin/stdout?
A. The rzip design basically works in a way that makes this virtually
impossible.

Q. Really why can't I use stdin/stdout?
A. Well, the first compression stage (rzip) takes the largest chunk of the
file your ram can fit and completely reorders all the data in it. The data
is then handed in chunks to the compressor, and then written to disk.
Theoretically, for stdin, lrzip could buffer all input until it filled the
chunk size and then start compressing, so adding stdin would not be too big
a stretch. With stdout, on the other hand, the data cannot be fed to
anything until it is completely decompressed and re-ordered into the
original chunk size. Theoretically we could decompress a whole chunk in
ram, reorder it and then start piping it to stdout. This would mean the
decompression ram requirements would be almost as big as the compression
requirements, which makes it not portable to machines with less ram.
Currently lrzip uses extraordinarily little ram on decompression, and is
very fast; adding stdout support would cancel both of those advantages.
The other option for supporting stdin/stdout is to write each chunk to a
separate file and then feed it on. None of these are particularly desirable
or practical. Since stdout support is impractical, there is no point
implementing just stdin.

Q. I still want stdin/stdout?
A. I take patches.

Q. I have another compression format that is even better than lzma, can you
use that?
A. You can use it yourself on rzip prepared files (see above).
Alternatively, if the source code is compatible with the GPL license it can
be added to the lrzip source code. Libraries with functions similar to
zlib's compress() and decompress() would make the process most painless.
Please tell me if you have such a library so I can include it :)

Q. What's this "Progress percentage pausing during lzma compression" message?
A. While I'm a big fan of a visible progress percentage, lzma compression
unfortunately can't currently be tracked once a 100+MB chunk has been
handed over to the lzma library. Therefore you'll see the progress
percentage advance until each chunk is handed over to the lzma library,
then pause. lzo, bzip2 and no compression don't have this problem and show
progress continuously.

Q. What's this "lzo testing for incompressible data" message?
A. lzma is the slowest compression technique in lrzip, and lzo is the
fastest. To help speed up the process, lzo compression is performed on the
data first to test that the data is compressible at all. If a small block
of data is not compressible, it tests progressively larger blocks until it
has tested all the data (if it fails to compress at all). If no
compressible data is found, then lzma compression is not even attempted.
This can save a lot of time during the compression phase when there is
incompressible data. It also works around a known bug where incompressible
data gets the lzma compression library stuck in an endless loop.
Theoretically it may be possible for data to be compressible by lzma and
not at all by lzo, but in practice such data achieves only minuscule
amounts of compression, which are not worth pursuing. Most of the time it
is clear one way or the other whether data is compressible or not.

Q. I have truckloads of ram so I can compress files much better, but can my
generated file be decompressed on machines with less ram?
A. Yes. Ram requirements for decompression go up only by the -L compression
option with lzma and are never anywhere near as large as the compression
requirements.

Q. Any plans to turn this into a complete archiver?
A. Not really. The compression format relies on being fed large files, and
tar does a good job of this already. Maybe I should include a script with
lrzip that automates what tar+lrzip does.
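
A minimal sketch of such a wrapper, assuming lrzip is on the PATH (the
script name and its behaviour are purely illustrative, not something
shipped with lrzip):

```shell
#!/bin/sh
# lrztar (hypothetical): tar up a directory and lrzip the result.
# Usage: lrztar <directory>
set -e
dir="$1"
[ -d "$dir" ] || { echo "usage: lrztar <directory>" >&2; exit 1; }
tar cf "$dir.tar" "$dir"
lrzip "$dir.tar"            # produces "$dir.tar.lrz"
rm -f "$dir.tar"            # keep only the compressed archive
```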

Q. I've changed the compression level with -L in combination with -l and the
file size doesn't vary?
A. That's right, -l only has one compression level.

Q. Help? I'm a newbie and have no idea how to turn my directory into a
tarball!
A. Here is a walkthrough for a directory called myfiles.
To compress:
	tar cf myfiles.tar myfiles
	lrzip myfiles.tar
This will create a file called myfiles.tar.lrz.
To extract:
	lrzip -d myfiles.tar.lrz
	tar xf myfiles.tar
This will create and extract everything into a directory called myfiles.

Q. Why are you including bzip2 compression?
A. To maintain a similar compression format to the original rzip (although the
other modes are more useful).

Q. What about multimedia?
A. Most multimedia is already in a heavily compressed "lossy" format which
by its very nature has very little redundancy. This means that there is not
much that can actually be compressed. If your video/audio/picture is at a
high bitrate, there will be more redundancy than in a low bitrate one,
making it more suitable for compression. None of the compression techniques
in lrzip are optimised for this sort of data. However, the nature of rzip
preparation means that you'll still get better compression than most normal
compression algorithms give you if you have very large files. ISO images of
dvds, for example, are best compressed directly rather than as individual
.VOB files.

Q. Is this multithreaded?
A. Short answer, no.
The main compression advantage of lrzip is that it uses most of the
available ram during compression. Compressing with multiple threads or
processes would require just as much ram per thread, so any speed advantage
of multithreading would compromise the compression. In lzo compression,
where speed is everything, the speed is not remotely limited by cpu
performance, so adding threading would be unhelpful and would adversely
affect compression. The lzma routine may become multithreaded for each
larger block in the near future if the lzma sdk includes multithreading
across each block.

Q. This uses heaps of memory, can I make it use less?
A. You can, by setting -w to the lowest value (1), but the huge use of
memory is what makes the compression better than ordinary compression
programs, so it defeats the point. You'll still derive benefit with -w 1,
but not as much.
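
For example (assuming a tarball called myfiles.tar):

```shell
# Smallest compression window: least ram used, least rzip benefit.
lrzip -w 1 myfiles.tar
```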

Q. What CFLAGS should I use?
A. With a recent enough compiler (gcc > 4), setting both CFLAGS and
CXXFLAGS to
	-O3 -march=$archname -fomit-frame-pointer
and putting your architecture into $archname (like pentium4) gives
noticeable speed improvements with lzma without risk of breakage. Because
of the c++ code used in lzma, -O3 actually gives a demonstrable advantage
over -O2 (unlike with most c programs).
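
On a pentium4, for instance, the suggestion above amounts to the following
build fragment (substitute your own architecture for -march):

```shell
CFLAGS="-O3 -march=pentium4 -fomit-frame-pointer" \
CXXFLAGS="-O3 -march=pentium4 -fomit-frame-pointer" \
./configure
make
```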

Q. What compiler does this work with?
A. It has been tested on gcc, ekopath and the intel compiler successfully.
Whether the commercial compilers help or not, I could not tell you.

Q. What codebase are you basing this on?
A. rzip v2.1 and lzma sdk443, but it should be possible to stay in sync with
each of these in the future.

Q. Do we really need yet another compression format?
A. It's not really a new one at all; simply a reimplementation of a few very
good performing ones that will scale with memory and file size.

Q. How do you use lrzip yourself?
A. Two basic uses. I compress large files currently on my drive with the
-l option since it is so quick to get a space saving, and when archiving
data for permanent storage I compress it with the default options.

Q. I found a file that compressed better with plain lzma. How can that be?
A. When the file is more than 5 times the size of the compression window
you have available, the efficiency of rzip preparation drops off as a means
of getting better compression. Eventually, when the file is large enough,
plain lzma compression will achieve better ratios. The lrzip compression
will be a lot faster, though. Currently I have no way around this problem
without throwing more and more ram at the compression, because trying to do
this off disk (whether directly on the file or from swap) would mean the
file is read a ridiculous number of times over and over again. It presents
an interesting problem for which there is no perfect solution, but it
certainly has us thinking hard about how to tackle it.

Q. Can I use swapspace as ram for lrzip with a massive window?
A. No. Making lrzip work completely from disk would mean the data is read
off disk an unrealistic number of times over and over again. For example,
if you have 1GB of ram and a 2GB file to compress, it might read the file a
billion times off disk. Most hard drives would fail in that time :) See the
previous question.

Q. Tell me about patented compression algorithms, GPL, lawyers and copyright.
A. No.



BUGS:
Probably lots.


Links:
rzip:
http://rzip.samba.org/
lzo:
http://www.oberhumer.com/opensource/lzo/
lzma:
http://www.7-zip.org/


Thanks to Andrew Tridgell for rzip. Thanks to Markus Oberhumer for lzo.
Thanks to Igor Pavlov for lzma. Thanks to Christian Leber for lzma
compat layer, Michael J Cohen for Darwin support, and everyone else who coded
along the way.


Con Kolivas <kernel@kolivas.org>
Mon, 6 Nov 2006
