Obnam repo size with .sql dump files seem too big

Lars Wirzenius liw at liw.fi
Tue Jan 1 17:38:22 GMT 2013

On Tue, Jan 01, 2013 at 11:28:03AM -0500, S. B. wrote:
> Hi everyone,
> I'm using and very much liking Obnam on my webserver to do backups of my
> SQL database dumps.
> However, it seems that the size of my Obnam repository is increasing far
> too much despite minimal changes in my databases. For example, the first
> time I backed up my database .sql dump files, my Obnam repo was 582M. Then,
> just to check, I immediately deleted the .sql dumps and re-dumped the
> databases and ran Obnam again. Now the Obnam repo size went up to 794M. I
> immediately ran the same test again, and the repo size went up to 1006M.
> The database was still running and I'm sure it slightly increased in size,
> but we're talking about kilobytes of difference. Is this normal for
> virtually identical data sets to cause so much additional overhead in
> Obnam? And is there anything I can do to make my sql dumps more
> de-duplicateable? I should mention that they are un-compressed dumps, and
> in Obnam I am using "deflate" compression.

Obnam does de-duplication by splitting up file data into chunks,
and storing those individually. If two files have the same data,
Obnam re-uses the already backed up chunk. So far, so good. However,
for performance reasons, Obnam currently only notices duplicate chunks
when they start at offsets that are integer multiples of the chunk
size.

For example, assume a chunk size of 4 bytes, and the following two
files:

    file 1: AAAABBBBCCCC
    file 2: BBBBCCCCAAAA

In this case, Obnam will easily notice that there are three chunks
("AAAA", "BBBB", and "CCCC"), and will store them only once in the
backup repository. However, consider the following file:

    file 3: xAAAABBBBCCCC

File 3 is identical to file 1, except that one new byte ("x") has
been inserted at the start of the file. This shifts everything after
it by one byte, so Obnam sees file 3 as four chunks: "xAAA", "ABBB",
"BCCC", and "C". None of these chunks match the chunks already in
the backup repository. Thus, Obnam thinks they're all new.
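
To make the alignment problem concrete, here is a minimal Python
sketch of fixed-offset chunking using the three example files above.
It is only an illustration, not Obnam's actual code: the names
fixed_chunks and backup, and the in-memory set standing in for the
repository, are made up for this example, and a real backup program
would typically look chunks up by checksum rather than by comparing
the data itself.

```python
def fixed_chunks(data, size=4):
    """Split data into chunks that start at multiples of `size`."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Hypothetical stand-in for the backup repository: a set of chunks.
repo = set()


def backup(data, size=4):
    """Store any chunks not yet in the repo; return the newly stored ones."""
    newly_stored = []
    for chunk in fixed_chunks(data, size):
        if chunk not in repo:
            repo.add(chunk)
            newly_stored.append(chunk)
    return newly_stored


new1 = backup("AAAABBBBCCCC")   # file 1: all three chunks are new
new2 = backup("BBBBCCCCAAAA")   # file 2: same chunks, nothing new
new3 = backup("xAAAABBBBCCCC")  # file 3: shifted by one byte

print(new1)  # ['AAAA', 'BBBB', 'CCCC']
print(new2)  # []
print(new3)  # ['xAAA', 'ABBB', 'BCCC', 'C']
```

A single inserted byte is enough to make every chunk of file 3 look
new, even though twelve of its thirteen bytes are already stored.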

There is no technical reason why Obnam could not notice that file 3
only has one inserted byte. However, doing so would require a very
large number of lookups in the repository, and thus would be quite
slow. There may be better ways of noticing the minute difference,
and perhaps someday one of them will be implemented in Obnam.
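
One family of such approaches is content-defined chunking, where
chunk boundaries are chosen from the data itself rather than from
fixed offsets. This is not something Obnam does as described in this
mail; the sketch below is only a toy to show the idea. The function
name, the window and divisor values, and the use of a plain byte sum
as the rolling hash are all assumptions made for the example (a real
implementation would want a much stronger rolling hash and limits on
chunk size).

```python
def content_defined_chunks(data, window=16, divisor=64):
    """Cut after any byte where the sum of the last `window` bytes
    is divisible by `divisor` (a deliberately weak rolling hash)."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte                   # byte entering the window
        if i >= window:
            rolling -= data[i - window]   # byte leaving the window
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks


# Made-up "SQL dump" and a copy with one byte inserted at the start.
dump1 = b"".join(b"INSERT INTO t VALUES (%d);\n" % n for n in range(200))
dump2 = b"x" + dump1

chunks1 = content_defined_chunks(dump1)
chunks2 = content_defined_chunks(dump2)

known = set(chunks1)
new_chunks = [c for c in chunks2 if c not in known]
print(len(chunks2), "chunks in second dump,", len(new_chunks), "not seen before")
```

Because each boundary depends only on the bytes in the window just
before it, the boundaries after the inserted byte land in the same
places relative to the data, and only the chunk or two at the very
start differ from the first dump.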

Note that Obnam does not do a "diff" (or "xdelta" or other such
approach) to notice differences between successive versions of
files. Doing so would make backup generations depend on each
other, and would re-introduce the distinction between "full" and
"incremental" backups, which is not acceptable.

With SQL dumps of databases, there are often small changes at
the beginning of the file, or in the middle of the file. These
shift all the data after them, which makes Obnam's de-duplication
work very badly, even if the data as such has only changed a tiny
bit.

Unfortunately, I don't know of a trick that would make the SQL
dumps work better with Obnam. In any case, you should not have
to munge your live data to suit Obnam: Obnam needs to be able
to deal with whatever data you have. Until Obnam's de-duplication
becomes better, though, perhaps someone would have a workaround?

The best idea I have, untested, is to keep the first SQL dump in
the live data. Then, before each backup, do a new dump, diff it
against the first dump, keep the diff, delete the new dump, and run
the backup. This way, each successive Obnam backup generation will
have two files (the original SQL dump, and the diff), and you'll
need to apply the diff to the original to get the real dump you need
to restore your database. Does that make sense to anyone?
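
A rough sketch of that idea in Python, equally untested against real
dumps. In practice the diff and patch command line tools would
probably be the natural choice; this version uses the standard
library's difflib only so that it is self-contained. The function
names make_delta and apply_delta and the sample data are made up for
the example.

```python
import difflib


def make_delta(base_lines, new_lines):
    """Describe new_lines as copies from base_lines plus literal new lines."""
    matcher = difflib.SequenceMatcher(a=base_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))               # reuse base lines
        else:
            delta.append(("insert", new_lines[j1:j2]))   # changed or added lines
    return delta


def apply_delta(base_lines, delta):
    """Rebuild the newer dump from the original dump and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Made-up first dump, and a later dump with one changed and one added row.
base_dump = ["INSERT INTO t VALUES (%d);\n" % n for n in range(100)]
new_dump = list(base_dump)
new_dump[50] = "INSERT INTO t VALUES (5000);\n"
new_dump.append("INSERT INTO t VALUES (100);\n")

delta = make_delta(base_dump, new_dump)     # back this up next to base_dump
restored = apply_delta(base_dump, delta)    # do this at restore time
```

The original dump never changes, so it de-duplicates perfectly
between generations, and only the small delta is new data each time.
The price is the extra step at restore time, and a delta that keeps
growing as the database drifts away from the first dump.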

http://www.cafepress.com/trunktees -- geeky funny T-shirts
http://gtdfh.branchable.com/ -- GTD for hackers