Speed up restores by different restore methods / options

Lucas Rolff shared this idea 21 months ago
Completed

The current way JetBackup handles restores can be rather slow, and it can be optimized *a lot* in various scenarios.


Full Account Restores


Currently, a full account restore works like this:

  1. JetBackup asks the backup server to tar.gz the snapshot you want to restore, writing the archive locally on the backup server
  2. JetBackup then scps the file to the webserver where it has to be restored
  3. JetBackup executes /scripts/restorepkg on the .tar.gz file
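The three steps above can be sketched roughly like this. Everything here is simulated locally: the paths and the fake account are made up, and a plain tar extract stands in for cPanel's /scripts/restorepkg so the sketch actually runs.

```shell
# Sketch of the current restore flow, simulated locally.
# In a real setup step 1 runs on the backup server, step 2 is an scp
# between two hosts, and step 3 is cPanel's /scripts/restorepkg.
set -e
work=$(mktemp -d)
mkdir -p "$work/snapshot/homedir" "$work/webserver"
echo "hello" > "$work/snapshot/homedir/index.html"

# 1. The backup server tars + gzips the snapshot (extra write IO on the backup box)
tar -C "$work/snapshot" -czf "$work/cpmove-user.tar.gz" .

# 2. The archive is copied to the webserver (scp in the real flow)
cp "$work/cpmove-user.tar.gz" "$work/webserver/"

# 3. restorepkg unpacks it; a plain extract is the stand-in here
mkdir "$work/restored"
tar -C "$work/restored" -xzf "$work/webserver/cpmove-user.tar.gz"
diff -r "$work/snapshot" "$work/restored" && echo "restore OK"
```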


This can be optimized by rsyncing the snapshot directly as-is (uncompressed) from the backup server to the webserver, and then running /scripts/restorepkg, since restorepkg supports restoring uncompressed backups.


This saves the time it takes to write the .tar.gz file on the backup server (useless IO).


Using "rsync -> restorepkg" to transfer the files will overall be faster than "tar -> scp -> restorepkg", and puts less load on the backup server (which might be doing other things).


With rsync there should be the possibility to enable or disable compression: in most cases the compression time isn't faster than sending the uncompressed files over a 1 gigabit link. It only wins when you can get a high compression ratio, which, to be honest, rarely happens for most hosting accounts.
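A quick way to see why the compression ratio decides this: gzip a compressible and an incompressible sample and compare sizes. The 1.5 break-even threshold in the comment is the figure used later in this post; the samples themselves are made up for illustration.

```shell
set -e
work=$(mktemp -d)

# Highly compressible sample (repeated text) vs incompressible sample (random)
yes "the quick brown fox jumps over the lazy dog" | head -c 1048576 > "$work/text"
head -c 1048576 /dev/urandom > "$work/random"

for f in text random; do
  orig=$(wc -c < "$work/$f")
  gz=$(gzip -c "$work/$f" | wc -c)
  # ratio * 100 using integer math; above 150 means the ratio clears the
  # 1.5 break-even point discussed in this post
  ratio=$((orig * 100 / gz))
  echo "$f: ratio x100 = $ratio"
done
```

The text file compresses many times over, so `-z` would pay off for it; the random file barely shrinks at all (gzip even adds a little overhead), so compressing it only burns CPU time.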


File Restores

The same principle applies to file restores:


  1. JetBackup asks the backup server to tar.gz a specific set of files and/or directories
  2. JetBackup then scps the file
  3. JetBackup uncompresses the file in a temporary location
  4. JetBackup fixes permissions
  5. JetBackup moves the files to the correct location
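The five steps, simulated locally (paths and filenames invented; the chmod stands in for the real ownership/permission fix, which would chown to the account user):

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/backup/public_html" "$work/homedir/public_html"
echo "old" > "$work/homedir/public_html/page.html"
echo "restored" > "$work/backup/public_html/page.html"

# 1 + 2. The backup server tars the requested files; the archive is copied over
tar -C "$work/backup" -czf "$work/files.tar.gz" public_html/page.html

# 3. Uncompress into a temporary location
mkdir "$work/tmp"
tar -C "$work/tmp" -xzf "$work/files.tar.gz"

# 4. Fix permissions (a real restore would also chown to the account user)
chmod 644 "$work/tmp/public_html/page.html"

# 5. Move the file into place, replacing the old copy
mv -f "$work/tmp/public_html/page.html" "$work/homedir/public_html/page.html"
cat "$work/homedir/public_html/page.html"
```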


Here we have a few additional steps: uncompressing the file to a temp location and fixing permissions.


The fixing of permissions isn't really an issue; the file system will do it quickly, since the inodes are already in the disk buffers after we just uncompressed the files.


Uncompressing a file isn't an issue either, but the fact that the files are compressed in the first place is: we compress something to copy it, only to uncompress it again. This causes useless IO on the backup server, since we have to read the files to compress them and at the same time write out the compressed archive.


Once again we can make use of rsync (as is already done for backups) to transfer the files to the destination server.

This has the benefit of saving a lot of writes on the backup server, and we can transfer with or without compression in rsync; once again, depending on the compression ratio, compressing might actually take longer than the network overhead of transferring the files uncompressed.


Footnotes

[1] I was informed by support that compression is used because restoration times were measured to be faster with compress -> scp -> restore; this was measured in the first initial versions.


Note: I believe this was true, but years have passed; the majority of backup servers are connected with 1 or 10 gigabit networking these days (if not more).

A 1 gigabit connection can move 125 megabytes per second (let's be fair and say 120 megabytes to account for some TCP/IP overhead).
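The arithmetic, for reference (integer shell math; the 4280 MB figure corresponds to the 4.28 gigabyte benchmark account below):

```shell
# 1 gigabit = 1000 megabits; 8 bits per byte
echo $((1000 / 8))     # prints 125 (MB/s at line rate)

# With ~120 MB/s of usable throughput, moving the 4.28 GB account
# below over the wire takes roughly this many seconds:
echo $((4280 / 120))   # prints 35
```

In other words, the raw transfer is well under a minute; the multi-minute restore times measured below are dominated by disk IO, not by the network.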


Most disk arrays (even raid 6) can surely write more than 125 megabytes per second, but write speed isn't the issue; random IO is.


Most backup servers are large-capacity, high-redundancy setups using raid 5, 6, 50 or 60 on spinning disks (Nearline SAS or SATA 7.2k), and these do awful random IO, so you want to limit the amount of random IO you perform. Mixing reads and writes at the same time on such servers usually isn't pretty: effectively you cannot utilize the same bandwidth as you can with a 1 gigabit link and read-only IO.


In the 100 megabit days, sure, it might have held, but that really isn't the case anymore. So as much as I appreciate that you have data on it, the fact that it was measured in the initial versions isn't the same as measuring it in 2017, because the world is very different today than it was years ago.


For the same reason, I believe we should have the option of selecting tar or rsync as the restoration method; then people can decide which one they want to use and which they find fastest.


[2] Compression isn't always optimal. When using compression (such as gzip or zlib), everything depends on the compression ratio you can get from your files. If your ratio is below 1.5 (which is pretty often the case for hosting accounts), the CPU overhead (even with fast CPUs) is actually higher than the cost of the extra transferred bytes, whether you ship a .tar archive or use rsync with compression.


I suggest making compression a configurable option, since compressing the transfer or the archive can add significant restoration (and backup) time.


We ran multiple account-restore tests across all kinds of scenarios; the only time compression was actually a benefit was with an account that contained very few images and *a lot* of bigger text files, because the number of bytes saved was so high.


But 99% of the cases resulted in compression just adding time to the whole process, sometimes seconds, sometimes many minutes. A few examples follow. All tests were performed after dropping the server's caches before every transfer, which gives a realistic picture of restoring files that sit cold on a backup server.
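For reference, this is the usual way to drop caches on Linux before each benchmark run. It requires root; the guard is my own addition so the sketch degrades gracefully inside containers where /proc/sys is read-only.

```shell
# Flush dirty pages, then drop the page cache, dentries and inodes so
# the next read really hits the disks instead of RAM
sync
if [ -w /proc/sys/vm/drop_caches ]; then
  echo 3 > /proc/sys/vm/drop_caches
  echo "caches dropped"
else
  echo "no permission to drop caches; run as root on the host"
fi
```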


Additionally, the backup server was configured with raid 6 using enterprise-grade SATA disks (Ultrastar 7K6000), which are known for rather good random IO and speed in general.

The system was also tuned for better IO on small files (stripe size, buffer sizes, etc.), meaning its overall IO performance is about 20% faster than a stock CentOS 7 kernel with default array settings.

The server's connection was 1 gigabit (shared), and the tests were performed between two datacenters on two different networks.


---------------


Account: 4.28 gigabyte in size with 244.072 files:


rsync with compression: 4 minutes and 9 seconds (speedup is 1.75)

rsync without compression: 3 minutes and 30 seconds (speedup is 1.00)

tar stream with extract on destination: 4 minutes and 39 seconds

tar.gz stream with extract on destination: 5 minutes and 30 seconds

tar on backup -> SCP: 4 minutes and 59 seconds + 42 seconds to scp


The method JetBackup uses matches the last one, except JetBackup also gzips the archive; this would only have added time.
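The "tar stream with extract on destination" variant avoids the intermediate archive entirely by piping tar's output straight into an extracting tar. A local simulation (the host and paths in the comment are hypothetical):

```shell
set -e
work=$(mktemp -d)
mkdir -p "$work/snapshot/homedir" "$work/dest"
echo "streamed" > "$work/snapshot/homedir/file.txt"

# No .tar(.gz) file ever touches disk: the archive exists only in the pipe.
# The networked form would be something like:
#   ssh backuphost 'tar -C /backups/user -cf - .' | tar -C /home/user -xf -
tar -C "$work/snapshot" -cf - . | tar -C "$work/dest" -xf -

diff -r "$work/snapshot" "$work/dest" && echo "stream restore OK"
```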


---------------


Account: 5.53 gigabyte and 49.665 files


rsync with compression: 1 minute and 29 seconds (speedup is 1.36)

rsync without compression: 1 minute and 1 second (speedup is 1.00)

tar on backup -> SCP: 2 minutes and 2 seconds


---------------


Account: 13.5 gigabyte and 208.289 files


rsync with compression: 5 minutes and 9 seconds (speedup is 1.42)

rsync without compression: 2 minutes and 24 seconds (speedup is 1.00)

tar on backup -> SCP: 6 minutes and 39 seconds (due to random IO and amount of files)


---------------


Account: 23.29 gigabyte and 581.969 files


rsync with compression: 19 minutes and 23 seconds (speedup is 1.08)

rsync without compression: 11 minutes and 39 seconds (speedup is 1.00)

tar on backup -> SCP: 26 minutes and 49 seconds

Comments (6)


Hi,


Thank you for your feedback :)


As you mentioned, we actually did take the "rsync -> restorepkg" approach first, but found it was slower.


We measured restore times across 150 servers (LAN connections) and found that rsyncing only one file is much faster, even when you include the time it takes to create the tar.gz file. Rsyncing 100k files over WAN can take a long time, while doing the same with only one file is very fast.


We can, however, consider exposing this in the settings so you can choose which restore approach you want.


We are currently doing heavy coding on the new 3.2 version, which is a major core change with many improvements; this will be considered as well.


Thanks,

Eli.


Hi Eli,


> We measured restore times across 150 servers (LAN connections) and found that rsyncing only one file is much faster


I've been able to test the following backup server configurations:

- 1 disk no raid

- 2 spinning disks raid 1

- 3 spinning disks raid 5

- 4 spinning disks raid 6

- 6 spinning disks raid 6

- 2 SSDs in raid 1


All were tested on 100, 1000 and 10000 mbit (public) links, with between 10 and 40 ms latency to the source server.


On none of the above configs, except the 100 mbit one, was the tar approach faster than sending with rsync directly (both with and without compression).


> Rsyncing 100k files over WAN can take a long time, while doing the same with only one file is very fast.


If you have really high latency to your restore destination, then sure, rsyncing 100k files can take some time; however, tarring 100k files also takes a lot of time on your backup media.

Sure, if you use SAS/SSD configs in your backup server you can do *a lot* of IOPS to build the tar, but generally speaking you won't be able to do many IOPS on a backup server anyway.


I don't know if you actually read the numbers above, but I rsynced 581.000 files and it was still faster than using tar and then transferring the *single* file over.


The other tests I did, with no raid and with raid 1, 5 and 6, showed the same result.

When you tar things, you also have to extract them on the other end, meaning you require the destination to sustain a large number of IOPS while unpacking.


Sure, plenty of servers can do this, but you're still demanding more IOPS from both systems by using tar.


The SSD config with tarring a huge number of files (it only mattered with 500k+ files) was about the same speed as rsync; however, people do not really use SSDs in their backup servers (and if they do, they have too much money).


I'm not sure how you performed these tests, but one thing that is very important when benchmarking is actually dropping all system caches first, to give a real picture of tar -> scp -> restore versus rsync: if the files are already in the server's cached memory, tar can be super fast even on slow disks.

The fact, however, is that the files you want to restore are most likely not in the cache (e.g. I only have 32-64 gigabytes of memory in a backup server but store 10 terabytes of data; it won't fit in memory).


There is a point where it becomes debatable whether tar -> scp -> untar is faster than rsync; it does happen with super tiny files. In my examples above, the average file size was 116, 68 and 42 kilobytes (top to bottom): rather normal hosting accounts running WordPress, Joomla, Magento, PrestaShop sites and so on, so they represent quite normal setups.


> We can, however, consider exposing this in the settings so you can choose which restore approach you want.


That would be nice, so I could actually get fast restores with JetBackup without having to use my own custom restore script, which is what I'm currently doing.


Hey,


On version 3.2.x we stopped compressing the backup files on the remote server.

We rsync them back to the server (into a tmp folder) and then restore the account.


Hello.

Do you back up each site/account alone, proceeding to the next one after deleting it from tmp, or all together? I'm asking because compressing backups on the live server is not optimal when the hdd or ssd is more than 50% full.


Hi,


We don't use a tmp folder; we sync the files directly from the homedir.

For compressed account backups, we compress them one by one, sync to the remote folder and delete.


Thanks,

Eli.


Lucas,

I am marking this as completed, as we changed the whole concept in version 3.2 and it's pretty much close to what you desire.

If possible, please install/upgrade to version 3.2, do a second review and share your wisdom :)


Thanks,

Eli.
