
Hi there,

I need to transfer 200+ TB of data from one storage server (Red Hat Linux based) to another (FreeBSD). I am planning to use rsync with multiple threads in a script. There are a number of suggestions on the Internet (find + xargs + rsync), but none of them has worked well so far. I also need a reliable way to check whether all files/directories from the source server have been copied to the destination server. Any suggestions/help would be appreciated.

Regards,
Bill

On Thu, 16 Feb 2017 12:12:44 PM Bill Yang via luv-main wrote:
I need to transfer 200+ TB of data from one storage server (Red Hat Linux based) to another (FreeBSD). I am planning to use rsync with multiple threads in a script. There are a number of suggestions on the Internet (find + xargs + rsync), but none of them has worked well so far. I also need a reliable way to check whether all files/directories from the source server have been copied to the destination server. Any suggestions/help would be appreciated.
If the files are reasonably large and can be relied on not to change file data without changing metadata, then checking is easy via a final run of rsync -va without the -c option. If the files are small, then a lot of the rsync time will be taken up by seeking for metadata, so that might not be viable (e.g. before SSDs became popular you couldn't just run something like a find / on a large mail server).

As for the multiple threads, the common way of doing this is copying by parent directory. For example, copying a server you might copy /var and /usr separately. That has the obvious problem that the sizes are often significantly different. If you have lots of files in one directory you could transfer /directory/[a-k]* in one process and /directory/[l-z]* in another. This wouldn't support deleting directories that have been removed from the source, but that can easily be fixed with a later pass of rsync -va, as long as the files are reasonably large.

Maybe it would help if you attached the scripts you tried using with xargs etc. so we could see what you tried.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On 16/02/17 12:12, Bill Yang via luv-main wrote:
Hi there
I need to transfer 200+ TB of data from one storage server (Red Hat Linux based) to another (FreeBSD). I am planning to use rsync with multiple threads in a script. There are a number of suggestions on the Internet (find + xargs + rsync), but none of them has worked well so far. I also need a reliable way to check whether all files/directories from the source server have been copied to the destination server. Any suggestions/help would be appreciated.
This may help: http://moo.nac.uci.edu/~hjm/parsync/ It provides scripts running on top of rsync with some crude load balancing and throttling.

Hi Bill,

I see you mention that you have tried the find | xargs rsync option without a lot of luck, so I will just put this here in case it is different from what you have tried, but we are using this quite successfully:

cd <source>; find . -maxdepth 1 -mindepth 1 ! -path './.*' -print0 | xargs -0 -n1 -P<no_of_threads> -I% rsync -irlt % <destination>/.

I am, however, keen to look at the parsync scripts linked by Nic. Thanks for the link.

Nick Evans

On 16 February 2017 at 16:01, Nic Baxter via luv-main <luv-main@luv.asn.au> wrote:
On 16/02/17 12:12, Bill Yang via luv-main wrote:
Hi there
I need to transfer 200+ TB of data from one storage server (Red Hat Linux based) to another (FreeBSD). I am planning to use rsync with multiple threads in a script. There are a number of suggestions on the Internet (find + xargs + rsync), but none of them has worked well so far. I also need a reliable way to check whether all files/directories from the source server have been copied to the destination server. Any suggestions/help would be appreciated.
This may help: http://moo.nac.uci.edu/~hjm/parsync/ It provides scripts running on top of rsync with some crude load balancing and throttling.

On Thu, Feb 16, 2017 at 12:12:44PM +1100, Bill Yang wrote:
I need to transfer 200+ TB data from one storage server (Red Hat Linux based) to another (FreeBSD).
I'm curious about a few things:

- How much does this data change over time?
- What speed/distance of link are you transferring over?
- Are you maxing out your disk/network bandwidth already?

"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway." — Tanenbaum, Andrew S. (1989)
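One quick way to answer the saturation question on the Linux side is to sample the interface counters in /proc/net/dev during a transfer and compare the rate against the link speed. A rough sketch (Linux-only; the interface name is a placeholder, defaulting to lo so it runs anywhere):

```shell
#!/bin/sh
# Sample an interface's receive-byte counter twice, one second
# apart, and report the rate. Compare the result against the NIC's
# rated speed to see how close to saturation the transfer is.
IFACE=${IFACE:-lo}   # substitute the transfer interface, e.g. eth0

rx_bytes() {
    # /proc/net/dev: "  eth0: <rx_bytes> <rx_packets> ..."
    awk -v dev="$IFACE:" '$1 == dev { print $2 }' /proc/net/dev
}

a=$(rx_bytes)
sleep 1
b=$(rx_bytes)
echo "$IFACE receiving at $((b - a)) bytes/s"
```

Tools like iostat (for disk utilisation) or iftop answer the same question with less effort, if they are installed on the servers.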

On Fri, Feb 17, 2017 at 06:25:38PM +1100, Joel W. Shea wrote:
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is worth it or not. Almost all of the time, rsync is going to be I/O bound (disk and network) rather than CPU bound, so adding more rsync processes is just going to slow them all down even more. A single rsync process can saturate the disk and I/O bandwidth of most common disk subsystems and network connections.

About the only time more rsync processes might help is if you're transferring between two servers with SSD storage arrays via a direct-connect 10+ Gbps link... and even then, only if the disk + network throughput is at least a few multiples of what a single rsync job (incl. child processes for ssh and/or compression, if any) can cope with.

Or if the source AND destination of each of the multiple rsyncs are on completely separate disks/storage arrays so they don't compete with each other for disk I/O. E.g. an rsync from server1/disk1 to server2/disk1 can run at the same time as an rsync from server1/disk2 to server2/disk2... especially if you can use separate network interfaces for each rsync.

Splitting up the transfer into multiple smaller rsync jobs to be run consecutively, not simultaneously, can be useful... especially if you intend to run the transfers multiple times to get new/changed/deleted/etc. files since the last run. There's a lot of startup overhead (and RAM & CPU usage) with rsync on every run, comparing file lists and file timestamps and/or checksums to figure out what needs to be transferred. Multiple smaller transfers (e.g. of entire subdirectory trees) tend to be noticeably faster than one large transfer.

In other words, multiple parallel rsyncs is usually a false optimisation.

craig

--
craig sanders <cas@taz.net.au>

On Tue, 21 Feb 2017 04:58:52 PM Craig Sanders via luv-main wrote:
On Fri, Feb 17, 2017 at 06:25:38PM +1100, Joel W. Shea wrote:
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is worth it or not. Almost all of the time, rsync is going to be I/O bound (disk and network) rather than CPU bound - so adding more rsync processes is just going to slow them all down even more. A single rsync process can saturate the disk and I/O bandwidth of most common disk subsystems and network connections.
If you have a RAID-1 array then you should be able to benefit from having as many processes reading as there are mirrors of the data (i.e. the transmitting end, and also the receiver when updating previous data). If you have RAID-5 then you should get some benefit from multiple readers, but it's not as easy to predict.

The same applies for command queuing in a single device, but for a much smaller benefit. Linux does some queuing of requests and it's theoretically possible to get some benefit from multiple processes accessing a single disk at one time. But the benefit will probably be small.

If you have a process that does some CPU operations as well as some I/O, there is potential for a performance improvement from running multiple processes at once if nothing else is using the disk. For example, if the process is using 10% CPU time and 90% iowait, then you could get about a 10% performance increase by using a second process, as there will almost always be a process blocked on disk I/O.

Apart from the case of two processes reading from a RAID-1 device, the benefits from all of these are small. But if, for example, you want to transition a server to new hardware or a new DC in an 8 hour downtime window and the transfer looks like it will take 9 hours, these are things you really want to do.
splitting up the transfer into multiple smaller rsync jobs to be run consecutively, not simultaneously, can be useful....especially if you intend to run the transfers multiple times to get new/changed/deleted/etc files since the last run. There's a lot of startup overhead (and RAM & CPU usage) with rsync on every run, comparing file lists and file timestamps and/or checksums to figure out what needs to be transferred. Multiple smaller transfers (e.g. of entire subdirectory trees) tend to be noticably much faster than one large transfer.
Yes, especially if you are running out of dentry cache.
in other words, multiple parallel rsyncs is usually a false optimisation.
The thing that concerns me most about such things is the potential for mistakes. For everything you do there is some probability of stuffing it up. Is the probability of a stuff-up a reasonable trade-off for the performance improvement?

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Tue, 21 Feb 2017, Craig Sanders wrote:
On Fri, Feb 17, 2017 at 06:25:38PM +1100, Joel W. Shea wrote:
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is worth it or not. Almost all of the time, rsync is going to be I/O bound (disk and network) rather than CPU bound - so adding more rsync processes is just going to slow them all down even more. A single rsync process can saturate the disk and I/O bandwidth of most common disk subsystems and network connections.
about the only time more rsync processes might help is if you're transferring between two servers with SSD storage arrays via a direct-connect 10+Gbps link....and even then, only if the disk + network throughput is at least a few multiples of what a single rsync job (incl. child processes for ssh and/or compression if any) can cope with.
or if the source AND destination of each of the multiple rsyncs are on completely separate disks/storage-arrays so they don't compete with each other for disk i/o. e.g. rsync from server1/disk1 to server2/disk1 can run at the same time as an rsync from server1/disk2 to server2/disk2...especially if you can use separate network interfaces for each rsync.
Not quite. It matters on read (which can be both sides when you're rewriting data), and not just for arrays with multiple spindles.

A single rsync issues one read, the array does its seek and finds the relevant spindles, and the reading rsync then sends that to the remote, which can cache and reorder as necessary when writing. If you have multiple independent rsyncs, then one rsync blocks on read, a second rsync blocks on read but its required data is closer to the heads, and a third rsync finds another disk in the array that the other rsyncs haven't made busy yet. You can get a benefit from running more rsyncs than the number of spindles, because your block scheduler/RAID controller/disk controller knows that one bit of data is closer than another, if there are multiple in-flight SCSI commands.

For writes, you get no benefit unless rsync issues blocking fsyncs (I can't remember if it does; if I had to optimise for data transfer, I'd investigate this and consider using libeatmydata, with the caveat that I'd need to manually rerun rsync in the event of a hardware fault soon after any transfers were run).

--
Tim Connors
participants (7)
- Bill Yang
- Craig Sanders
- Joel W. Shea
- Nic Baxter
- Nick Evans
- Russell Coker
- Tim Connors