Best way to copy a large directory tree

What's the best way to copy a large directory tree (around 3TB in total) with a combination of large and small files? The files currently reside on my NAS which is on my LAN (connected via gigabit ethernet) and are mounted on my system as a NFS share. I would like to copy all files/directories to an external hard disk connected via USB.
I care about speed, but I also care about reliability, making sure that every file is copied, that all metadata is preserved and that errors are handled gracefully. I've done some research and currently, I am thinking of using tar or rsync, or a combination of the two. Something like:
tar --ignore-failed-read -C $SRC -cpf - . | tar --ignore-failed-read -C $DEST -xpvf -
to copy everything initially, and then
rsync -ahSD --ignore-errors --force --delete --stats $SRC/ $DIR/
To check everything with rsync.
What do you guys think about this? Am I missing something? Are there better tools for this? Or other useful options for tar and rsync that I am missing?
Cheers
-- Aryan

On 26/03/2013, at 6:44, Aryan Ameri <info@ameri.me> wrote:
What's the best way to copy a large directory tree (around 3TB in total) with a combination of large and small files? The files currently reside on my NAS which is on my LAN (connected via gigabit ethernet) and are mounted on my system as a NFS share. I would like to copy all files/directories to an external hard disk connected via USB.
I care about speed, but I also care about reliability, making sure that every file is copied, that all metadata is preserved and that errors are handled gracefully. I've done some research and currently, I am thinking of using tar or rsync, or a combination of the two. Something like:
tar --ignore-failed-read -C $SRC -cpf - . | tar --ignore-failed-read -C $DEST -xpvf -
to copy everything initially, and then
rsync -ahSD --ignore-errors --force --delete --stats $SRC/ $DIR/
To check everything with rsync.
What do you guys think about this? Am I missing something? Are there better tools for this? Or other useful options for tar and rsync that I am missing?
Cheers
-- Aryan
Home NAS devices are very CPU limited so compressing files is the last thing you want to do. An rsync still has to build file lists and calculate a hash for each file, but this process will be much faster.
I suggest an initial rsync to move the bulk of the data and then a final pass when you're ready to stop using the NAS. If you want absolute speed, dd will be much faster than both.
Edward
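A minimal sketch of the two-pass rsync approach Edward describes, assuming the NFS share is mounted at $SRC and the USB disk at $DEST (the same placeholders used in the original post):

# initial bulk copy; -a preserves permissions, times, ownership, symlinks, etc.
rsync -av "$SRC"/ "$DEST"/
# final pass once the NAS has stopped changing; --delete removes files from the
# destination that no longer exist on the source
rsync -av --delete "$SRC"/ "$DEST"/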

On Tue, Mar 26, 2013 at 8:08 AM, Edward Savage <epssyis@gmail.com> wrote:
Home NAS devices are very CPU limited so compressing files is the last thing you want to do.
Correct me if I am mistaken, but I thought the machine that's going to be compressing and decompressing is my workstation/desktop, so the CPU that will be taxed is that one, no? How does it then have anything to do with the CPU in my NAS? Cheers -- Aryan

On Tue, Mar 26, 2013 at 8:18 AM, Aryan Ameri <info@ameri.me> wrote:
On Tue, Mar 26, 2013 at 8:08 AM, Edward Savage <epssyis@gmail.com> wrote:
Home NAS devices are very CPU limited so compressing files is the last thing you want to do.
Correct me if I am mistaken, but I thought the machine that's going to be compressing and decompressing is my workstation/desktop, so the CPU that will be taxed is that one, no? How does it then have anything to do with the CPU in my NAS?
Yes, that is correct. But if you are taking files from the NAS, compressing them on your local machine, then uncompressing them on the same local machine before putting them on the external USB drive, why compress/uncompress at all?
At the end of the day you are still moving uncompressed data across the network, so you are just adding extra steps with tar...
Cheers -- Aryan
-- Mark "Pockets" Clohesy Mob Phone: (+61) 406 417 877 Email: hiddensoul@twistedsouls.com G-Talk: mark.clohesy@gmail.com GNU/Linux..Linux Counter #457297 - "I would love to change the world, but they won't give me the source code" "Linux is user friendly...its just selective about who its friends are"

On Tue, Mar 26, 2013 at 8:39 AM, Hiddensoul (Mark Clohesy) <hiddensoul@twistedsouls.com> wrote:
At the end of the day you are still moving uncompressed data across the network so you are just adding extra steps with tar...
Okay, so what do you recommend? What's the right tool to use? cp? cpio? dd? dar? rsync? I am looking for something that's reliable, that can handle read errors without exiting, that can continue the operation where it was left off if it was interrupted, that can preserve all metadata and symlinks. What's the right tool to use for this? Cheers -- Aryan

Aryan Ameri <info@ameri.me> wrote:
I am looking for something that's reliable, that can handle read errors without exiting, that can continue the operation where it was left off if it was interrupted, that can preserve all metadata and symlinks. What's the right tool to use for this?
Rsync can do most of the above.

I also just realised that I never actually wanted to do compression or de-compression in the first place! The command I originally proposed is:
tar --ignore-failed-read -C $SRC -cpf - . | tar --ignore-failed-read -C $DEST -xpvf -
which, as far as I can tell, has no compression in it, no?
Cheers -- Aryan
On Tue, Mar 26, 2013 at 9:39 AM, Jason White <jason@jasonjgw.net> wrote:
Aryan Ameri <info@ameri.me> wrote:
I am looking for something that's reliable, that can handle read errors without exiting, that can continue the operation where it was left off if it was interrupted, that can preserve all metadata and symlinks. What's the right tool to use for this?
Rsync can do most of the above.

On Tue, Mar 26, 2013 at 08:51:10AM +1100, Aryan Ameri wrote:
I am looking for something that's reliable, that can handle read errors without exiting, that can continue the operation where it was left off if it was interrupted, that can preserve all metadata and symlinks. What's the right tool to use for this?
rsync. Repeat runs of rsync until it finishes without error.
craig
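One hedged way to script the "repeat until it finishes without error" approach, assuming $SRC and $DEST as in the original post (exit code 24 only means some source files vanished mid-transfer, which is often acceptable):

# keep re-running rsync until it exits cleanly
until rsync -a "$SRC"/ "$DEST"/; do
    rc=$?
    [ "$rc" -eq 24 ] && break   # 24 = source files vanished during transfer
    echo "rsync exited with status $rc, retrying in 60 seconds..." >&2
    sleep 60
done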

Hi there
Just a point to consider: does your NAS have USB ports on it? If so, I'd log in there and do it all directly on the NAS. It does seem that an awful lot of time is being wasted here on network latency, especially if the NAS has any USB 3 ports.
Personally I would use rsync all the way with this stuff, but I'm not likely to be the most knowledgeable on that one.
Cheers
Paul Miller
On 26 March 2013 08:39, Hiddensoul (Mark Clohesy) <hiddensoul@twistedsouls.com> wrote:
On Tue, Mar 26, 2013 at 8:18 AM, Aryan Ameri <info@ameri.me> wrote:
On Tue, Mar 26, 2013 at 8:08 AM, Edward Savage <epssyis@gmail.com> wrote:
Home NAS devices are very CPU limited so compressing files is the last thing you want to do.
Correct me if I am mistaken, but I thought the machine that's going to be compressing and decompressing is my workstation/desktop, so the CPU that will be taxed is that one, no? How does it then have anything to do with the CPU in my NAS?
Yes, that is correct. But if you are taking files from the NAS, compressing them on your local machine, then uncompressing them on the same local machine before putting them on the external USB drive, why compress/uncompress at all?
At the end of the day you are still moving uncompressed data across the network, so you are just adding extra steps with tar...
Cheers -- Aryan

On Tue, Mar 26, 2013 at 08:39:33AM +1100, Hiddensoul (Mark Clohesy) wrote:
Yes that is correct, if you are taking files from the NAS then compressing them on your local machine then uncompressing them on the same local machine then putting them on the external USB drive, why compress/uncompress at all ?
At the end of the day you are still moving uncompressed data across the network so you are just adding extra steps with tar...
tar doesn't compress unless you tell it to with one of the compression options such as '-z' (gzip) or '-j' (bzip2). any recent GNU tar (i.e. from the last few years) has several compression options available, and '-I' if you need to tell it to use a particular program that doesn't have its own specific option.
for the initial copy, 'cp -af' or 'tar cf - ... | tar xf - ...' would be a reasonable choice. rsync wouldn't be bad either, but would use significantly more memory and CPU. follow up with a final rsync (or two, or several dozen - once the bulk copy has been done you can keep rsyncing daily or whatever until you're ready).
in short, it doesn't really matter what you use as long as the machine doing the copying has sufficient RAM and CPU power.
craig
-- craig sanders <cas@taz.net.au>
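A sketch of what Craig describes, assuming $SRC and $DEST as before: an uncompressed tar pipe for the bulk copy, followed by rsync to catch anything missed or changed:

# bulk copy: plain (uncompressed) tar pipe, preserving permissions
tar -C "$SRC" -cpf - . | tar -C "$DEST" -xpf -
# catch-up pass(es); safe to repeat until you're ready to switch over
rsync -a --delete "$SRC"/ "$DEST"/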

Edward Savage <epssyis@gmail.com> writes:
tar --ignore-failed-read -C $SRC -cpf - . | tar --ignore-failed-read -C $DEST -xpvf -
Home NAS devices are very CPU limited so compressing files is the last thing you want to do.
I don't see where he's compressing anything. OP: if you use tar you'll want to be root on both sides or you'll lose ownership &c stuff. Also you MAY want --numeric-owner.
An rsync still has to build file lists and calculate a hash for each file, but this process will be much faster. I suggest an initial rsync to move the bulk of the data and then a final pass when you're ready to stop using the NAS.
I would say just use rsync for both runs because it's less hassle, and if you interrupt it halfway you can pick up where you left off (tar obviously won't).
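Folding Trent's two tar caveats into the original command (run as root on both sides, and optionally pass --numeric-owner so UIDs/GIDs are preserved exactly rather than mapped by name), a hedged sketch using the thread's $SRC/$DEST placeholders:

# root on both ends of the pipe; --numeric-owner avoids surprises when
# user/group name-to-ID mappings differ between the NAS and the USB disk
sudo tar --ignore-failed-read --numeric-owner -C "$SRC" -cpf - . \
  | sudo tar --numeric-owner -C "$DEST" -xpf -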

Aryan Ameri <info@ameri.me> writes:
rsync -ahSD --ignore-errors --force --delete --stats $SRC/ $DIR/
Note -a does not do -HSX, and you're only passing -S. Both -H and -S have significant overhead, so pass them iff you need them. -a implies -D, so don't bother. Looks like -a also omits -A, though that's never stung me. -X stung me once due to capabilities(7); IIRC samba also uses it.
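For reference, a hedged variant of the original rsync command with Trent's notes applied: -a already implies -D, and -H/-A/-X/-S are added only if hard links, ACLs, extended attributes or sparse files actually matter (and the destination filesystem supports them):

# baseline: -a covers -rlptgoD
rsync -ah --ignore-errors --force --delete --stats "$SRC"/ "$DEST"/
# only if needed: -H hard links, -A ACLs, -X xattrs, -S sparse files
rsync -ahHAXS --ignore-errors --force --delete --stats "$SRC"/ "$DEST"/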

On Tue, Mar 26, 2013 at 12:04 PM, Trent W. Buck <trentbuck@gmail.com> wrote:
Note -a does not do -HSX, and you're only passing -S. Both -H and -S have significant overhead, so pass them iff you need them.
-a implies -D, so don't bother.
Looks like -a also omits -A, though that's never stung me. -X stung me once due to capabilities(7); IIRC samba also uses it.
Thanks to everyone for their very helpful advice, especially Craig, Mike and Trent. I've done a few test runs on a few small subdirectories, and the bottleneck seems to be the USB 2.0 interface of my computer, with plenty of RAM, CPU and network capacity to spare.
On a test run copying to my internal SSD, tar was significantly faster than rsync (around 5X faster in my particular setup), but when copying to the external HDD, as I said, it was the USB 2.0 interface that was the bottleneck, so I've decided that I might as well use the more reliable tool, rsync, for the job.
Also thanks for the tips on rsync. I can't believe that in over 14 years of using Linux as my main desktop OS, I had never used this tool. Looking back, it could have been very useful in a few situations where cp and cpio and tar left a lot to be desired.
Cheers -- Aryan

Aryan Ameri <info@ameri.me> wrote:
Thanks to everyone for their very helpful advice, especially Craig, Mike and Trent. I've done a few test runs on a few small subdirectories, and the bottleneck seems to be the USB 2.0 interface of my computer, with plenty of RAM, CPU and network capacity to spare.
USB 2 is notoriously slow in this scenario. For details, look up Sarah Sharp's talk at LCA several years ago on USB 3, which solves the problems. As I recall, it isn't the actual data transfer rate that causes the poor performance, but I can't remember the details now.

Jason White wrote:
Aryan Ameri <info@ameri.me> wrote:
Thanks to everyone for their very helpful ............snip
USB 2 is notoriously slow in this scenario. For details, look up Sarah Sharp's talk at LCA several years ago on USB 3, which solves the problems. As I recall, it isn't the actual data transfer rate that causes the poor performance, but I can't remember the details now.
Apologies for the digression; all I could find of Sarah's talk were some presentation slides, which didn't seem very informative. Notwithstanding, Wikipedia quotes maximum data transfer rates of 480 Mb/s for USB 2.0 and 5 Gb/s for USB 3.0; is the above to be read as stating that one can actually expect a USB 3.0 to USB 2.0 ratio > (5 Gb/s / 480 Mb/s) ≈ 10?
regards Rohan McLeod

On Wed, Mar 27, 2013 at 4:24 PM, Rohan McLeod <rhn@jeack.com.au> wrote:
Jason White wrote:
Wikipedia quotes maximum data transfer rates: for USB 2.0 at 480 Mb/s and USB 3.0 at 5Gb/s;
In my experience the maximum data rate of USB 2.0 has zero correlation with reality. Which is why I loved Firewire... pity everyone gave up on it! Cheers -- Aryan

On Wed, Mar 27, 2013 at 04:24:18PM +1100, Rohan McLeod wrote:
Jason White wrote:
Aryan Ameri <info@ameri.me> wrote:
Thanks to everyone for their very helpful ............snip
USB 2 is notoriously slow in this scenario. For details, look up Sarah Sharp's talk at LCA several years ago on USB 3, which solves the problems. As I recall, it isn't the actual data transfer rate that causes the poor performance, but I can't remember the details now.
Apologies for the digression; all I could find of Sarah's talk were some presentation slides, which didn't seem very informative. Notwithstanding, Wikipedia quotes maximum data transfer rates of 480 Mb/s for USB 2.0 and 5 Gb/s for USB 3.0; is the above to be read as stating that one can actually expect a USB 3.0 to USB 2.0 ratio > (5 Gb/s / 480 Mb/s) ≈ 10?
regards Rohan McLeod
Continuing the digression, you can find Sarah Sharp's talk here:
http://mirror.internode.on.net/pub/linux.conf.au/2010/friday/50230.ogv
Stoo

Hi Aryan
I've been using rsync to migrate whole hard drives for years. In my opinion it's the only way to do it.
rsync -aPSvx --numeric-ids --delete <source-directory>/ <destination-host>:<destination-directory>/
If you want compression you can add a 'z'. ssh is used between machines if you add the destination host. The '/' characters are very important or you will end up deleting the wrong thing when you run rsync multiple times (a concrete example follows the quoted message below).
I use vservers (www.linux-vservers.org), which means I copy the files a few times, then shut down the virtual guest, run rsync once, and start it up on the new machine.
Cheers Mike
On 26/03/13 6:14 AM, Aryan Ameri wrote:
What's the best way to copy a large directory tree (around 3TB in total) with a combination of large and small files? The files currently reside on my NAS which is on my LAN (connected via gigabit ethernet) and are mounted on my system as a NFS share. I would like to copy all files/directories to an external hard disk connected via USB.
I care about speed, but I also care about reliability, making sure that every file is copied, that all metadata is preserved and that errors are handled gracefully. I've done some research and currently, I am thinking of using tar or rsync, or a combination of the two. Something like:
tar --ignore-failed-read -C $SRC -cpf - . | tar --ignore-failed-read -C $DEST -xpvf -
to copy everything initially, and then
rsync -ahSD --ignore-errors --force --delete --stats $SRC/ $DIR/
To check everything with rsync.
What do you guys think about this? Am I missing something? Are there better tools for this? Or other useful options for tar and rsync that I am missing?
Cheers
-- Aryan
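To make Mike's point about trailing slashes concrete, a small example with hypothetical mount points (/mnt/nas/data and /mnt/usb/backup are made up for illustration):

# trailing slash on the source: copy the *contents* of data into backup/
rsync -a /mnt/nas/data/ /mnt/usb/backup/
# no trailing slash: creates /mnt/usb/backup/data instead, so a later run of
# the first form with --delete would remove files you meant to keep
rsync -a /mnt/nas/data /mnt/usb/backup/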

On Tue, 26 Mar 2013, Aryan Ameri <info@ameri.me> wrote:
What's the best way to copy a large directory tree (around 3TB in total) with a combination of large and small files? The files currently reside on my NAS which is on my LAN (connected via gigabit ethernet) and are mounted on my system as a NFS share. I would like to copy all files/directories to an external hard disk connected via USB.
As others have already suggested, I recommend rsync. Use it without the -c option (eg "rsync -va") and you can run it multiple times with little overhead. The -c option makes it do checksums (i.e. read all file data), but without it only metadata is checked, and the cache on modern systems will generally cover that.
One problem I've had with combining rsync and something else is getting the use of trailing / characters wrong and copying things twice.
On Tue, 26 Mar 2013, Paul Miller <paul.miller@rmit.edu.au> wrote:
Does your NAS have usb ports on it?
If so, I'd log into there and do it all there. It does seem that an awful lot of time is being wasted here with network latency.
Especially if the NAS has any USB 3 ports.
USB 3 has a theoretical speed of 5Gbit/s, or about 500MB/s. If the NAS and drive both support it then it will be faster than Gig-E. If either the NAS or the drive is USB 2 then you have a theoretical speed of 480Mbit/s and a maximum that I've measured of about 35MB/s.
I doubt that a file sharing protocol would need 3x the data transfer of a filesystem on a block device to copy files, so the bottleneck when using GigE to copy files to a USB 2.0 device should be USB. Of course using GigE will add a little latency which will slow things down a bit, but then you could run two copies of rsync at the same time to stop that being a problem (see the sketch after this message).
On Tue, 26 Mar 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
Note -a does not do -HSX, and you're only passing -S. Both -H and -S have significant overhead, so pass them iff you need them.
Why does -S have significant overhead? If the source file doesn't have blocks of zeros then there shouldn't be any difference; if it does have zero blocks then it's just replacing a write with a seek.
On Wed, 27 Mar 2013, Jason White <jason@jasonjgw.net> wrote:
USB 2 is notoriously slow in this scenario. For details, look up Sarah Sharp's talk at LCA several years ago on USB 3, which solves the problems. As I recall, it isn't the actual data transfer rate that causes the poor performance, but I can't remember the details now.
http://en.wikipedia.org/wiki/USB_2#USB_2.0_.28High_Speed.29 Wikipedia says that USB 2 has a "maximum signaling rate of 480 Mbit/s (effective throughput up to 35 MB/s or 280 Mbit/s)". So it seems that it's about 58% efficient. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
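A sketch of Russell's suggestion to run two copies of rsync at once, assuming the tree splits into disjoint top-level directories (photos/ and videos/ are hypothetical names) and $SRC/$DEST as before:

# two transfers in flight hide some of the per-file network latency
rsync -a "$SRC"/photos/ "$DEST"/photos/ &
rsync -a "$SRC"/videos/ "$DEST"/videos/ &
wait
# final single pass over the whole tree to pick up anything missed
rsync -a "$SRC"/ "$DEST"/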

Russell Coker <russell@coker.com.au> writes:
On Tue, 26 Mar 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
Note -a does not do -HSX, and you're only passing -S. Both -H and -S have significant overhead, so pass them iff you need them.
Why does -S have significant overhead? If the source file doesn't have blocks of zeros then there shouldn't be any difference, if it does have zero blocks then it's just replacing a write with a seek.
Mea culpa, I was speaking from memory and it looks like I remembered wrong. As the manpage says:
-a, --archive   archive mode; equals -rlptgoD (no -H,-A,-X)
So sparse isn't in there, but it doesn't have significant overheads. It *does* conflict with --in-place, which I didn't mention until now.
participants (11)
- Aryan Ameri
- Craig Sanders
- Edward Savage
- Hiddensoul (Mark Clohesy)
- Jason White
- Mike O'Connor
- Paul Miller
- Rohan McLeod
- Russell Coker
- Stewart Johnston
- trentbuck@gmail.com