
Hi All,

I recently experienced an SSD failure, and so I have purchased another to set up my system again. I received some substantial help from this list early in 2019 to build my machine with this SSD as / and /home under Ubuntu 18.04, with two x 2TB conventional drives in RAID for storing my work; all are running btrfs.

After the machine was running I was asked if I had set up the machine using Ubuntu Server. I hadn't, because at that time I didn't see those options. I am thinking, then, for this build, perhaps I should set it up using Ubuntu Server. I will need to get my system to recognise the RAID drives as well.

So before I jump in the deep end again, are there any "gotchas" of which I should be aware? Will the server version make life more reliable?

Many thanks
Andrew Greig

On Fri, 17 Jan 2020 at 11:36, Andrew Greig via luv-main <luv-main@luv.asn.au> wrote:
I am thinking, then, for this build, perhaps I should set it up using Ubuntu Server. I will need to get my system to recognise the RAID drives as well.
So before I jump in the deep end again, are there any "gotchas" of which I should be aware.
Will the server version make life more reliable?
Under the hood they're identical, i.e. same kernel, same core system apps etc. They differ in that the desktop versions have GUIs installed by default and come pre-packaged with desktop-focused everyday apps, whereas the server version doesn't.

During installation of the server version you usually select a role for the server (or not), e.g. LAMP, mail, print, samba etc. You don't get a GUI and typically administer it via CLI over SSH. Nor do you get all the other guff that comes with a desktop version, so in some respects it's relatively clean. You can then add a GUI if you so wish.

Each to their own, depends on what *you* want to do with it.

-- Colin Fee tfeccles@gmail.com
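For what it's worth, adding a GUI to a server install later is usually just a metapackage away; a minimal sketch for Ubuntu, where the standard desktop metapackage is ubuntu-desktop:

    sudo apt update
    sudo apt install ubuntu-desktop   # pulls in the stock GNOME desktop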

Thanks Colin. I found a summary I had written for myself; it's in my next message. There are a couple of questions included there, because the data drives are already holding data.

Andrew

On Fri, Jan 17, 2020 at 11:36:29AM +1100, Andrew Greig wrote:
I recently experienced an SSD failure, and so I have purchased another to set up my system again. I received some substantial help from this list early in 2019 to build my machine with this SSD as / and /home under Ubuntu 18.04 with two x 2Tb conventional drives in RAID for storing my work, all are running btrfs.
You lost your home dir and the data in it when your SSD failed because your rootfs and /home on the SSD didn't have any redundancy (i.e. it was a single partition, with no RAID). I strongly recommend setting up a cron job to regularly snapshot it (at least once/day) and do a 'btrfs send' of that snapshot to a sub-volume of your /data filesystem.

That way you won't lose much data from that partition if your SSD dies again - you can retrieve it from the last snapshot backup, and will only lose any changes since then. If your / and /home are on separate partitions (or btrfs sub-volumes) you will need to do this for both of them. (If you weren't running btrfs on /, you could do this with rsync instead of 'btrfs send', but rsync would be a lot slower.)

IME, drives are fragile and prone to failure. It's always best to make plans and backup procedures so that WHEN (not IF) a drive fails, you don't lose anything important...or, at least, minimise your losses.

Also, remember that RAID is not a substitute for backup, so you should regularly back up your /data filesystem to tape or other drives. Ideally, you should try to have an off-site backup in case of fire/flood/etc (e.g. backup to an external USB drive and store it at your office, lawyer's safe, a friend's house or somewhere. Have at least two of these so you can rotate the offsite backups).
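A minimal sketch of what such a daily snapshot-and-send job could look like, assuming /home is a btrfs subvolume and that /home/.snapshots and /data/backups already exist (all paths and names here are examples only, adjust to your own layout):

    #!/bin/sh
    # run from root's crontab, e.g.: 0 2 * * * /usr/local/sbin/home-backup.sh
    today=$(date +%Y-%m-%d)
    # 'btrfs send' needs a read-only snapshot, hence -r
    btrfs subvolume snapshot -r /home "/home/.snapshots/home-$today"
    # copy the snapshot onto the raid-1 /data filesystem
    btrfs send "/home/.snapshots/home-$today" | btrfs receive /data/backups/

For incremental sends you would add -p with the previous snapshot to 'btrfs send', so that only changed blocks are transferred.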
After the machine was running I was asked if I had set up the machine using Ubuntu Server, I hadn't, because at that time I didn't see those options.
I am thinking, then, for this build, perhaps I should set it up using Ubuntu Server. I will need to get my system to recognise the RAID drives as well.
If the installer doesn't automatically detect your /data btrfs filesystem and add it to /etc/fstab, it's easy enough to add it yourself.
So before I jump in the deep end again, are there any "gotchas" of which I should be aware.
Will the server version make life more reliable?
the only significant difference between the server and desktop versions of ubuntu is the set of packages installed by default. e.g. the desktop version installs a whole bunch of desktop stuff (X, desktop environment and GUI apps, etc) that the server version doesn't. Otherwise, they're the same - same kernel, same libc and other standard system libraries, etc.

craig

-- craig sanders <cas@taz.net.au>

Thanks Craig,

I have elected to start with a Ubuntu 18.04 LTS desktop install. The RAID drives were picked up, i.e. are available, but does the balance command need to be issued again? I had two lines to set up the RAID and balance them at the start. I suspect that without those commands only one drive will be written to.

Thanks for your assistance
Andrew

Sent from Samsung tablet.

On Sat, Jan 18, 2020 at 01:20:39PM +1100, pushin.linux wrote:
I have elected to start with a Ubuntu 18.04 LTS desktop install. The RAID drives were picked up, i.e. are available, but does the balance command need to be issued again?
You only need to run 'btrfs balance' when you're changing the number and/or size of drives (or partitions) in the btrfs array. The command re-balances all of the data on the array, roughly equally across all the drives. So, if you're not adding drives to the array, you don't need to re-balance it. (btw, 'btrfs balance' is the one feature that btrfs has that I wish zfs had)
I had two lines to set up the raid and balance them at the start.
IIRC, I think I advised you to do something like:

1. create a degraded btrfs array with just one of the drives;
2. copy your data to it;
3. add another drive to the btrfs array with 'btrfs add';
4. re-balance the data so that it's on both drives with 'btrfs balance'.

If so, that'll be why you have two commands written down.
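For reference, that initial sequence would have looked roughly like the following (device names are examples only, and none of it should be re-run on the existing array - see below):

    # create a btrfs filesystem on the first data drive only, and mount it
    mkfs.btrfs /dev/sdb1
    mkdir -p /data
    mount /dev/sdb1 /data
    # ... copy the existing data onto /data ...
    # add the second drive to the filesystem
    btrfs device add /dev/sdc1 /data
    # convert to raid1 and spread the existing data across both drives
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /data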
I suspect that without those commands only one drive will be written to.
nope. This time around, your btrfs array for /data ALREADY EXISTS, so you don't have to do any of that. And you certainly SHOULD NOT run mkfs.btrfs, that would erase your current btrfs array and re-format it.

All you need to do this time is add an entry to /etc/fstab so that it mounts correctly on boot. Something like the following:

UUID="c0483385-ca6f-abb3-aeeb-94793439a637" /data btrfs defaults,relatime 0 0

Run 'blkid' to find the correct uuid for your /data fs and use it instead of the bogus one in the example above.

craig

-- craig sanders <cas@taz.net.au>
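In practice that amounts to something like the following (the UUID below is a placeholder; use the one blkid reports for your array, and run it all as root):

    blkid                       # note the UUID shared by both /data drives
    mkdir -p /data
    echo 'UUID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" /data btrfs defaults,relatime 0 0' >> /etc/fstab
    mount /data                 # if this works now, it will also mount at boot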

Hi All,

Here is my fstab after the install; it seems that my two "RAID" drives are just "dwellers on the threshold" as they do not appear in fstab.

alg@andrewg:~$ sudo cat /etc/fstab
[sudo] password for alg:
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda3 during installation
UUID=2dfcd965-625b-47d5-a267-b02276320922 / btrfs defaults,subvol=@ 0 1
# /home was on /dev/sda3 during installation
UUID=2dfcd965-625b-47d5-a267-b02276320922 /home btrfs defaults,subvol=@home 0 2
# swap was on /dev/sda2 during installation
UUID=b2c6d1c4-4b94-4171-954e-9f5d56704514 none swap sw 0 0
alg@andrewg:~$

Are the following two commands OK to apply to drives that were balanced previously and hold data?

sudo btrfs device add -f /dev/sdc1 /data
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /data

And will issuing those commands write that into fstab?

Many thanks
Andrew

On Sat, Jan 18, 2020 at 01:41:05PM +1100, Andrew Greig wrote:
Are these two following commands OK to apply to drives that were balanced previously and hold data?
sudo btrfs device add -f /dev/sdc1 /data
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /data
No, don't run any of those commands, especially the 'btrfs add' command - you will destroy your existing data array if you run that.

Run blkid to list all attached block devices, figure out which one of them is your data array, and add an entry for it to /etc/fstab. If you can't figure out which is the correct one, reply and include blkid's output.
and will issuing those commands write that into fstab?
no. craig -- craig sanders <cas@taz.net.au> BOFH excuse #376: Budget cuts forced us to sell all the power cords for the servers.

Hi,

Just some thoughts....

Way back, SSDs were expensive and less reliable than today. Given the cost of SSDs today, I would consider even RAIDing the SSDs.

btrfs -- I never, ever considered that to be really production-ready and I believe that even dead hat has moved away from it somewhat (not sure to what extent). If you are storing photos and videos, then absolute data integrity might not be an issue, but what happens if you need recovery with btrfs failures of any kind? I would think you would be in trouble and will need plenty of help with this.

I like the idea of btrfs, but prefer zfs; in any case I just use ext4 over encrypted RAID volumes.

Cheers
A

Hi, On 18/1/20 2:14 pm, Andrew McGlashan via luv-main wrote:
btrfs -- I never, ever considered that to be real production ready and I believe that even dead hat has moved away from it somewhat (not sure to what extent).
Some links, none of which are new as this occurred some time ago now.

https://www.theregister.co.uk/2017/08/16/red_hat_banishes_btrfs_from_rhel
https://www.marksei.com/red-hat-deprecates-btrfs-stratis/
https://news.ycombinator.com/item?id=14907771

Oh and one newer link, fwiw.

https://access.redhat.com/discussions/3138231

Cheers
A.

On Saturday, 18 January 2020 2:34:52 PM AEDT Andrew McGlashan via luv-main wrote:
Hi,
On 18/1/20 2:14 pm, Andrew McGlashan via luv-main wrote:
btrfs -- I never, ever considered that to be real production ready and I believe that even dead hat has moved away from it somewhat (not sure to what extent).
Some links, none of which are new as this occurred some time ago now.
I think this link is the most useful. BTRFS has worked quite solidly for me for years. The main deficiency of BTRFS is that RAID-5 and RAID-6 are not usable as of the last reports I read.

For a home server RAID-1 is all you need (2 or 3 largish SATA disks in a RAID-1 gives plenty of storage). The way BTRFS allows you to extend a RAID-1 filesystem by adding a new disk of any size and rebalancing is really handy for home use. The ZFS limit of having all disks be the same size and upgraded in lock step is no problem for corporate use.

Generally I recommend using BTRFS for workstations and servers that have 2 disks. Use ZFS for big storage.

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sun, Jan 19, 2020 at 05:38:23PM +1100, russell@coker.com.au wrote:
Generally I recommend using BTRFS for workstations and servers that have 2 disks. Use ZFS for big storage.
Unless you need to make regular backups from workstations or small servers to a "big storage" ZFS backup server. In that case, use zfs so you can use 'zfs send'. Backups will be completed in a very small fraction of the time they'd take with rsync....the time difference is huge - minutes vs hours. That's fast enough to do them hourly or more frequently if needed, instead of daily. craig -- craig sanders <cas@taz.net.au>

On Monday, 20 January 2020 2:34:09 AM AEDT Craig Sanders via luv-main wrote:
On Sun, Jan 19, 2020 at 05:38:23PM +1100, russell@coker.com.au wrote:
Generally I recommend using BTRFS for workstations and servers that have 2 disks. Use ZFS for big storage.
Unless you need to make regular backups from workstations or small servers to a "big storage" ZFS backup server. In that case, use zfs so you can use 'zfs send'. Backups will be completed in a very small fraction of the time they'd take with rsync....the time difference is huge - minutes vs hours. That's fast enough to do them hourly or more frequently if needed, instead of daily.
It really depends on the type of data. Backing up VM images via rsync is slow because they always have relatively small changes in the middle of large files. Backing up large mail spools can be slow as there's a significant number of accounts with no real changes as well as a good number of accounts with only small changes (like the power users who have 10,000+ old messages stored and only a few new messages at any time because they delete most mail soon after it arrives). But even for those corner cases rsync will work if your data volume isn't too big. For other cases it works pretty well. I guess you have to trade off the features of using one filesystem everywhere vs the ability to run filesystems independently of what applications will run on top. I like the freedom to use whichever filesystem best suits the server. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, Jan 28, 2020 at 08:06:18PM +1100, russell@coker.com.au wrote:
On Monday, 20 January 2020 2:34:09 AM AEDT Craig Sanders via luv-main wrote:
[ paraphrased from memory because I deleted it: Russell said something about using btrfs on small boxes, and zfs only on big storage servers ]
Unless you need to make regular backups from workstations or small servers to a "big storage" ZFS backup server. In that case, use zfs so you can use 'zfs send'. Backups will be completed in a very small fraction of the time they'd take with rsync....the time difference is huge - minutes vs hours. That's fast enough to do them hourly or more frequently if needed, instead of daily.
It really depends on the type of data.
No, it really doesn't.
Backing up VM images via rsync is slow because they always have relatively small changes in the middle of large files.
rsyncing **ANY** large set of data is slow, whether it's huge files like VM images or millions of small files (e.g. on a mail server). rsync has to check at least the file sizes and timestamps, and then the block checksums, on every run. On large sets, this WILL take many hours, no matter how much or how little has actually changed.

'zfs send' and 'btrfs send' already know exactly which blocks have changed and they just send those blocks, no need for checking. Why? Because a snapshot is effectively just a list of blocks in use at a particular point in time. COW ensures that if a file is created or changed or deleted, the set of blocks in the next snapshot will be different.

(a minor benefit of this is that if a file or directory is moved to another directory in the same dataset, the only blocks that actually changed were the blocks containing the directory info, so they're the only blocks that need be sent. rsync, however, would send the entire directory contents because it's all "new" data. Transparent compression also helps 'zfs send' - compressed data requires fewer blocks to store it....rsync, though, can't benefit from transparent compression as it has to compare the source file's *uncompressed* data with the target copy)

rsync is still useful as a tool for moving/copying data from one location to another (whether on the same machine or to a different machine), but it's no longer a good choice for backups. it just takes too long - by the time it has finished, the source data will have changed. It's an improved "cp".

I guess it's also still useful for backing up irrelevant machines like those running MS Windows. But they should be storing important data on the file server anyway, so they can be blown away and re-imaged whenever required.
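A rough illustration of the kind of incremental send being described (pool, dataset and host names here are made up; the real ones depend on your layout):

    # snapshot the dataset, then send only the blocks that changed since
    # the previous snapshot to the backup server
    zfs snapshot tank/home@2020-01-30
    zfs send -i tank/home@2020-01-29 tank/home@2020-01-30 | \
        ssh backupserver zfs receive -F backup/home
    # (-F rolls the receiving dataset back to the matching snapshot first)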
I guess you have to trade off the features of using one filesystem everywhere vs the ability to run filesystems independently of what applications will run on top. I like the freedom to use whichever filesystem best suits the server.
I prefer to use the filesystem that's best for all machines on the network. If ZFS is in use on the file-server or backup-server, then that means zfs on everything else. If it's btrfs on the server, then it should be btrfs on everything. send/receive alone are worth putting in the time & effort to standardise, and both zfs & btrfs also offer many more very useful features.

And if neither is currently in use, then that means scheduling appropriate times & days to convert everything over to ZFS, starting with the server(s). btrfs is not an option here because it just isn't as good as zfs...if i'm going to go to all that trouble and hassle, i may as well get the most/best benefit in exchange.

craig

-- craig sanders <cas@taz.net.au>

On Thursday, 30 January 2020 6:05:56 PM AEDT Craig Sanders via luv-main wrote:
It really depends on the type of data.
No, it really doesn't.
Backing up VM images via rsync is slow because they always have relatively small changes in the middle of large files.
rsyncing **ANY** large set of data is slow, whether it's huge files like VM images or millions of small files (e.g. on a mail server).
Here's what I wrote previously: # It really depends on the type of data. Backing up VM images via rsync is # slow because they always have relatively small changes in the middle of # large files. Backing up large mail spools can be slow as there's a # significant number of accounts with no real changes as well as a good number # of accounts with only small changes (like the power users who have 10,000+ # old messages stored and only a few new messages at any time because they # delete most mail soon after it arrives). But even for those corner cases # rsync will work if your data volume isn't too big. For other cases it works # pretty well. I've used rsync to backup mail spools with up to about 20,000 accounts. Not big mail stores and only doing a backup twice a week. The regular backups (for users deleting the wrong messages) were ZFS snapshots.
rsync has to check at least the file sizes and timestamps, and then the block checksums on every run. On large sets, this WILL take many hours, no matter how much or how little has actually changed.
It's all a matter of scale. I just did a test on a workstation with about 100G of storage in BTRFS. The usual backups are weekly on Sunday night. A run now took 28 minutes (copying 5 days of data). A run immediately after (just rsync checking file dates) took 65 seconds. I could set that machine to have a backup every hour over the Internet if I wanted to.
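For reference, a backup run of that sort is typically just a single rsync invocation along these lines (host and paths are examples, not the actual setup described above):

    # mirror the workstation to the backup server, preserving permissions,
    # ownership and hard links, staying on one filesystem (-x), and deleting
    # files on the destination that were removed at the source
    rsync -aHx --delete / backupserver:/backups/workstation/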
(a minor benefit of this is that if a file or directory is moved to another directory in the same dataset, the only blocks that actually changed were the blocks containing the directory info, so they're the only blocks that need be sent. rsync, however, would send the entire directory contents
Yes, that's good for that case. Not a common case I deal with.
because it's all "new" data. Transparent compression also helps 'zfs send' - compressed data requires fewer blocks to storer it....rsync, though, can't benefit from transparent compression as it has to compare the source file's *uncompressed* data with the target copy)
Rsync compares the checksums of the uncompressed data. Then sends compressed data if you use the -z option, and if you have ssh configured to use compression then that applies too.
rsync is still useful as a tool for moving/copying data from one location to another (whether on the same machine or to a different machine), but it's no longer a good choice for backups. it just takes too long - by the time it has finished, the source data will have changed. It's an improved "cp".
That depends on what you are backing up. Rsync is a well known program, it doesn't require any special setup or testing. The BTRFS and ZFS programs for sending changes would require more testing.
I prefer to use the filesystem that's best for all machines on the network.
If ZFS is in use on the file-server or backup-server, then that means zfs on everything else. If it's btrfs on the server, then it should be btrfs on everything.
Except if you have some systems storing large data that needs RAID-Z and some systems that need the flexibility that BTRFS offers.
btrfs is not an option here because it just isn't as good as zfs...if i'm
Unless you want to have a RAID-1 array that can have disks added to it or removed from it at any time and of any size. This is a useful feature for a home server and something ZFS doesn't support. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
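That add/remove flexibility boils down to a couple of commands (device and mount point names are examples only):

    # grow a btrfs raid-1 filesystem with a new disk of any size,
    # then rebalance the existing data across all devices
    btrfs device add /dev/sdd /data
    btrfs balance start /data
    # or shrink it: btrfs migrates the removed disk's data to the others first
    btrfs device remove /dev/sdb /data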

On Sat, Jan 18, 2020 at 02:14:46PM +1100, Andrew McGlashan wrote:
Just some thoughts....
Way back, SSDs were expensive and less reliable than today.
Given the cost of SSDs today, I would consider even RAIDING the SSDs.
If it's physically possible to install a second SSD of the same storage capacity or larger then he absolutely should do so. I vaguely recall suggesting he should get a second SSD for the rootfs ages ago, but my understanding / assumption was that there was only physical space and connectors for one SSD in the machine.

The 'btrfs snapshot' + 'btrfs send' suggestion was just a way of regularly backing up a single-drive btrfs filesystem onto his raid-1 btrfs array so that little or nothing was lost in case of another drive failure. It's less than ideal, but a LOT better than nothing.

I personally would never use anything less than RAID-1 (or equivalent, such as a mirrored pair on zfs) for any storage. Which means, of course, that I'm used to paying double for my storage capacity - i can't just buy one, I have to buy a pair. Not as a substitute for regular backups, but for convenience when only one drive of a pair has died.

Drives die, and the time & inconvenience of dealing with that (and the lost data) cost far more than the price of a second drive for raid-1/mirror.

craig

-- craig sanders <cas@taz.net.au>

Hi Craig,

Yes, the problem was my motherboard would not handle enough disks, and we did format sdc with btrfs and left sdb alone so that btrfs could arrange things between them. I was hoping to get an understanding of how the RAID drives remembered the "balance" command when the whole of the root filesystem was replaced on a new SSD. I thought that control would have rested with /etc/fstab. How do the drives know to balance themselves, is there a command resident in sdc1?

My plan is to have auto backups, and given that my activity has seen an SSD go down in 12 months, maybe at 10 months I should build a new box, something which will handle 64GB RAM and have a decent open source graphics driver, and put / on a pair of 1TB SSDs.

Many thanks
Andrew

On Sat, Jan 18, 2020 at 11:06:50PM +1100, Andrew Greig wrote:
Yes, the problem was my Motherboard would not handle enough disks, and we did Format sdc with btrfs and left the sdb alone so that btrfs could arrange things between them.
I was hoping to get an understanding of how the RAID drives remembered the "Balance" command when the the whole of the root filesystem was replaced on a new SSD.
Your rootfs and your /data filesystem(*) are entirely separate. Don't confuse them.

The /data filesystem needed to be re-balanced when you added the second drive (making it into a raid-1 array). 'btrfs balance' reads and rewrites all the existing data on a btrfs filesystem so that it is distributed equally over all drives in the array. For RAID-1, that means mirroring all the data on the first drive onto the second, so that there's a redundant copy of everything.

Your rootfs is only a single partition, it doesn't have a raid-1 mirror, so re-balancing isn't necessary (and would do nothing).

BTW, there's nothing being "remembered". 'btrfs balance' just re-balances the existing data over all drives in the array. It's a once-off operation that runs to completion and then exits. All **NEW** data will be automatically distributed across the array. If you ever add another drive to the array, or convert it to raid-0 (definitely NOT recommended), you'll need to re-balance it again. Until and unless that happens you don't need to even think about re-balancing, it's no longer relevant.

(*) I think you had your btrfs raid array mounted at /data, but I may be mis-remembering that. To the best of my knowledge, you have two entirely separate btrfs filesystems - one is the root filesystem, mounted as / (it also has /home on it, which IIRC you have made a separate btrfs sub-volume for). Anyway, it's a single-partition btrfs fs with no raid. The other is a 2 drive btrfs fs using raid-1, which I think is mounted as /data.
I thought that control would have rested with /etc/fstab. How do the drives know to balance themselves, is there a command resident in sdc1?
/etc/fstab tells the system which filesystems to mount. It gets read at boot time by the system start up scripts.
My plan is to have auto backups, and given that my activity has seen an SSD go down in 12 months, maybe at 10 months I should build a new box, something which will handle 64Gb RAM and have a decent Open Source Graphics driver. And put the / on a pair of 1Tb SSDs.
That would be a very good idea. Most modern motherboards will have more than enough NVME and SATA slots for that (e.g. most Ryzen x570 motherboards have 2 or 3 NVME slots for extremely fast SSDs, plus 6 or 8 SATA ports for SATA HDDs and SSDs. They also have enough RAM slots for 64GB DDR-4 RAM, and have at least 2 or 3 PCI-e v4 slots - you'll use one for your graphics card).

2 SSDs for the rootfs including your home dir, and 2 HDDs for your /data bulk storage filesystem. And more than enough drive ports for future expansion if you ever need it.

-----------------------

some info on nvme vs sata:

NVME SSDs are **much** faster than SATA SSDs. SATA 3 is 6 Gbps (600 MBps), so taking protocol overhead into account SATA drives max out at around 550 MBps.

NVME drives run at **up to** PCI-e bus speeds - with 4 lanes, that's a little under 40 Gbps for PCIe v3 (approx 4000 MBps minus protocol overhead), double that for PCIe v4. That's the theoretical maximum speed, anyway. In practice, most NVME SSDs run quite a bit slower than that, about 2 GBps - that's still almost 4 times as fast as a SATA SSD.

Some brands and models (e.g. those from samsung and crucial) run at around 3200 to 3500 MBps, but they cost more (e.g. a 1TB Samsung 970 EVO PLUS (MZ-V7S1T0BW) costs around $300, while the 1TB Kingston A2000 (SA2000M8/1000G) costs around $160 but is only around 1800 MBps).

AFAIK there are no NVME drives that run at full PCI-e v4 speed (~8 GBps with 4 lanes) yet, it's still too new. That's not a problem, PCI-e is designed to be backwards-compatible with earlier versions, so any current NVME drive will work in pcie v4 slots.

NVME SSDs cost about the same as SATA SSDs of the same capacity so there's no reason not to get them if your motherboard has NVME slots (which are pretty much standard these days).

BTW, the socket that NVME drives plug into is called "M.2". M.2 supports both SATA & NVME protocols. SATA M.2 runs at 6 Gbps. NVME runs at PCI-e bus speed. So you have to be careful when you buy to make sure you get an NVME M.2 drive and not a SATA drive in M.2 form-factor...some retailers will try to exploit the confusion over this.

craig

-- craig sanders <cas@taz.net.au>

Thanks Craig, As they say in the Medibank commercial, "I feel better now!"

Andrew

Hi Craig, here is the output of blkid:

/dev/sdb1: LABEL="Data" UUID="73f55e83-2038-4a0d-9c05-8f7e2e741517" UUID_SUB="77fdea4e-3157-45af-bba4-7db8eb04ff08" TYPE="btrfs" PARTUUID="d5d96658-01"
/dev/sdc1: LABEL="Data" UUID="73f55e83-2038-4a0d-9c05-8f7e2e741517" UUID_SUB="8ad739f7-675e-4aeb-ab27-299b34f6ace5" TYPE="btrfs" PARTUUID="a1948e65-01"

I tried the first UUID for sdc1 and the machine hung, but gave me an opportunity to edit the fstab and reboot. When checking the UUIDs I discovered that the first entries for both drives were identical. Should I be using the SUB UUID for sdc1 for the entry in fstab?

Kind regards
Andrew

On Sun, Jan 19, 2020 at 04:48:30PM +1100, Andrew Greig wrote:
here is the output of blkid
/dev/sdb1: LABEL="Data" UUID="73f55e83-2038-4a0d-9c05-8f7e2e741517" UUID_SUB="77fdea4e-3157-45af-bba4-7db8eb04ff08" TYPE="btrfs" PARTUUID="d5d96658-01" /dev/sdc1: LABEL="Data" UUID="73f55e83-2038-4a0d-9c05-8f7e2e741517" UUID_SUB="8ad739f7-675e-4aeb-ab27-299b34f6ace5" TYPE="btrfs" PARTUUID="a1948e65-01"
I tried the first UUID for sdc1 and the machine hung but gave me an opportunity to edit the fstab and reboot.
That should work. Are you sure you typed or copy-pasted the UUID correctly? The fstab entry should look something like this:

UUID="73f55e83-2038-4a0d-9c05-8f7e2e741517" /data btrfs defaults 0 0

Edit /etc/fstab so that it looks like that and then (as root) run "mount /data". If that works manually on the command line, it will work when the machine reboots.
When checking the UUID I discovered that the first entry for both drives were identical.
yes, that's normal. they're both members of the same btrfs array.
Should I be using the SUB UUID for sdc1 for the entry in fstab?
No, you should use the UUID. Alternatively, you could use ONE of the PARTUUID values, e.g. one of:

PARTUUID="d5d96658-01" /data btrfs defaults 0 0
PARTUUID="a1948e65-01" /data btrfs defaults 0 0

craig

PS: I just tested several variations on this on my btrfs testing VM. UUID works. PARTUUID works. /etc/fstab does not support UUID_SUB (and it isn't mentioned in `man fstab`).

-- craig sanders <cas@taz.net.au>

On Sunday, 19 January 2020 3:47:00 PM AEDT Craig Sanders via luv-main wrote:
NVME SSDs are **much** faster than SATA SSDs. SATA 3 is 6 Gbps (600 MBps), so taking protocol overhead into account SATA drives max out at around 550 MBps.
NVME drives run at **up to** PCI-e bus speeds - with 4 lanes, that's a little under 40 Gbps for PCIe v3 (approx 4000 MBps minus protocol overhead), double that for PCIe v4. That's the theoretical maximum speed, anyway. In practice, most NVME SSDs run quite a bit slower than that, about 2 GBps - that's still almost 4 times as fast as a SATA SSD.
Some brands and models (e.g. those from samsung and crucial) run at around 3200 to 3500 MBps, but they cost more (e.g. a 1TB Samsung 970 EVO PLUS (MZ-V7S1T0BW) costs around $300, while the 1TB Kingston A2000 (SA2000M8/1000G) costs around $160 but is only around 1800 MBps).
Until recently I had a work Thinkpad with NVMe. That could sustain almost 5GB/s until the CPU overheated and throttled it (there was an ACPI bug that caused it to falsely regard 60C as a thermal throttle point instead of 80C). But when it came to random writes the speed was much lower, particularly with sustained writes. Things like upgrading a Linux distribution in a VM image causes sustained write rates to go well below 1GB/s. The NVMe interface is good, but having a CPU and storage that can sustain it is another issue. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Saturday, 18 January 2020 6:44:51 PM AEDT Craig Sanders via luv-main wrote:
I personally would never use anything less than RAID-1 (or equivalent, such as a mirrored pair on zfs) for any storage. Which means, of course, that I'm used to paying double for my storage capacity - i can't just buy one, I have to buy a pair. Not as a substitute for regular backups, but for convenience when only one drive of a pair has died.
Drives die, and the time & inconvenience of dealing with that (and the lost data) cost far more than the price of a second drive for raid-1/mirror.
I generally agree that RAID-1 is the way to go. But if you can't do that then BTRFS "dup" and ZFS "copies=2" are good options, especially with SSD. So far I have not seen an SSD entirely die; the worst I've seen is an SSD stop accepting writes (which causes an immediate kernel panic with a filesystem like BTRFS). I've also seen SSDs return corrupt data while claiming it to be good, but not in huge quantities.

For hard drives also I haven't seen a total failure (like stiction) for many years. The worst hard drive problem I've seen was about 12,000 read errors; that sounds like a lot but is a very small portion of a 3TB disk, and "dup" or "copies=2" should get most of your data back in that situation.

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
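For anyone wanting to try those options, they are set roughly like this (pool, filesystem and device names are examples only):

    # ZFS: store two copies of every block written to this dataset
    zfs set copies=2 tank/home

    # btrfs: duplicate data and metadata on a single-device filesystem,
    # either at creation time or by converting an existing one
    mkfs.btrfs -d dup -m dup /dev/sdb1
    btrfs balance start -dconvert=dup -mconvert=dup /mnt/data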

On Sun, Jan 19, 2020 at 05:34:46PM +1100, russell@coker.com.au wrote:
I generally agree that RAID-1 is the way to go. But if you can't do that then BTRFS "dup" and ZFS "copies=2" are good options, especially with SSD.
I don't see how that's the case, how it can help much (if at all). Making a second copy of the data on the same drive that's failing doesn't add much redundancy, but does add significantly to the drive's workload (increasing the risk of failure). It might be ok on a drive with only a few bad sectors or in conjunction with some kind of RAID, but it's not a substitute for RAID.
So far I have not seen a SSD entirely die, the worst I've seen is a SSD stop
I haven't either, but I've heard & read of it. Andrew's rootfs SSD seems to have died (or possibly just corrupted so badly it can't be mounted. i'm not sure) I've seen LOTS of HDDs die. Even at home I've had dozens die on me over the years - I've got multiple stacks of dead drives of various ages and sizes cluttering up shelves (mostly waiting for me to need another fridge magnet or shiny coffee-cup coaster :)
I've also seen SSDs return corrupt data while claiming it to be good, but not in huge quantities.
That's one of the things that btrfs and zfs can detect...and correct if there's any redundancy in the storage.
For hard drives also I haven't seen a total failure (like stiction) for many years. The worst hard drive problem I've seen was about 12,000 read errors, that sounds like a lot but is a very small portion of a 3TB disk and "dup" or "copies=2" should get most of your data back in that situation.
If a drive is failing, all the read or write re-tries kill performance on a zpool, and that drive will eventually be evicted from the pool. Lose enough drives, and your pool goes from "DEGRADED" to "FAILED", and your data goes with it. craig -- craig sanders <cas@taz.net.au>

On Monday, 20 January 2020 2:08:44 AM AEDT Craig Sanders via luv-main wrote:
On Sun, Jan 19, 2020 at 05:34:46PM +1100, russell@coker.com.au wrote:
I generally agree that RAID-1 is the way to go. But if you can't do that then BTRFS "dup" and ZFS "copies=2" are good options, especially with SSD.
I don't see how that's the case, how it can help much (if at all). Making a second copy of the data on the same drive that's failing doesn't add much redundancy, but does add significantly to the drive's workload (increasing the risk of failure).
It might be ok on a drive with only a few bad sectors or in conjunction with some kind of RAID, but it's not a substitute for RAID.
Having a storage device fail entirely seems like a rare occurrence. The only time it happened to me in the last 5 years is an SSD that stopped accepting writes (reads still mostly worked OK). I've had a couple of SSDs have checksum errors recently and a lot of hard drives have checksum errors. Checksum errors (where the drive returns what it considers good data but BTRFS or ZFS regard as bad data) are by far the most common failures I see of the 40+ storage devices I'm running in recent times. BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware issues that I've seen in the last 5+ years.
So far I have not seen a SSD entirely die, the worst I've seen is a SSD stop I haven't either, but I've heard & read of it. Andrew's rootfs SSD seems to have died (or possibly just corrupted so badly it can't be mounted. i'm not sure)
I've seen LOTS of HDDs die. Even at home I've had dozens die on me over the years - I've got multiple stacks of dead drives of various ages and sizes cluttering up shelves (mostly waiting for me to need another fridge magnet or shiny coffee-cup coaster :)
I've seen them die in the past. But recently they seem to just give increasing error counts. Maybe if I ran a disk that was giving ZFS or BTRFS checksum errors for another few years it might die entirely, but I generally have such disks discarded or drastically repurposed after getting ~40 checksum errors.
For hard drives also I haven't seen a total failure (like stiction) for many years. The worst hard drive problem I've seen was about 12,000 read errors, that sounds like a lot but is a very small portion of a 3TB disk and "dup" or "copies=2" should get most of your data back in that situation. If a drive is failing, all the read or write re-tries kill performance on a zpool, and that drive will eventually be evicted from the pool. Lose enough drives, and your pool goes from "DEGRADED" to "FAILED", and your data goes with it.
So far I haven't seen that happen on my ZFS servers. I have replaced at least 20 disks in zpools due to excessive checksum errors. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, Jan 28, 2020 at 08:02:15PM +1100, russell@coker.com.au wrote:
Having a storage device fail entirely seems like a rare occurrence. The only time it happened to me in the last 5 years is an SSD that stopped accepting writes (reads still mostly worked OK).
it's not rare at all, but a drive doesn't have to be completely non-responsive to be considered "dead". It just has to consistently cause enough errors that it results in the pool being degraded.

I recently had a seagate ironwolf 4TB drive that would consistently cause problems in my "backup" pool (8TB in two mirrored pairs of 4TB drives, i.e. RAID-10, containing 'zfs send' backups of all my other machines). Whenever it was under moderately heavy load, it would cause enough errors to be kicked, degrading the pool. I didn't have a spare drive to replace it immediately, so just "zpool clear"-ed it several times. Running a scrub on that pool with that drive was guaranteed to degrade the pool within minutes. and, yeah, i moved it around to different SATA & SAS ports just in case it was the port and not the drive. nope. it was the drive.

To me, that's a dead drive because it's not safe to use. it can not be trusted to reliably store data. it is junk. the only good use for it is to scrap it for the magnets.

(and, btw, that's why I use ZFS and used to use RAID. Without redundancy from RAID-[156Z] or similar, such a drive would result in data loss. Even worse, without the error detection and correction from ZFS, such a drive would result in data corruption).
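The commands involved in that kind of babysitting are straightforward (the pool name here matches the "backup" pool mentioned above, adjust to suit):

    zpool status -v backup    # show pool health and per-device error counts
    zpool clear backup        # clear the error counters and readmit the device
    zpool scrub backup        # re-read and verify everything on the pool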
I've had a couple of SSDs have checksum errors recently and a lot of hard drives have checksum errors. Checksum errors (where the drive returns what it considers good data but BTRFS or ZFS regard as bad data) are by far the most common failures I see of the 40+ storage devices I'm running in recent times.
a drive that consistently returns bad data is not fit for purpose. it is junk. it is a dead drive.
BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware issues that I've seen in the last 5+ years.
IMO, two copies of data on a drive you can't trust isn't significantly better or more useful than one copy. It's roughly equivalent to making a photocopy of your important documents and then putting both copies in the same soggy cardboard box in a damp cellar.

If you want redundancy, use two or more drives. Store your important documents in two or more different locations. And back up regularly.
If a drive is failing, all the read or write re-tries kill performance on a zpool, and that drive will eventually be evicted from the pool. Lose enough drives, and your pool goes from "DEGRADED" to "FAILED", and your data goes with it.
So far I haven't seen that happen on my ZFS servers. I have replaced at least 20 disks in zpools due to excessive checksum errors.
I've never had a pool go to FAILED state, either. I've had pools go to DEGRADED *lots* of times. And almost every time it comes after massive performance drops due to retries - which can be seen in the kernel logs. Depending on the brand, you can also clearly hear the head re-seeking as it tries again and again to read from the bad sector.

More importantly, it's not difficult or unlikely for a pool to go from being merely DEGRADED to FAILED. A drive doesn't have to fail entirely for it to be kicked out of the pool, and if you have enough drives kicked out of a vdev or a pool (2 drives for mirror or raidz-1, 3 for raidz-2, 4 for raidz-3), then that entire vdev is FAILED, not just DEGRADED, and the entire pool will likely be FAILED(*) as a result. That's what happens when there are not enough working drives in a vdev to store the data that's supposed to be stored on it.

And the longer you wait to replace a dead/faulty drive, the more likely it is that another drive will die while the pool is degraded. Which is why best practice is to replace the drive ASAP...and also why zfs and some other raid/raid-like HW & SW support "spare" devices to automatically replace them.

(*) there are some pool layouts that are resistant (but not immune) to failing - e.g. a mirror of any vdev with redundancy, such as a mirrored pair of raidz vdevs. which is why RAID of any kind is not a substitute for backups.

craig

-- craig sanders <cas@taz.net.au>
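The replacement itself is a single command, and hot spares can be attached so a failed drive gets replaced without waiting for an admin (pool and device names are examples only):

    # swap a faulty drive for a new one; zfs resilvers onto the new disk
    zpool replace tank /dev/sdc /dev/sdf
    # keep a hot spare in the pool so a faulted drive can be replaced automatically
    zpool add tank spare /dev/sdg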

On Thursday, 30 January 2020 5:14:22 PM AEDT Craig Sanders via luv-main wrote:
On Tue, Jan 28, 2020 at 08:02:15PM +1100, russell@coker.com.au wrote:
Having a storage device fail entirely seems like a rare occurrence. The only time it happened to me in the last 5 years is an SSD that stopped accepting writes (reads still mostly worked OK).
it's not rare at all, but a drive doesn't have to be completely non-responsive to be considered "dead". It just has to consistently cause enough errors that it results in the pool being degraded.
In recent times I've only had one disk that had such a large number of errors, a 4TB (from memory) disk with about 12,000 errors. ~12,000 errors out of ~1,000,000,000 blocks (4K block size) means about 0.0012% errors. ZFS with copies=2 on that seems quite likely to give a good amount of your data back.
To me, that's a dead drive because it's not safe to use. it can not be trusted to reliably store data. it is junk. the only good use for it is to scrap it for the magnets.
I've had about a dozen disks in the last ~5 years that would give about 20 ZFS checksum errors a month. I got them replaced with that level of errors; who knows what they might have done if they had remained in service. Presumably if the system in question had run Ext4 we would have discovered the answer to that question.
I've had a couple of SSDs have checksum errors recently and a lot of hard drives have checksum errors. Checksum errors (where the drive returns what it considers good data but BTRFS or ZFS regard as bad data) are by far the most common failures I see of the 40+ storage devices I'm running in recent times.
a drive that consistently returns bad data is not fit for purpose. it is junk. it is a dead drive.
That's my opinion too. But sometimes the people who pay have different opinions and are happy to tolerate a small number of checksum errors.
BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware issues that I've seen in the last 5+ years.
IMO, two copies of data on a drive you can't trust isn't significantly better or more useful than one copy. It's roughly equivalent to making a photocopy of your important documents and then putting both copies in the same soggy cardboard box in a damp cellar.
If a disk gets 20 checksum errors per month out of 6TB or more of storage then the probability of 2 of those checksum errors hitting the same block is very low, even on BTRFS which I believe has a fairly random allocation for dup. I believe that ZFS is designed to allocate data to reduce the possibility of somewhat random errors taking out multiple copies of data but haven't investigated the details. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/