
On Mon, May 21, 2018 at 05:23:39PM +1000, pushin.linux wrote:
> Reply to list wasnt offered in my phone. Apologies.
no problem. I'll reply back to the list, so it goes to the right place.
> My photographic data is critical. Music, videos etc are unimportant.
You've probably heard this before but:

  ******************************************
  ******************************************
  **                                      **
  ** RAID IS NOT A SUBSTITUTE FOR BACKUP! **
  **                                      **
  ******************************************
  ******************************************

RAID is convenient, and it allows your system to keep going without
having to restore from backup, but you will still need to back up your
photos and other important data regularly.
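For example (the paths and device name here are made up, just for
illustration), even a simple rsync of your photo directory to an
external USB drive, run regularly, is a lot better than nothing:

# mount /dev/sdX1 /mnt/backup
# rsync -a --delete /home/you/photos/ /mnt/backup/photos/
# umount /mnt/backup

rsync's -a preserves permissions and timestamps, and --delete makes the
copy an exact mirror of the source - so double-check the source path
before you rely on it.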
> I could buy another 2 Tb drive, but what to do with the 1Tb drive. I
> thought I could have a bare system running on the 1Tb and all storage
> on a RAID pair.
What you have will work fine, there's nothing wrong with it. I just
think that you're better off having your OS disk on some form of RAID
as well. The easiest way to do that is to just get another 1TB drive
(approx $60). Then you'd have two mirrored pairs: one for the OS, home
dirs and other stuff, and one for your photos.

If you're using ZFS, you could even set it up so that you have a
combined pool with two mirrored pairs (2 x 1TB drives and 2 x 2TB),
giving a total of 3TB shared between the OS and your photos. This is
probably the most flexible setup. LVM would also allow you to combine
the storage, but it's quite a bit more work and more complicated to set
up.

BTW, just to state the obvious - each mirrored pair of drives should be
the same size (if they're different, you'll only get the capacity of
the smallest drive in the pair), but you can have multiple mirrors of
different sizes in a pool.

e.g. here's the root zfs pool on my main system. It has the OS, my home
directories, and some other stuff on it. Most of my data is on a second
4TB pool, and this machine also has an 8TB pool called "backup" which
has regular (hourly, daily, weekly, monthly) snapshotted backups of
every machine on my home network.

# zpool status ganesh
  pool: ganesh
 state: ONLINE
  scan: scrub repaired 0B in 0h10m with 0 errors on Sat Apr 28 02:10:27 2018
config:

	NAME                                               STATE     READ WRITE CKSUM
	ganesh                                             ONLINE       0     0     0
	  mirror-0                                         ONLINE       0     0     0
	    ata-Crucial_CT275MX300SSD1_163313AADD8A-part5  ONLINE       0     0     0
	    ata-Crucial_CT275MX300SSD1_163313AAEE5F-part5  ONLINE       0     0     0
	  mirror-1                                         ONLINE       0     0     0
	    ata-Crucial_CT275MX300SSD1_163313AAF850-part5  ONLINE       0     0     0
	    ata-Crucial_CT275MX300SSD1_163313AB002C-part5  ONLINE       0     0     0

That has two mirrored pairs of drives (roughly equivalent to RAID-10 in
mdadm terms), called mirror-0 and mirror-1.

BTW, "mirror-0" and "mirror-1" are what are known as "vdev"s or
"virtual devices". A vdev can be a mirrored set of drives as above, or
a raid-z, or even a single drive (but there's no redundancy for a
single drive, and adding one to a pool effectively destroys the entire
pool's redundancy, so never do that). A ZFS pool is made up of one or
more vdevs.

Also BTW, a mirrored set can be pairs as I have, or (just like RAID-1
mirrors) you can mirror to three or four or more drives if you want
extra redundancy (and extra read speed).

Anyway, the vdevs here happen to be both the same size because I bought
4 identical SSDs to set it up with (4 x 256GB was slightly more
expensive than 2 x 512GB, but by spreading the IO over 4 drives rather
than just two, I get about double the read performance), but there's no
reason at all why they couldn't be different sizes. e.g. the easiest
and fastest way for me to double the capacity of that pool would be to
just add a pair of 512GB SSDs to it. It's nowhere near full, so I won't
be doing that any time in the foreseeable future.

In fact, that's one of the advantages of using mirrored pairs - you can
upgrade the pool two drives at a time, either by adding another pair of
drives, or by replacing both drives in a pair with larger drives.
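To give you a rough idea of what that combined pool might look like
(the drive names below are made up for the example - you'd use the
/dev/disk/by-id/ names of your actual drives, and booting from a ZFS
root needs a few extra steps that I'm skipping here), you'd create it
with both mirrors in one go, and could add another mirrored pair later
with "zpool add":

# zpool create -o ashift=12 tank \
    mirror ata-1TB_DRIVE_A ata-1TB_DRIVE_B \
    mirror ata-2TB_DRIVE_A ata-2TB_DRIVE_B

# zpool add tank mirror ata-4TB_DRIVE_A ata-4TB_DRIVE_B

Both mirrors end up in the one pool ("tank" is just a placeholder
name), so every filesystem you create in it shares the combined 3TB.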
For comparison, here's the main storage pool "export" of my MythTV box.
It has one vdev called "raidz1-0", with 4 x 2TB drives.

# zpool status export
  pool: export
 state: ONLINE
  scan: scrub repaired 0B in 15h14m with 0 errors on Sat May 19 18:55:01 2018
config:

	NAME                                          STATE     READ WRITE CKSUM
	export                                        ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-ST2000DL003-9VT166_5YD1QFAG           ONLINE       0     0     0
	    ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5379164  ONLINE       0     0     0
	    ata-WDC_WD20EARX-008FB0_WD-WCAZAJ827116   ONLINE       0     0     0
	    ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5353040  ONLINE       0     0     0

If I wanted to upgrade it, I could either add a second vdev to the pool
(a mirrored pair, or another raid-z vdev), OR I could replace each of
the 2TB drives with, say, 4TB drives. I'd only see the extra capacity
when ALL drives in the vdev had been replaced. In practice, that would
take so long that it would be much faster to just create a new pool
with 4 x 4TB drives and use 'zfs send' to copy everything to the new
pool, then retire the old pool.

BTW, you can see that I've had to replace one of the Western Digital
drives with a Seagate at some time in the past. If I wanted to know
when that happened, I could run "zpool history" - ZFS stores a history
of every significant thing that happens to the pool. e.g.

# zpool history export | grep ata-ST2000DL003-9VT166_5YD1QFAG
2016-06-05.15:00:03 zpool replace -f export ata-WDC_WD20EARX-00PASB0_WD-WCAZA8430027 /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD1QFAG

and if I wanted to know when I created the pool:

# zpool history export | head -2
History for 'export':
2012-07-15.09:13:43 zpool create -f -o ashift=12 export raidz scsi-SATA_WDC_WD20EARX-00_WD-WCAZA8436337 scsi-SATA_WDC_WD20EARS-00_WD-WCAZA5379164 scsi-SATA_WDC_WD20EARX-00_WD-WCAZA8430027 scsi-SATA_WDC_WD20EARS-00_WD-WCAZA5353040

The ashift=12 option tells 'zpool' to create the pool aligned for 4K
sectors (2^12 = 4096) instead of the default 512 byte sectors
(ashift=9, 2^9 = 512).

And the "scsi-SATA_" and "ata-" prefixes refer to the same drives. I
expect that I exported the pool and then re-imported it at some point
and the drive names changed slightly. Or maybe after an upgrade the
kernel stopped caring about the fact that the SATA drives were on a SAS
scsi controller. Don't know, don't care, not important... the model and
serial numbers identify the drives, and I have sticky labels with the
serial numbers on the hot-swap bays.

If you carefully compare the zpool create command with the status
output above, you'll notice that the drives listed in the create
command aren't the same as those in the status output. I've had to
replace a few of those WD EARX drives in that pool.

# zpool history export | grep replace
2012-07-18.09:27:30 zpool replace -f export scsi-SATA_WDC_WD20EARX-00_WD-WCAZA8430027 scsi-SATA_WDC_WD20EARX-00_WD-WMAZA9502728
2013-01-03.21:15:57 zpool replace -f export scsi-SATA_WDC_WD20EARX-00_WD-WMAZA9502728 scsi-SATA_WDC_WD20EARX-00_WD-WCAZAJ827116
2016-05-18.22:24:28 zpool replace -f export ata-WDC_WD20EARX-00PASB0_WD-WCAZA8436337 /dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZA8430027
2016-06-05.15:00:03 zpool replace -f export ata-WDC_WD20EARX-00PASB0_WD-WCAZA8430027 /dev/disk/by-id/ata-ST2000DL003-9VT166_5YD1QFAG

In fact, you can see that on 18 May 2016, I tried to replace one of the
drives (WCAZA8436337) with one I'd previously removed and replaced
(WCAZA8430027), then about three weeks later on 5 June 2016 replaced it
with a Seagate drive. That's what happens when you leave dead/dying
drives just lying around without writing "dead" on them.
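If you do end up on ZFS and a drive dies, replacing it looks much the
same as those history entries above (again, the drive names here are
just placeholders): swap the dead drive for a new one, tell the pool
about it, and let it resilver:

# zpool replace export ata-OLD_DEAD_DRIVE /dev/disk/by-id/ata-NEW_DRIVE
# zpool status export

"zpool status" shows the resilver progress. The same "zpool replace" is
also how you'd do the replace-every-drive-with-a-bigger-one upgrade I
mentioned - with "zpool set autoexpand=on export", the extra capacity
shows up once the last drive in the vdev has been replaced.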
> I really appreciate the enormous amount of support. The only issue
> now is how to create a roadmap from it all.
Again, no problem. And, like I said, the best thing you can do is to
start playing with this stuff in some virtual machines. Practice with
it until it's completely familiar, and until you understand it well
enough to be able to make informed decisions that suit your exact
needs.

VMs are great for trying stuff out in a safe environment that won't
mess with your real system. You can take some stupid risks and learn
from them - in fact, that's one of the great things about VMs for
learning: you can deliberately do all the things you've read not to do,
so you understand WHY you shouldn't, and also hopefully learn how you
can recover from making such disastrous mistakes.

In this case, VMs are also a good way to compare the differences
between LVM alone, mdadm alone, LVM+mdadm, btrfs, ZFS, and more. Learn
what each is capable of and how to use the tools to control them. Learn
what happens to an array or pool when you tell KVM to detach one or
more of the virtual disks from the running VM, or if you write some
garbage data to one or more of the vdisks.

I've got a few VMs for doing that. One of them, called "ztest" (because
it started out being just for ZFS testing), has a 5GB boot disk (debian
sid) plus another 12 virtual disks attached to it, each about 200MB in
size. These get combined in various configurations for zfs, btrfs, lvm,
mdadm depending on what I want to experiment with at the time.

It's still worth doing this even if you've already finished setting up
your new drives. As long as you've got somewhere to make a complete
fresh backup of your data, you can always rebuild your system if you
find a better way to set things up for your needs.

It's also worth it because the better you know mdadm or lvm or zfs or
whatever you end up using, the less likely you are to panic and make a
terrible mistake if a drive dies and you have to replace it, or you
have to deal with some other problem. The best time to learn system
recovery techniques is before you need them, not at the exact moment
that you need them :)

craig

--
craig sanders <cas@taz.net.au>