
> On 2013-04-09 02:40, James Harper wrote:
>> I have a server that had 4 x 1.5TB disks installed in a RAID5 configuration (except /boot is a 'RAID1' across all 4 disks). One of the disks failed recently and so was replaced with a 3TB disk,

> I'd be very wary of running RAID5 on disks >2TB.

> Remember that, when you have a disk failure, in order to rebuild the array it needs to scan every sector of every remaining disk, then write to every sector of the replacement disk.
Debian does a complete scan of the array every month anyway, and an HP RAID controller will basically be doing a slow background scan constantly during periods of low use. A full resync on my 4x3TB array only takes 6 hours, so the window is pretty small. And in this case it's the server holding the backups, so while losing it would be inconvenient, there are other copies of the data too.

Also, SMART monitoring helps catch pending failures before a hard read or write error occurs. The original replacement was done because of a SMART notification - the kernel logged a single SCSI timeout error sometime after that, but the RAID remained consistent, and the monthly surface scan ran after that (but before the disk was replaced) without reporting any hard errors, even though SMART was predicting "failure within 24 hours". With a small number of exceptions, this is consistent with my experience of failed disks.
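As a rough sanity check on that 6-hour resync figure (just a back-of-envelope sketch - it assumes the resync is limited by streaming one member disk end to end, which is roughly how md behaves when the array is otherwise idle):

  # Implied sustained rate for a 6-hour resync with 3TB members.
  member_bytes = 3.0e12            # one 3TB member disk
  resync_seconds = 6 * 3600.0      # observed resync window
  print("%.0f MB/s per disk" % (member_bytes / resync_seconds / 1e6))
  # ~139 MB/s, which is plausible sequential throughput for a 3TB drive,
  # so the 6-hour window isn't an optimistic outlier.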
> Compare that number of read & write operations to the Mean Time Between Failures of each disk, and you're (statistically) starting to get close to the point where there's a significant risk of a second drive failing before the rebuild finishes. (For a given definition of "significant" risk.)
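To put a rough number on that risk (a sketch only - it assumes independent failures at the spec-sheet rate, which is optimistic given that drives from the same batch in the same hot enclosure tend to fail together; mtbf_hours and rebuild_hours are just the figures mentioned in this thread):

  import math
  # P(another member dies during the rebuild window), constant-failure-rate model.
  mtbf_hours = 750000.0        # typical desktop-drive spec mentioned below
  rebuild_hours = 6.0          # resync window seen on this array
  surviving = 3                # remaining members of the 4-disk RAID5
  p_one = 1 - math.exp(-rebuild_hours / mtbf_hours)
  p_any = 1 - (1 - p_one) ** surviving
  print("P(second failure during rebuild) ~ %.1e" % p_any)
  # ~2.4e-05 on spec numbers; real-world rates are certainly higher, and a
  # rebuild throttled by normal load scales the exposure up roughly linearly.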
> For mission critical data I'd be going with RAID10 (or maybe RAID6 if I had battery backed write cache, but the performance is still pretty bad for any workload I would consider mission critical).

The MTBF for the disk is given as 1,000,000 hours, while most other disks I've seen are around the 750,000/800,000 hour mark. This server sits in a cupboard in a factory, runs significantly hotter than room temperature, and every time I work on it I end up covered in dirt, but this is the first disk failure after around 3 years of hard use. The disks in question claim a "35% improvement over standard desktop drives" wrt MTBF, so the marketing hype says it's okay ;)

James
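PS. For anyone who prefers those MTBF figures as yearly odds, here's a quick conversion (a sketch assuming a constant failure rate, i.e. AFR = 1 - exp(-hours_per_year / MTBF)):

  import math
  hours_per_year = 8766.0
  for mtbf in (750000, 800000, 1000000):
      afr = 1 - math.exp(-hours_per_year / mtbf)
      print("MTBF %7d h -> AFR ~ %.2f%%" % (mtbf, afr * 100))
  # Roughly 1.2%, 1.1% and 0.9% per drive-year; going from 750,000 to
  # 1,000,000 hours is a ~33% MTBF improvement, so the quoted "35%" is at
  # least internally consistent.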