
I run a bunch of servers with Linux software RAID-1. I use bitmaps on all of them because the ongoing overhead (*) of bitmaps is better than the occasional overhead of a full resync.

Recently one of my servers suddenly decided to do a complete RAID-1 resync for no apparent reason. Other servers with the same versions of all software (Debian/Squeeze with all updates) didn't do it. The server in question did crash a few times recently (**). Is a server crash likely to result in an entire RAID resync even when bitmaps are used?

Does anyone have any advice other than throwing the server in the bin?

(*) I really doubt that the overhead is as bad as some people claim. I plan to test it but haven't had time so far.

(**) Currently dmesg output includes the following:

[87347.834590] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[87347.844958] BUG: soft lockup - CPU#0 stuck for 94s! [swapper:0]

I'm not sure if this is related to the crashes. I suspected a problem with eth1 and turned off TSO etc which seems to have helped.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
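(For anyone who wants to check the same thing on their own arrays: a minimal sketch, assuming an array at /dev/md1 with an internal bitmap and a member device at /dev/sda2 -- adjust device names to suit.)

# confirm the array has a write-intent bitmap and see its state
mdadm --detail /dev/md1 | grep -i bitmap
cat /proc/mdstat

# add an internal write-intent bitmap to an existing array (remove again with --bitmap=none)
mdadm --grow --bitmap=internal /dev/md1

# inspect the bitmap stored on a member device
mdadm --examine-bitmap /dev/sda2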

I run a bunch of servers with Linux software RAID-1. I use bitmaps on all of them because the ongoing overhead (*) of bitmaps is better than the occasional overhead of a full resync.
Recently one of my servers suddenly decided to do a complete RAID-1 resync for no apparent reason. Other servers with the same versions of all software (Debian/Squeeze with all updates) didn't do it. The server in question did crash a few times recently (**). Is a server crash likely to result in an entire RAID resync even when bitmaps are used?
Does anyone have any advice other than throwing the server in the bin?
I'd suggest netconsole but that's not going to help if your ethX interface is crashing, or are you already confident you are seeing all the messages at crash time?

Does /proc/mdstat indicate that your bitmap is there and hasn't disappeared?

Is your sata/sas/whatever controller sharing an irq at all? (you don't say what vintage your servers are).

One thing that would cause a resync is if a disk got ejected from and then re-added to the array. This could happen if one of your disk controllers or one channel of your disk controller hung, eg maybe it was being serviced by the hung cpu you mention in **. I've never had a disk fail on a Linux RAID before though so I don't know if re-adding is something that might happen automatically next boot under any circumstance... it seems unlikely though, and unwanted. Unless the disk just 'disappeared' instead of reporting failure... But really, any of this should be logged, if not at crash time, then at next startup.

The fact that you have other servers with identical software does seem to indicate a hardware failure. I've never been particularly comfortable that linux raid handles as many corner case failures as well as some of the hardware raid implementations.

How expensive is the server, how expensive is the downtime, and how expensive is your time? And more importantly, how valuable is the data? These sort of crashes would make me worry that my data is just not going to be there one morning, or worse, is getting silently corrupted requiring dipping into backup archives to restore.

James

On Sat, 4 Feb 2012, James Harper <james.harper@bendigoit.com.au> wrote:
Does anyone have any advice other than throwing the server in the bin?
I'd suggest netconsole but that's not going to help if your ethX interface is crashing, or are you already confident you are seeing all the messages at crash time?
When the server crashes it can't be pinged from either Ethernet port. That means that either both ports are entirely unusable or eth1 can't access the LAN and eth0 has no routing table (the server is in a DC in Germany and I can't ping eth0 from the LAN).
Does /proc/mdstat indicate that your bitmap is there and hasn't disappeared?
After further investigation I have discovered that I described the problem incorrectly. In future I will take more care about pasting data from the affected system so that anyone who wants to offer advice will know the correct situation even if I describe it badly. It seems that "check = " is not the same as "recovery = ", which is what you see when a drive has failed and been added again.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk

It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month, my slowest server didn't complete that in a reasonable amount of time while the other servers which aren't disk IO bound completed it before I noticed.

The question is whether the checkarray command does any good. I've run a lot of systems with Linux software RAID and don't recall ever seeing it do any good. While a multi-day cron job with performance implications is going to do some harm.

The concepts of BTRFS seem more appealing to me. If I had a BTRFS volume doing the RAID-1 then if the two disks differed then BTRFS would use checksums to determine which one was correct. Also with 2.7TB of disks and only 450G in use a BTRFS check would be a lot faster as it wouldn't check empty space. I'm assuming that BTRFS is good enough for Xen block devices...
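(A check like the one above can also be started, watched and aborted by hand through sysfs -- a minimal sketch, assuming md1 is the array in question; this is essentially the operation Debian's checkarray script requests.)

echo check > /sys/block/md1/md/sync_action    # start a check
cat /proc/mdstat                              # watch progress
cat /sys/block/md1/md/mismatch_cnt            # sectors that didn't match during the last check/repair
echo idle > /sys/block/md1/md/sync_action     # abort it if it's hurting performance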
Is your sata/sas/whatever controller sharing an irq at all? (you don't say what vintage your servers are).
The system was ordered new at the end of last year. It's got an i7-2600 CPU and I don't think it can be particularly old. /proc/interrupts indicates that no IRQ is shared, although I've never learned much about the new style of interrupts (which involves numbers >15).
serviced by the hung cpu you mention in **. I've never had a disk fail on a Linux RAID before though so I don't know if re-adding is something that might happen automatically next boot under any circumstance... it seems unlikely though, and unwanted. Unless the disk just 'disappeared' instead of reporting failure... But really, any of this should be logged, if not at crash time, then at next startup.
I've had failures in production before and not had it automatically re-add the disk. One thing though is that Linux software RAID is very hesitant to remove disks. The last time I threw a disk in the bin it was after the system BIOS gave a boot warning about SMART failures and the kernel gave SATA errors at boot, but software RAID kept it in the RAID set!
The fact that you have other servers with identical software does seem to indicate a hardware failure. I've never been particularly comfortable that linux raid handles as many corner case failures as well as some of the hardware raid implementations.
On the contrary, I KNOW that Linux software RAID is written by competent people. I'm more confident with the reliability of Linux RAID than with ANY hardware RAID.
How expensive is the server, how expensive is the downtime, and how expensive is your time? And more importantly, how valuable is the data? These sort of crashes would make me worry that my data is just not going to be there one morning, or worse, is getting silently corrupted requiring dipping into backup archives to restore.
The backups are adequate. I could get the server replaced, but now that I've got it working well I don't want to replace something I know with something I don't. I'm happy to live without TSO.

Thanks for your suggestions.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk
It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month, my slowest server didn't complete that in a reasonable amount of time while the other servers which aren't disk IO bound completed it before I noticed.
Of course... I thought it was the 1st Sunday in the month, but maybe that's just a Debian thing:

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

Is the server that much slower or that much more i/o bound that it would make a significant difference? You could still have a hardware problem causing a drop in disk IOPS.
The question is whether the checkarray command does any good. I've run a lot of systems with Linux software RAID and don't recall ever seeing it do any good. While a multi-day cron job with performance implications is going to do some harm.
There is obviously the performance hit to consider, but I bet the failure rate of disks is higher on the 1st Sunday of the month (or whenever your distribution automatically schedules it) than at other times.

One thing it does do for you is 'touch' unused blocks, and finding that those are bad now rather than later is better IMO. Also, verifying consistency and finding that you have a silent corruption problem early can only be a good thing. This is especially important for RAID5 without battery backed write cache as it can detect the RAID5 write-hole (http://en.wikipedia.org/wiki/RAID_5_write_hole). Maybe write-intent bitmaps get around this these days though?

I wonder if you can fiddle with the settings to only use a smaller amount of idle bandwidth (lower than --idle)? (if there is such a thing as idle bandwidth on your system)

James
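(On the idle-bandwidth question: the md resync/check rate is tunable; a hedged sketch, assuming the usual /proc and sysfs knobs -- values are in KiB/sec per device and purely illustrative.)

cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
echo 1000  > /proc/sys/dev/raid/speed_limit_min    # guaranteed minimum, even under load
echo 10000 > /proc/sys/dev/raid/speed_limit_max    # global ceiling
echo 10000 > /sys/block/md1/md/sync_speed_max      # or cap just one array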

Russell Coker wrote:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk
It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month, my slowest server didn't complete that in a reasonable amount of time while the other servers which aren't disk IO bound completed it before I noticed.
I'd have thought that was obvious from the emails cron sends about it. Hmm, on second thought maybe the emails I remember come via logcheck.

Russell Coker wrote:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk
It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month, my slowest server didn't complete that in a reasonable amount of time while the other servers which aren't disk IO bound completed it before I noticed.
I'd have thought that was obvious from the emails cron sends about it. Hmm, on second thought maybe the emails I remember come via logcheck.
I think cron will only email you on a failure so maybe logcheck? James

Hi, On 5/02/2012 12:10 PM, Trent W. Buck wrote:
Russell Coker wrote:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      2917680447 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 55.9% (1631568128/2917680447) finish=765.1min speed=28013K/sec
      bitmap: 1/22 pages [4KB], 65536KB chunk
It totally surprises me that you didn't know about this..... the checks have been there via cron for a very long time. Have you not set up /etc/aliases to get mail for root?

Every time this check takes place on my servers AND it cannot start immediately, I get a message like the following:

checkarray: I: selecting idle I/O scheduling class for resync of md0.

Otherwise, I think it just goes ahead and does the check.

You should also have something like the following in /etc/mdadm/mdadm.conf:

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

Cheers
--
Kind Regards
AndrewM

Andrew McGlashan
Broadband Solutions now including VoIP
Current Land Line No: 03 9012 2102
Mobile: 04 2574 1827  Fax: 03 9012 2178
National No: 1300 85 3804

Affinity Vision Australia Pty Ltd
http://www.affinityvision.com.au
http://adsl2choice.net.au

In Case of Emergency -- http://www.affinityvision.com.au/ice.html
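(If you want to confirm that alerts actually reach that address, a quick hedged test -- this exercises mdadm's own monitor/mail path rather than the cron job:)

mdadm --monitor --scan --oneshot --test    # sends a TestMessage alert for every array to MAILADDR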

On Sunday 05 February 2012 12:10:07 Trent W. Buck wrote:
I'd have thought that was obvious from the emails cron sends about it.
Certainly cron (for me with Ubuntu 11.04).

From: "root" <root@csamuel.org>
To: root@csamuel.org
Subject: Cron <root@quad> if [ -x /usr/share/mdadm/checkarray ] && [ $(date +%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi

checkarray: I: selecting idle I/O scheduling class for resync of md0.
checkarray: I: selecting idle I/O scheduling class for resync of md1.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On 2012-02-05 12:10, Trent W. Buck wrote:
It seems that this is from /etc/cron.d/mdadm having a checkarray command which runs on the 3rd of the month, my slowest server didn't complete that in a reasonable amount of time while the other servers which aren't disk IO bound completed it before I noticed.
I'd have thought that was obvious from the emails cron sends about it. Hmm, on second thought maybe the emails I remember come via logcheck.
Yep, my array did a check this weekend also. The first email I got was cron saying that the check had begun, and the next was logcheck reporting on error count etc (both attached for reference).

--
Regards,
Matthew Cengia

IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later. serious problems (eg. array failure) can occur during raid rebuild if the raid code tries to read from a second unrecoverably bad disk.

we lose a few disks every time we do a md 'check' over our 104 md raid6's, but many more of the arrays do routine rewrites and fixup bad disk sectors and make things far safer in the long term. we also have rewrites happening ~daily in normal operation as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.

in the home context, bad disk sectors and the ability of the md code to hide and remap these automatically is probably the best reason to make a home raid instead of just put a single 'big enough' disk in something. if one sector goes bad in that single disk then it's pretty much restore from backup time as one part of the fs will be forever unreadable until you find and write over the bad block. the fs can also shutdown or go read-only if it finds something unreadable. whereas if its in a raid5/6 you likely won't care or notice the problem, and if the raid code doesn't auto remap the sector for you then you can do a check/scrub or kick out the disk and dd over it at your leisure.

err, but having said that I'm currently thinking of single 3tb + a ~weekly 3tb backup disk to replace my htpc's 3 (all dying!) 1tb disks in raid5, as a single 3tb uses less power.

On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...

if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->

# Due to the fact that raid1/10 writes in the kernel are unbuffered,
# a raid1 array can have non-0 mismatch counts even when the
# array is healthy.  These non-0 counts will only exist in
# transient data areas where they don't pose a problem.  However,
# since we can't tell the difference between a non-0 count that
# is just in transient data or a non-0 count that signifies a
# real problem, simply don't check the mismatch_cnt on raid1
# devices as it's providing far too many false positives.  But by
# leaving the raid1 device in the check list and performing the
# check, we still catch and correct any bad sectors there might
# be in the device.

cheers,
robin

On Sat, 11 Feb 2012, Robin Humble <robin.humble@anu.edu.au> wrote:
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later.
serious problems (eg. array failure) can occur during raid rebuild if the raid code tries to read from a second unrecoverably bad disk.
Surely if there is a bad sector when doing a rebuild then it will only result in at most some corrupt data in one stripe. Surely no RAID implementation would be stupid enough to eject a second disk from a RAID-5 or a third disk from a RAID-6 because of a few errors!
we lose a few disks every time we do a md 'check' over our 104 md raid6's, but many more of the arrays do routine rewrites and fixup bad disk sectors and make things far safer in the long term. we also have rewrites happening ~daily in normal operation as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.
That would be only unrecoverable read errors though wouldn't it? Not sectors that quietly have bogus data. AFAIK the MD driver doesn't support reading the entire stripe for every read to detect quiet corruption.
in the home context, bad disk sectors and the ability of the md code to hide and remap these automatically is probably the best reason to make a home raid instead of just put a single 'big enough' disk in something.
Except that if you have a RAID-1 then you can quietly lose the data unless you read through all the logcheck messages because mdadm doesn't report it when stripes don't match up. To actually get this benefit it seems that you need either RAID-6 (which almost no-one wants in their home network) or a BTRFS RAID-1 (which isn't yet ready for production).
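(For comparison, a rough sketch of the BTRFS route being described, assuming two spare disks and a test mount point -- and bearing in mind the production-readiness caveat above:)

mkfs.btrfs -m raid1 -d raid1 /dev/sdc /dev/sdd   # mirror both data and metadata
mount /dev/sdc /mnt/test
btrfs scrub start /mnt/test                      # reads only allocated blocks; checksums decide which copy is good
btrfs scrub status /mnt/test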
if one sector goes bad in that single disk then it's pretty much restore from backup time as one part of the fs will be forever unreadable until you find and write over the bad block.
No, you generally just lose 1 file, or maybe 1 directory has its files go to lost+found.
the fs can also shutdown or go read-only if it finds something unreadable. whereas if its in a raid5/6 you likely won't care or notice the problem, and if the raid code doesn't auto remap the sector for you then you can do a check/scrub or kick out the disk and dd over it at your leisure.
But if it's RAID-5 then the current state of play is that you won't notice it if one disk returns bogus data and the RAID scrub of a RAID-5 will probably cause corruption to spread to another sector.
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...
That was a MD RAID-1 with 10M of random data dumped on one disk.
if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->
# Due to the fact that raid1/10 writes in the kernel are unbuffered,
# a raid1 array can have non-0 mismatch counts even when the
# array is healthy.
The only way a filesystem can be healthy in such a situation is if the journal covers it. If the filesystem is something like Ext3 then the journal replay will result in writes to the data sectors which fixes that problem. So the only way I can imagine this not being a problem is if the scrub happens on an unmounted filesystem that has a journal in need of replay or if the sectors in question correlate to an already committed section of the journal or unallocated disk space.
# These non-0 counts will only exist in
# transient data areas where they don't pose a problem.  However,
# since we can't tell the difference between a non-0 count that
# is just in transient data or a non-0 count that signifies a
# real problem, simply don't check the mismatch_cnt on raid1
# devices as it's providing far too many false positives.  But by
# leaving the raid1 device in the check list and performing the
# check, we still catch and correct any bad sectors there might
# be in the device.
Since we can't tell if it's a problem or not we will just pretend that it's not a problem.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Sat, Feb 11, 2012 at 12:54:55AM +1100, Russell Coker wrote:
On Sat, 11 Feb 2012, Robin Humble <robin.humble@anu.edu.au> wrote:
IMHO the main purpose of check/scrub on sw or hw raids isn't to detect "right now" problems, but to shake out unreadable sectors and bad disks so that they don't cause major drama later.
serious problems (eg. array failure) can occur during raid rebuild if the raid code tries to read from a second unrecoverably bad disk.
Surely if there is a bad sector when doing a rebuild then it will only result in at most some corrupt data in one stripe. Surely no RAID implementation would be stupid enough to eject a second disk from a RAID-5 or a third disk from a RAID-6 because of a few errors!
it can always eject a 2nd (or 3rd) disk for the same reason as it ejected the first - typically rewriting the bad sector failed, or too many bad sectors too close together, or ... so kick it out. the chances of it hitting an issue during rebuild are greater than during normal operation too as it has to read the whole disk to reconstruct the new drive (not just the bit of the drive with data on it), and also has to read the p or q parity parts of each stripe (that it otherwise never reads).

again, (still IMHO :-) 'check' is mostly there to weed out the crap disks and reduce the likelihood of multi-failure scenarios.

there's some code in recent kernels for tracking the location of various bad sectors on some parts of a raid, in order to have eg. N+2 on most of the array, and N+1 on a few stripes and still be able to work. I forget what the name of this feature is.
rewrites happening ~daily in normal operation as bad disk sectors are found during reads and remapped automatically by writes done by the raid6 code.
That would be only unrecoverable read errors though wouldn't it?
yup. they are common.
Not sectors that quietly have bogus data.
correct, but those are very rare.
AFAIK the MD driver doesn't support reading the entire stripe for every read to detect quiet corruption.
correct. IIRC the md developers' view is that that sort of corruption is best detected by checksums at the fs or at a (future?) scsi checksum protocol level. not sure I entirely agree with them, but hey.
the fs can also shutdown or go read-only if it finds something unreadable. whereas if its in a raid5/6 you likely won't care or notice the problem, and if the raid code doesn't auto remap the sector for you then you can do a check/scrub or kick out the disk and dd over it at your leisure.
But if it's RAID-5 then the current state of play is that you won't notice it if one disk returns bogus data and the RAID scrub of a RAID-5 will probably cause corruption to spread to another sector.
"one disk returns bogus data" is very rare, and just a 'check' won't change the data on the platters (unless it hits an unreadable sector) - it will just report the number of mismatches. but if you are in the rare 'silently corrupting disk' situation then yes, run a 'check' and it'll clock up a big mismatch count, which in the case of raid5/6 always means bad things.
On Mon, Feb 06, 2012 at 08:41:41AM +1100, Matthew Cengia wrote:
1 adam mdadm: RebuildFinished event detected on md device /dev/md/1, component device mismatches found: 10496
what sort of raid is it? 1,10,5,6? I may have missed that info in this thread...
That was a MD RAID-1 with 10M of random data dumped on one disk.
10M? I forget what unit the mismatch cnt is in. for raid6 I'm pretty sure it's 512bytes, so maybe the above means ~5M? it's a lot though either way, so yeah - probably busted hw. the question becomes what is low/normal (I saw up to 768 before I stopped being worried, and 128 mismatches was common), and what is busted hw :-/
if raid1/10 then /usr/sbin/raid-check (on fedora at least) doesn't email about problems ->
# Due to the fact that raid1/10 writes in the kernel are unbuffered,
# a raid1 array can have non-0 mismatch counts even when the
# array is healthy.
... <much deleted> or unallocated disk space.
yes, spurious mismatches are found in unallocated disk space. ie. free space blocks as far as the fs is concerned.

most likely scenario - the fs started doing i/o to one disk of the pair, then changed its mind and did the pair of DMA's to another location on the platters instead - voila - mismatches. the 'mismatch' region is still in free unallocated space as far as the fs is concerned, and it just hasn't been overwritten with new fs blocks yet, at which time the mismatch will go away. no corruption, but md sees mismatches.

BTW, pretty sure we've been through all this a couple of years ago on this list :) the linux-raid list answers this question a lot too.
Since we can't tell if it's a problem or not we will just pretend that it's not a problem.
99.9% of the time it really is not a problem. if I ran a 'check' across 600 raid1's now, I bet 10%-50% of them would come back with 'mismatches' and they'd all be spurious.

I guess if you had a threshold you knew was corruption vs. normal, then you could write a script to look at the mismatch_cnt and send an email. what would that level be though? depends on so many things...

cheers,
robin
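(A minimal sketch of the kind of script Robin describes -- the threshold and mail address are placeholders rather than recommendations, and it assumes the usual sysfs layout:)

#!/bin/sh
THRESHOLD=1024          # hypothetical cut-off between "noise" and "worth looking at"
MAILTO=root
for md in /sys/block/md*/md; do
    dev=$(basename $(dirname $md))
    count=$(cat $md/mismatch_cnt)
    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "$dev: mismatch_cnt=$count" | mail -s "md mismatch warning on $(hostname)" $MAILTO
    fi
done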

On Fri, Feb 10, 2012 at 12:23:27PM -0500, Robin Humble wrote:
it can always eject a 2nd (or 3rd) disk for the same reason as it ejected the first - typically rewriting the bad sector failed, or too many bad sectors too close together, or ... so kick it out.
another annoying cause of disks being kicked from mdadm (and zfs and presumably btrfs arrays too) is disk read timeouts due to the drive sleeping.

particularly common when you have a HW raid card of some sort in JBOD mode - these often have much lower timeouts than just bare disks on a m/b SATA interface, because the assumption is that it's in a high-end server with high-end "enterprise" drives (where spares are budgeted for and quickly available) rather than commodity drives in a home server. The card reports the drive as dead/dying/failed, and mdadm/zfs/btrfs kicks it....even though there's nothing wrong with the disk, it just took too long to wake up from sleeping.

the solution, at least for LSI 9211-8i SAS cards(*) like I have, is to re-flash the card's firmware in IT (Initiator Target) mode rather than Raid mode.

(this particular issue annoyed the hell out of me before i figured out what was going on)

(*) BTW, these cards are an extraordinarily cheap way of adding 8 SAS/SATA 6Gbps ports to your system. there are numerous re-badged models (from IBM, Dell, supermicro, and others), and they sell on ebay for anywhere from about $65 to $150. these are 8-port SAS 6Gbps cards...you can't even buy 4-port SATA cards for that. they do raid1/0/10 natively, but for linux you don't want that. just re-flash them with the IT firmware and run them as HBAs. ideal for mdadm, zfs, and btrfs. uses the mpt2sas driver in linux....GPL, and in the mainline kernel. nice.

craig

ps: speaking of sleeping drives, why do all drive manufacturers make their drives wake up when you query their temperature with SMART? meaning you can set them to sleep when idle (e.g. with hdparm -S) to reduce power usage and temperature, *OR* you can monitor their temperature, but you can't do both. you can use SMART to query some other drive data without waking them up, but not temperature. i've seen this with WD, Seagate, Hitachi, and Samsung drives. WTF?

--
craig sanders <cas@taz.net.au>

ps: speaking of sleeping drives, why do all drive manufacturers make their drives wake up when you query their temperature with SMART? meaning you can set them to sleep when idle (e.g. with hdparm -S) to reduce power usage and temperature, *OR* you can monitor their temperature, but you can't do both. you can use SMART to query some other drive data without waking them up, but not temperature.
i've seen this with WD, Seagate, Hitachi, and Samsung drives. WTF?
I can see that this is a limitation without reason, but what use is reading the temperature of a sleeping drive? It would throw out all your averages.

hddtemp will skip the temperature read if the drive is asleep unless you specify the '-w' option.

James

On Fri, Feb 10, 2012 at 11:21:04PM +0000, James Harper wrote:
I can see that this is a limitation without reason, but what use is reading the temperature of a sleeping drive? It would throw out all your averages.
i'm usually not interested in the average temp. i'm usually interested in the temp. right now and the trend over the last hour or so (i.e. how fast the temp. is rising)....to help decide whether to shut the system down or not on a hot summer day.

if the drive temps are significantly over about 40C and look like they're heading towards 50+C then i want to shut it down even if the drives are sleeping.

there's usually only a few days/year where it matters (and this summer has been good...i don't think we've had even one 40C day in Melbourne so far), but heat kills drives.
hddtemp will skip the temperature read if the drive is asleep unless you specify the '-w' option.
yep, but -w defeats the purpose of setting the drive's sleep timeout.

craig
--
craig sanders <cas@taz.net.au>

On Sat, 11 Feb 2012, Craig Sanders <cas@taz.net.au> wrote:
if the drive temps are significantly over about 40C and look like they're heading towards 50+C then i want to shut it down even if the drives are sleeping.
there's usually only a few days/year where it matters (and this summer has been good...i don't think we've had even one 40C day in Melbourne so far), but heat kills drives.
If the drives are sleeping then why not just use the temperature as reported by the motherboard? You could calibrate this by determining the difference in temperature between the motherboard and an idly loaded disk and assume that the difference is the same.

Of course the other option is to rely on not all disks in your RAID set dying at once and just letting it run. I rely on the weather forecasts to determine when I should shut systems down; if we get a forecast of >40C then systems which aren't air-conditioned should be shut down.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

On Fri, Feb 10, 2012 at 11:21:04PM +0000, James Harper wrote:
I can see that this is a limitation without reason, but what use is reading the temperature of a sleeping drive? It would throw out all your averages.
i'm usually not interested in the average temp. i'm usually interested in the temp. right now and the trend over the last hour or so (i.e. how fast the temp. is rising)....to help decide whether to shut the system down or not on a hot summer day.
if the drive temps are significantly over about 40C and look like they're heading towards 50+C then i want to shut it down even if the drives are sleeping.
there's usually only a few days/year where it matters (and this summer has been good...i don't think we've had even one 40C day in Melbourne so far), but heat kills drives.
hddtemp will skip the temperature read if the drive is asleep unless you specify the '-w' option.
yep, but -w defeats the purpose of setting the drive's sleep timeout.
So... don't use the -w option, and hddtemp won't wake up the drives to ask them their temperature. Get your reading from another sensor if that happens, eg:

temp=`hddtemp /dev/sdn || get_mb_temp`

An asleep drive isn't going to overheat. Would it be fair to assume that you are also monitoring the CPU temperature? That's probably another good candidate for detecting an overtemperature condition that requires a shutdown.

It is a little puzzling as to why an ATA drive needs to be spun up to take a temperature reading though... maybe the sensor is inside the enclosure and requires some airflow to ensure an accurate reading?

James
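(Fleshing that out a little: a hypothetical sketch that only asks a drive for its temperature when it is already spun up, and otherwise falls back to a motherboard sensor -- hdparm -C checks the power state without waking the drive, and the lm-sensors fallback is illustrative only.)

#!/bin/sh
DISK=/dev/sda
if hdparm -C $DISK | grep -q 'active/idle'; then
    hddtemp -n $DISK                 # drive is awake, safe to ask
else
    sensors | grep -i -m1 'temp1'    # drive asleep; report a board sensor instead
fi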

On Sat, Feb 11, 2012 at 12:14:03AM +0000, James Harper wrote:
yep, but -w defeats the purpose of setting the drive's sleep timeout.
So... don't use the -w option, and hddtemp won't wake up the drives to ask them their temperature. Get your reading from another sensor if that happens, eg:
well, duh. i already pointed out that you can configure the drives to sleep when idle or you can monitor the temperature, but not both (technically, you can configure exactly that but it's pointless because reading the temp wakes the drive).

at the moment, i'm not doing either because I haven't yet reflashed my SAS card to IT mode (can't do it in my current m/b, i have to put the card in some other m/b and i haven't got around to it yet), and sleeping drives time out and get kicked out of my ZFS pools. I disabled hddtemp (and my munin script for it) when i started playing with drive sleeping, and haven't cared enough to re-enable it. it hasn't been hot enough this summer to bother :)
It is a little puzzling as to why an ATA drive needs to be spun up to take a temperature reading though... maybe the sensor is inside the enclosure and requires some airflow to ensure an accurate reading?
yeah, well that was what the PS in my post was about. it's a huge WTF!

craig
--
craig sanders <cas@taz.net.au>

BOFH excuse #412: Radial Telemetry Infiltration

On Saturday 11 February 2012 09:29:38 Craig Sanders wrote:
another annoying cause of disks being kicked from mdadm (and zfs and presumably btrfs arrays too) is disk read timeouts due to the drive sleeping.
Not just sleeping, "consumer" drives apparently try much harder to recover from dodgy sectors, up to 2 minutes for some drives so I'm told. "Enterprise" drives give up much quicker in the assumption that they're in a RAID array.

Not surprisingly RAID code tends to not be very tolerant of disks that take so long to respond..

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Saturday 11 February 2012 09:29:38 Craig Sanders wrote:
another annoying cause of disks being kicked from mdadm (and zfs and presumably btrfs arrays too) is disk read timeouts due to the drive sleeping.
Not just sleeping, "consumer" drives apparently try much harder to recover from dodgy sectors, up to 2 minutes for some drives so I'm told. "Enterprise" drives give up much quicker in the assumption that they're in a RAID array. Not surprisingly RAID code tends to not be very tolerant of disks that take so long to respond..
"Enterprise" drives don't normally cost that much more, although looking at the prices right now they are quite a bit more expensive at the moment from at least one of our suppliers. Does anyone know if these timeouts are just a default setting that can be 'tweaked' or is it a function of firmware? James

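(On James's question just above: partly tweakable, on drives that support SCT Error Recovery Control -- a hedged example; times are in tenths of a second, and many consumer drives either lack the feature or forget the setting across a power cycle.)

smartctl -l scterc /dev/sda          # query current read/write recovery limits, if supported
smartctl -l scterc,70,70 /dev/sda    # cap both at 7 seconds, more like "enterprise" behaviour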
Hi, On 11/02/2012 1:10 PM, James Harper wrote:
"Enterprise" drives don't normally cost that much more, although looking at the prices right now they are quite a bit more expensive at the moment from at least one of our suppliers.
Drive pricing was severely affected by the Thailand floods.

There is enough price differential between standard and enterprise drives, but the standard ones come with a lesser warranty -- I think it is better to choose enterprise ALWAYS, just due to the warranty period extension. If the warranty is longer, then it should follow that the drive is more reliable and can be trusted more easily (aside from the "use" type settings).

SCSI was the enterprise type drive of the past, now it is SAS over SATA, but only very generally .... SATA has many more options and is less pricey than SAS. If you want to compare standard drives to enterprise SAS drives [or even enterprise SATA against enterprise SAS], well the price differential is much greater.

Cheers
--
Kind Regards
AndrewM

Andrew McGlashan
Broadband Solutions now including VoIP
Current Land Line No: 03 9012 2102
Mobile: 04 2574 1827  Fax: 03 9012 2178
National No: 1300 85 3804

Affinity Vision Australia Pty Ltd
http://www.affinityvision.com.au
http://adsl2choice.net.au

In Case of Emergency -- http://www.affinityvision.com.au/ice.html

On Sat, Feb 11, 2012 at 12:34:30PM +1100, Chris Samuel wrote:
On Saturday 11 February 2012 09:29:38 Craig Sanders wrote:
another annoying cause of disks being kicked from mdadm (and zfs and presumably btrfs arrays too) is disk read timeouts due to the drive sleeping.
Not just sleeping, "consumer" drives apparently try much harder to recover from dodgy sectors, up to 2 minutes for some drives so I'm told. "Enterprise" drives give up much quicker in the assumption that they're in a RAID array.
yep, that too. sleeping just makes it really obvious and frequent :(
Not surprisingly RAID code tends to not be very tolerant of disks that take so long to respond..
hence the reason for re-flashing the LSI 9211-8i and similar cards to "IT" mode so it's just a plain dumb HBA without the enterprise level TLER that's usually still in a RAID card's "JBOD" mode.

HBA mode is perfect for mdadm, btrfs, zfs, and other software raid/raid-like things when using consumer grade drives.

craig
--
craig sanders <cas@taz.net.au>

On Saturday 11 February 2012 13:20:05 Craig Sanders wrote:
HBA mode is perfect for mdadm, btrfs, zfs, and other software raid/raid-like things when using consumer grade drives.
But is the kernel's RAID code any more tolerant of waiting for a minute or so for a drive to respond before declaring it dead than RAID cards?

I guess at least with software solutions you can hack it to be so..

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On Sun, Feb 12, 2012 at 03:02:10PM +1100, Chris Samuel wrote:
On Saturday 11 February 2012 13:20:05 Craig Sanders wrote:
HBA mode is perfect for mdadm, btrfs, zfs, and other software raid/raid-like things when using consumer grade drives.
But is the kernel's RAID code any more tolerant of waiting for a minute or so for a drive to respond before declaring it dead than RAID cards?
I guess at least with software solutions you can hack it to be so..
yep. that's the point :) hard-coded vs hackable or s/w tunable. leave it up to the software to decide.

of course, that implies that the short timeouts on enterprise drives are preferable to the long timeouts on consumer drives because that allows the kernel to decide how many times / how long to retry for.

craig
--
craig sanders <cas@taz.net.au>
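(The knob on the kernel side that lets the software make that decision is the SCSI layer's per-device command timeout -- a sketch, assuming a SATA/SAS disk that shows up as /dev/sda:)

cat /sys/block/sda/device/timeout          # default is usually 30 seconds
echo 180 > /sys/block/sda/device/timeout   # give a consumer drive time to finish its own error recovery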