
On Tue, May 14, 2013 at 06:35:26AM +0000, James Harper wrote:
> What can cause these unrecoverable read errors? Is losing power mid-write enough to cause this to happen? Or maybe a knock while writing? I grabbed these 1TB disks out of a few old PCs and NASes I had lying around the place, so their history is entirely uncertain. I definitely can't tell if the errors were already present when I started using ceph on them.
It's best to think of disks as analogue devices pretending to be digital. Often they can't read a marginal sector one day and then it's fine again the next day. Some sectors come and go like this indefinitely, while others are bad enough that they get remapped and you never have an issue with them again. If the disk as a whole is bad enough, it runs out of spare sectors to do the remapping with and the disk is dead. In my experience disks usually become unusable (slow, erratic, hangs drivers, etc.) before they run out of spare sectors. With today's disk capacities this is just what you have to expect, and software needs to be able to deal with it. Silent data corruption is a much, much rarer and nastier problem, and is the motivation behind the checksums in zfs, btrfs, xfs metadata, etc.
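If you want to see where a given disk is in that remapping cycle, something like the sketch below will pull out the relevant SMART counters. It's only a rough illustration: it assumes smartctl (from smartmontools) is installed, and the raw-value parsing is simplistic and may need adjusting for a particular drive's firmware.

#!/usr/bin/env python3
# Rough sketch: report the SMART counters that track remapped and pending
# sectors. Assumes smartctl (smartmontools) is installed and that a device
# path is given on the command line, e.g. ./smart_check.py /dev/sda

import subprocess
import sys

# Attributes relevant to the remapping behaviour described above
INTERESTING = {
    "Reallocated_Sector_Ct",    # sectors already remapped to spares
    "Current_Pending_Sector",   # unreadable sectors waiting to be remapped
    "Offline_Uncorrectable",    # sectors that failed during offline scans
}

def smart_counters(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    counters = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute lines look like: ID NAME FLAG VALUE WORST THRESH TYPE
        # UPDATED WHEN_FAILED RAW_VALUE; the raw value is the last column.
        if len(fields) >= 10 and fields[1] in INTERESTING:
            counters[fields[1]] = int(fields[-1])
    return counters

if __name__ == "__main__":
    for name, value in sorted(smart_counters(sys.argv[1]).items()):
        print(f"{name}: {value}")

A non-zero Current_Pending_Sector is the "can't read it today" state described above; once the sector is rewritten (or given up on), it either clears or moves into Reallocated_Sector_Ct.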
> Is Linux MD software smart enough to rewrite a bad sector with good data to clear this type of error (keeping track of error counts to know when to eject the disk from the array)?
Yes. On a read error md reconstructs the block from the remaining devices and rewrites it in place; if the rewrite itself fails, the drive gets kicked out of the array.
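For what it's worth, md also exposes a manual scrub through sysfs, so you can make it find and rewrite marginal sectors on your own schedule rather than waiting for a normal read to hit them. A minimal sketch (it assumes the array is md0 and that it runs as root; the paths are the standard md sysfs interface):

#!/usr/bin/env python3
# Minimal sketch: trigger an md 'repair' pass on /dev/md0 via sysfs and
# report the mismatch count when it finishes. Assumes the array name and
# that this runs as root.

import time

MD = "/sys/block/md0/md"

# A scrub reads every sector; sectors that return read errors are
# reconstructed from the other devices and rewritten in place.
# 'repair' additionally rewrites any parity/mirror mismatches it finds.
with open(f"{MD}/sync_action", "w") as f:
    f.write("repair\n")

# Wait until the pass completes (sync_action returns to 'idle')
while True:
    with open(f"{MD}/sync_action") as f:
        if f.read().strip() == "idle":
            break
    time.sleep(30)

with open(f"{MD}/mismatch_cnt") as f:
    print("mismatch_cnt:", f.read().strip())

Most distros ship a cron job that does essentially this (a monthly 'check') for exactly the reasons above.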
> What about btrfs/zfs? Trickier with something like ceph, where ceph runs on top of a filesystem that isn't itself redundant...
All raid-like things need to deal with the expected 1-10% of real disk failures a year. Depending on how they're implemented, they can either turn these soft, recoverable, semi-failing scenarios into just more disk failures, or (like md does) try hard to recover the disk and data in situ with smart rewriting and timeouts. The problem with kicking a disk out at the first simple error is that a full rebuild involves a lot of I/O and so is asking for a second failure. Ideally it would be the user's call whether the raid-like layer tries hard or just fails the disk out straight away, depending on the seriousness of the error, the current redundancy level, the disk characteristics, how valuable the data is, whether I/O is latency-sensitive, whether the data is backed up, etc., but that does seem quite complicated :-)

As ceph is pitched as being for non-raid devices, I would assume ceph must have 'filesystem gone read-only' detection (i.e. the fs got a read error from a disk) as well as 'disk/node hung/stopped' timeout detection. These are coarse but probably effective techniques. Hopefully there is then something automated to dd /dev/zero over such disks (and rebuild the filesystems and re-add them to the active pool, on probation), otherwise it will be a lot of work to track down each disk and do it manually.

cheers,
robin
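P.S. Purely to illustrate what that coarse 'filesystem gone read-only' detection could look like (this is speculation about what ceph might do, not anything taken from ceph, and the OSD data path below is an assumption):

#!/usr/bin/env python3
# Illustrative only: a coarse 'filesystem gone read-only' check for an
# OSD data directory, based on the mount options in /proc/mounts.

import os

OSD_PATH = "/var/lib/ceph/osd/ceph-0"   # hypothetical OSD data directory

def mount_is_readonly(path):
    """Return True if the mount containing 'path' has the 'ro' option set."""
    best_len, best_opts = -1, None
    with open("/proc/mounts") as f:
        for line in f:
            _dev, mountpoint, _fstype, options = line.split()[:4]
            prefix = mountpoint.rstrip("/") + "/"
            # Keep the longest mount point that contains the path
            if (path == mountpoint or path.startswith(prefix)) \
                    and len(mountpoint) > best_len:
                best_len, best_opts = len(mountpoint), options.split(",")
    return best_opts is not None and "ro" in best_opts

if __name__ == "__main__":
    path = os.path.realpath(OSD_PATH)
    if mount_is_readonly(path):
        print(f"{OSD_PATH}: filesystem is read-only, treat the OSD as failed")
    else:
        print(f"{OSD_PATH}: filesystem is still writable")

It only catches the case where the kernel has already remounted the fs read-only after an error; a hung disk needs the separate timeout-style detection mentioned above.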