On 2/05/2013 12:06 AM, James Harper wrote:
I have just had two drives fail in a server
today. One is mostly part of a
RAID0 set (which is in turn part of a DRBD, so
we're still good) and a small
partition that is part of a RAID1, which hasn't been failed (the errors are about
1.3TB into the 2TB disk). The other is a drive I was testing; it wasn't particularly
new and doesn't really matter.
Both drives have logged read errors under the Linux kernel, both report a
healthy status (SMART overall-health self-assessment test result: PASSED),
and both say "Completed: read failure" almost immediately when I run a
SMART self test (short or long).
I don't really have any trouble with the fact that two drives have failed, but
I'm really surprised that SMART still reports that the drive is good when it is
clearly not... what's with that?
This from Google:
"Our analysis identifies several parameters from the drive's
self monitoring facility (SMART) that correlate highly with
failures. Despite this high correlation, we conclude that
models based on SMART parameters alone are unlikely to be useful
for predicting individual drive failures. Surprisingly, we found
that temperature and activity levels were much less correlated
with drive failures than previously reported."
In a nutshell, SMART is not a good indicator of pending failure ... use
it as an indication only, but certainly don't count on it. But really,
SMART is next to useless overall, so it isn't even much of a "real"
indicator.
It's frustrating because a simple "if hard read errors > 0 || failed self
tests > 0 then drive = not okay" rule would have meant I could just read the
SMART health indicator and eject the drive from the array (or whatever it belonged to).