Hi,
On 2/05/2013 12:06 AM, James Harper wrote:
I have just had two drives fail in a server today.
One is mostly part of a RAID0 set (which is in turn part of a DRBD, so we're still
good), plus a small partition that is part of a RAID1, which hasn't been marked as
failed (the errors are about 1.3TB into a 2TB disk). The other is one I was testing;
it wasn't particularly new and doesn't really matter.
Both drives have logged read errors under the Linux kernel, both report a healthy
status (SMART overall-health self-assessment test result: PASSED), and both say
"Completed: read failure" almost immediately when I run a SMART self-test (short
or long).
I don't really have any trouble with the fact that two drives have failed, but
I'm really surprised that SMART still reports that the drive is good when it is
clearly not... what's with that?
This from Google:
"Our analysis identifies several parameters from the drive’s
self monitoring facility (SMART) that correlate highly with
failures. Despite this high correlation, we conclude that models
based on SMART parameters alone are unlikely to be useful
for predicting individual drive failures. Surprisingly, we found
that temperature and activity levels were much less correlated
with drive failures than previously reported."
In a nutshell, SMART is not a good indicator of pending failure: use
it as an indication only, and certainly don't count on it. Really,
SMART is next to useless overall, so it isn't even much of a "real"
indicator. YMMV.
https://static.googleusercontent.com/external_content/untrusted_dlcp/resear…
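If you still want to get something out of SMART, one option is to look at
the self-test result as well as the overall-health line, since in James's
case the two disagree. Below is a rough Python sketch of that idea. It only
parses report text that you pass in and does not run any tool itself. The
two strings it matches are the ones quoted in this thread; the function
name, the sample report and the numbers in it are made up for illustration,
and the exact layout of a real report may differ.

```python
import re

# Hypothetical sample report. Only the "PASSED" line and the
# "Completed: read failure" text come from this thread; the surrounding
# columns and numbers are invented for illustration.
SAMPLE_REPORT = """\
SMART overall-health self-assessment test result: PASSED
# 1  Short offline       Completed: read failure       90%      1234         123456
"""


def assess_smart_report(report: str) -> str:
    """Classify report text as 'failing', 'suspect', 'ok' or 'unknown'."""
    health = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", report
    )
    # A self-test that ended in a read failure is treated as a warning
    # sign even when the overall-health verdict is PASSED.
    selftest_read_failure = re.search(r"Completed:\s*read failure", report)

    if health and health.group(1) != "PASSED":
        return "failing"
    if selftest_read_failure:
        return "suspect"
    if health:
        return "ok"
    return "unknown"


if __name__ == "__main__":
    print(assess_smart_report(SAMPLE_REPORT))
```

Even then, treat "ok" as "no evidence of a problem in this report", not as
a prediction that the drive will keep working.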
Cheers
AndrewM