SMART self-test failed but drive still healthy

I have just had two drives fail in a server today. One is mostly part of a RAID0 set (which is in turn part of a DRBD, so we're still good), plus a small partition that is part of a RAID1, which hasn't been marked as failed (the errors are about 1.3TB into the 2TB disk). The other is one I was testing; it wasn't particularly new and doesn't really matter.

Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).

I don't really have any trouble with the fact that two drives have failed, but I'm really surprised that SMART still reports that the drive is good when it is clearly not... what's with that?

James
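For reference, the checks described above map onto smartctl roughly as follows. This is a minimal sketch in Python (assuming smartmontools is installed, an ATA drive, and a hypothetical device path of /dev/sda, since the exact commands weren't shown in the message):

    import subprocess, time

    DEV = "/dev/sda"  # hypothetical device path

    def smartctl(*args):
        # smartctl uses its exit status as a bitmask, so inspect stdout
        # rather than relying on the return code.
        return subprocess.run(["smartctl", *args, DEV],
                              capture_output=True, text=True).stdout

    # Overall health flag: prints a line like
    # "SMART overall-health self-assessment test result: PASSED"
    print(smartctl("-H"))

    # Kick off a short self-test ("-t", "long" for the extended test),
    # give it a couple of minutes, then read the self-test log, where a
    # failing drive shows entries such as "Completed: read failure".
    smartctl("-t", "short")
    time.sleep(180)
    print(smartctl("-l", "selftest"))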

On 01/05/2013, at 3:06 PM, James Harper <james.harper@bendigoit.com.au> wrote:
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
Have you tried the same test with the drives in another machine? Are they under warranty? Can't speak for all manufacturers, but HP drives I've had fail in self-testing were replaced immediately under warranty.

On 01/05/2013, at 3:06 PM, James Harper <james.harper@bendigoit.com.au> wrote:
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
Have you tried the same test with the drives in another machine?
Not yet. Will be doing this shortly.
Are they under warranty? Can't speak for all manufacturers, but HP drives I've had fail in self-testing were replaced immediately under warranty.
They are Seagate ES drives, so designed for 24x7 use. Not sure how long the explicit warranty is, but the consumer guarantee should ensure I get a replacement even if the part only has a 1-year written warranty. No hard disk manufacturer is going to argue that their product shouldn't reasonably be expected to last 3 years. Of course, if it takes me more than an hour of my time to chase the warranty it isn't worth it; the disk is only around $250.

Last time we had a customer with a failed Seagate hard disk I recommended they buy a new one and then pursue the warranty (which is what I'm doing), as the warranty process can take ages. It turns out the warranty turnaround was days, not weeks like I thought it would be, so I was impressed. Buying a new disk was still the faster option to ensure redundancy, though.

James

James Harper writes:
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
I don't trust the "overall health". If a short or long test fails, I chuck the drive.

James Harper writes:
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
I don't trust the "overall health". If a short or long test fails, I chuck the drive.
Of course. I'm just surprised that a self-test can fail yet the drive still reports itself as healthy.

James

Hi,
On 2/05/2013 12:06 AM, James Harper wrote:
I have just had two drives fail in a server today. One is mostly part of a RAID0 set (which is in turn part of a DRBD, so we're still good), plus a small partition that is part of a RAID1, which hasn't been marked as failed (the errors are about 1.3TB into the 2TB disk). The other is one I was testing; it wasn't particularly new and doesn't really matter.
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
I don't really have any trouble with the fact that two drives have failed, but I'm really surprised that SMART still reports that the drive is good when it is clearly not... what's with that?
This from Google:

"Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported."

In a nutshell, SMART is not a good indicator of pending failure ... use it as an indication only, but certainly don't count on it. But really, SMART is next to useless overall, so it isn't even much of a "real" indicator.... YMMV.

https://static.googleusercontent.com/external_content/untrusted_dlcp/researc...

Cheers
AndrewM

Hi,
On 2/05/2013 12:06 AM, James Harper wrote:
I have just had two drives fail in a server today. One is mostly part of a RAID0 set (which is in turn part of a DRBD, so we're still good), plus a small partition that is part of a RAID1, which hasn't been marked as failed (the errors are about 1.3TB into the 2TB disk). The other is one I was testing; it wasn't particularly new and doesn't really matter.
Both drives have logged read errors under the Linux kernel, both report a healthy status (SMART overall-health self-assessment test result: PASSED), and both say "Completed: read failure" almost immediately when I run a SMART self-test (short or long).
I don't really have any trouble with the fact that two drives have failed, but I'm really surprised that SMART still reports that the drive is good when it is clearly not... what's with that?
This from Google:
"Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that mod- els based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported."
In a nutshell, SMART is not a good indicator of pending failure ... use it as an indication only, but certainly don't count on it. But really, SMART is next to useless overall, so it isn't even much of a "real" indicator.... YMMV.
It's frustrating because a simple "if hard read errors > 0 || failed self tests > 0 then drive = not okay" would have meant I could just read the SMART health indicator and eject the drive from the array (or whatever it belonged to).

James
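As a rough sketch of that rule (again assuming smartmontools, ATA-style smartctl output and a hypothetical device path), one could combine the self-test log with the usual sector-level attributes, IDs 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable), instead of relying on the overall-health verdict:

    import subprocess

    # Attributes whose raw value should be zero on a healthy drive.
    # Names and exact semantics vary a little between vendors.
    SUSPECT_IDS = {"5", "197", "198"}

    def smartctl(args, device):
        return subprocess.run(["smartctl", *args, device],
                              capture_output=True, text=True).stdout

    def suspect_raw_counts(device):
        # Attribute rows look like:
        # "  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0"
        # with RAW_VALUE in the tenth column.
        counts = {}
        for line in smartctl(["-A"], device).splitlines():
            fields = line.split()
            if fields and fields[0] in SUSPECT_IDS:
                counts[fields[1]] = int(fields[9])
        return counts

    def any_selftest_failed(device):
        # Matches entries such as "Completed: read failure" in the self-test log.
        return "read failure" in smartctl(["-l", "selftest"], device)

    def drive_not_okay(device):
        return any_selftest_failed(device) or any(
            raw > 0 for raw in suspect_raw_counts(device).values())

    dev = "/dev/sda"  # hypothetical device path
    print(dev, "NOT okay" if drive_not_okay(dev) else "looks okay")

The raw counts are used deliberately here: the normalized VALUE/THRESH columns are essentially what the drive's own PASSED verdict is derived from, which is the indicator that proved too forgiving above.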

James Harper <james.harper@bendigoit.com.au> writes:
It's frustrating because a simple "if hard read errors > 0 || failed self tests > 0 then drive = not okay" would have meant I could just read the SMART health indicator and eject the drive from the array (or whatever it belonged to).
IIRC, with the heterogeneous disks in an array I had once, I was getting 10x the number of errors on one pair of disks compared to the other pair. It turned out that Seagate was reporting uncorrectable errors and WD was reporting all errors -- the Seagate had an extra field where it reported the raw error rate. If you are gonna script a "not okay" heuristic, be careful not to overgeneralize from one vendor to the next.

James Harper <james.harper@bendigoit.com.au> writes:
It's frustrating because a simple "if hard read errors > 0 || failed self tests > 0 then drive = not okay" would have meant I could just read the SMART health indicator and eject the drive from the array (or whatever it belonged to).
IIRC, with the heterogeneous disks in an array I had once, I was getting 10x the number of errors on one pair of disks compared to the other pair. It turned out that Seagate was reporting uncorrectable errors and WD was reporting all errors -- the Seagate had an extra field where it reported the raw error rate.
If you are gonna script a "not okay" heuristic, be careful not to overgeneralize from one vendor to the next.
That's why I want the vendors to make the leap that "unrecoverable read error = unhealthy disk". The reported counters are not reliable, as you say.

James
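One way to see that vendor difference before trusting any counter is simply to dump attribute 1 (Raw_Read_Error_Rate) from a few drives side by side: on some Seagate models the raw column is reportedly a large composite value rather than a count of uncorrected errors, while the normalized VALUE/WORST/THRESH columns are at least intended to be comparable. A sketch, with hypothetical device paths:

    import subprocess

    def read_error_rate(device):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            # Columns: ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE,
            # UPDATED, WHEN_FAILED, RAW_VALUE
            if len(fields) >= 10 and fields[0] == "1" and fields[1] == "Raw_Read_Error_Rate":
                return {"value": fields[3], "worst": fields[4],
                        "thresh": fields[5], "raw": fields[9]}
        return None

    for dev in ("/dev/sda", "/dev/sdb"):  # hypothetical device paths
        print(dev, read_error_rate(dev))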
Participants (4):
- Andrew McGlashan
- James Harper
- Jeremy Visser
- trentbuck@gmail.com