
On Tue, 19 Jan 2016 11:24:44 PM James Harper wrote:
If you use SSDs for any sort of intensive storage, do keep an eye on the SMART "media wearout" values, and replace them before the counter hits 0 (or 1). For the disks we were using (Intel DataCentre SSDs), the docs say
Interesting, I didn't realise that was the case. I've seen various claims of SSD wearout times being 5+ years, which means they would be obsolete before they wear out.
disk has worn out), and in our case the performance went to crap sometime after the counter hit 1, causing considerable frustration to all involved.
Do you know of any good software to measure performance changes in disks? Performance often changes before drives develop faults, and it would be good to measure that. E.g. if the number of IOPS the disk delivers when it's near 100% utilised according to the iostat algorithm suddenly changes by an order of magnitude then it's probably due for replacement.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
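The check described above (IOPS at near-100% utilisation changing by an order of magnitude) could be sketched roughly as follows. This is only an illustration, not an existing tool: the function names, the 95% utilisation cut-off and the factor of 10 are all assumptions, and the samples are assumed to be (utilisation %, IOPS) pairs collected by whatever monitoring is already in place, e.g. from the %util and r/s + w/s columns of `iostat -x`.

```python
from statistics import median

def saturated_iops(samples, util_threshold=95.0):
    """Keep only the IOPS values sampled while the disk was near 100% utilised.

    samples: iterable of (util_percent, iops) pairs (hypothetical input format).
    """
    return [iops for util, iops in samples if util >= util_threshold]

def iops_dropped(baseline, recent, factor=10.0, util_threshold=95.0):
    """Return True if median saturated IOPS fell by at least `factor`.

    baseline: samples from a period when the disk was known to be healthy.
    recent:   samples from the period being checked.
    Returns False when either period has no saturated samples, since
    there is then nothing comparable to judge.
    """
    base = saturated_iops(baseline, util_threshold)
    now = saturated_iops(recent, util_threshold)
    if not base or not now:
        return False
    return median(now) * factor <= median(base)
```

Comparing medians rather than single samples is a deliberate choice here, so that one slow interval does not trigger a replacement. It only looks at samples where the disk was saturated, because low IOPS on an idle disk says nothing about its health.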

disk has worn out), and in our case the performance went to crap sometime after the counter hit 1, causing considerable frustration to all involved.
Do you know of any good software to measure performance changes in disks? Performance often changes before drives develop faults, and it would be good to measure that. E.g. if the number of IOPS the disk delivers when it's near 100% utilised according to the iostat algorithm suddenly changes by an order of magnitude then it's probably due for replacement.
The failure mode was really strange in this case. The performance issue actually came good by itself on 2 occasions before we finally brought the server down for investigation, and found that the disks were 'worn out' (we couldn't check the SMART counters "through" the RAID controller, so that had to be done offline). Any tests we did on the disks individually showed no problems, although we didn't test them extensively.

SSDs are cheap enough (even DC rated ones) that if you suspect problems, and further "try it and see" testing would cause pain to an office full of employees, it is best to just replace them.

Zabbix is what I use to graph performance on servers (% utilisation and disk queue length vs IOPS are useful measures), but as best as I could tell there was no real gradual reduction in performance, it just suddenly went bad. It didn't help that the Intel RAID controller wasn't very good at giving up any details about cache utilisation either.

James
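For anyone wanting to watch the wearout counter automatically, here is a minimal sketch that pulls it out of the attribute table printed by `smartctl -A`. It assumes the drive reports the attribute under the name `Media_Wearout_Indicator` (as Intel SSDs commonly do, attribute 233) and that the disk is directly visible to smartctl, which was not the case behind the RAID controller described above. The function names and the replacement floor of 10 are illustrative assumptions, not vendor guidance.

```python
def media_wearout(smartctl_output, name="Media_Wearout_Indicator"):
    """Return the normalised VALUE of the wearout attribute, or None if absent.

    smartctl_output: text of the attribute table as printed by `smartctl -A`.
    Attribute rows have the columns:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == name:
            return int(fields[3])  # VALUE column, e.g. "097" -> 97
    return None

def needs_replacement(value, floor=10):
    """True when the counter is at or below `floor` (an assumed safety margin)."""
    return value is not None and value <= floor
```

The text could come from something like `subprocess.run(["smartctl", "-A", "/dev/sdX"], capture_output=True, text=True).stdout`, with the device path adjusted to suit. Feeding the returned value into Zabbix or similar would let it be graphed and alerted on well before it reaches 1.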
participants (2)
- James Harper
- Russell Coker