
(it seems that "reply-all" no longer includes luv-main (from ms outlook at least), so I have to include it manually... what's with that?)
Of course a RAID-1 of SSDs will massively outperform the RAID-5 you have.
Given the size I guess it's one of the older HP servers that only takes ~70G disks. If you buy a cheap Dell PowerEdge server and put a couple of Intel SSDs (not bought from Dell because Dell charges heaps for storage) in a RAID-1 configuration it will massively outperform the old HP server.
If you use SSDs for any sort of intensive storage, do keep an eye on the SMART "media wearout" values, and replace them before the counter hits 0 (or 1). For the disks we were using (Intel DataCentre SSDs), the docs say that while the disk may well keep running for a long time after the counter hits 1, it is considered worn out and is no longer covered by warranty.

SMART does not consider this a failure or old age (the threshold is 0 but the counter never goes below 1), so you have to actually monitor the counter. The RAID controller probably won't tell you that the disk has worn out either, and in our case the performance went to crap sometime after the counter hit 1, causing considerable frustration to all involved. Different models and manufacturers obviously differ in this respect too.

I'm seeing time-to-replacement of about 12 months on a high-load system where the SSDs are used for a RAID cache (ZFS, Intel RAID controllers, etc). At home, my little router, which is just a laptop running Squid, has used up 2 of 100 SMART units in the ~12 months it has been running.

Not particularly relevant to the discussion at hand, but with suggestions of "put in SSDs and all your trouble will go away", it is something you need to consider.

James
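A minimal sketch of that kind of check, assuming smartmontools is installed, that the wearout figure is exposed as attribute 233, and that the device path and the warning threshold of 10 are placeholder values; the attribute ID and name vary by vendor:

$ # warn when the normalised media-wearout value (column 4 of smartctl -A) drops towards 1
$ smartctl -A /dev/sda | awk '$1 == 233 && $4 + 0 < 10 { print "replace /dev/sda: wearout value is " $4 }'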

On Tue, Jan 19, 2016 at 11:11:22PM +0000, James Harper wrote:
(it seems that "reply-all" no longer includes luv-main (from ms outlook at least), so I have to include it manually... what's with that?)
who knows? outlook is weird. for list replies, it's better to just reply to the list without CC-ing everyone anyway. i don't care much either way (i have procmail and i'm not afraid to use it :), but some people really dislike getting dupes.
Of course a RAID-1 of SSDs will massively outperform the RAID-5 you have.
If you use SSDs for any sort of intensive storage, do keep an eye on the SMART "media wearout" values, and replace them before the counter hits 0 (or 1).
the only related value i can find on 'smartctl -a' on my 256GB OCZ Vertex is:

233 Remaining_Lifetime_Perc 0x0000 067 067 000 Old_age Offline - 67

I assume that means I've used up about 1/3rd of its expected life. Not bad, considering i've been running it for 500 days total so far:

9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 12005

12005 hours is 500 days, or 1.3 years. and over that time, i've read 17.4TB and written 11.9TB. on a 256GB SSD... equivalent to rewriting the entire drive 46 times, or approx 23GB of writes per day.

198 Host_Reads_GiB 0x0000 100 100 000 Old_age Offline - 17440
199 Host_Writes_GiB 0x0000 100 100 000 Old_age Offline - 11901

I can expect probably another 2.5 years from this SSD or so at my current/historical usage rates. by that time, i'll be more than ready to replace it with a bigger, faster, and cheaper M.2 SSD.

and that's for an OCZ Vertex, one of the last decent drives OCZ made before they started producing crap and went bust (and subsequently got bought by Toshiba, who are now producing decent drives again under the OCZ brand name)..... so relatively old technology compared to modern SSDs. I'd expect a modern Intel or Samsung (or OCZ) to have an even longer lifespan.

according to http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size the 256GB Samsung 850 Pro has an expected lifespan of 70 years with 20GB/day writes or 14 years with 100GB/day writes. The 512GB model doubles that and the 1TB quadruples it.

even if you distrust the published specs and regard them as marketing dept. lies, and discount them by 50% or even 75%, you're still looking at long lives for modern SSDs.... more than long enough to last until the next upgrade cycle for your servers.

So, yes, keep an eye on the "Remaining_Lifetime_Percentage" or "Wear Level Count" or whatever the SMART attribute is called on your particular SSD, but there's no need to worry too much about it unless you're writing 1TB/day or so (and even then it should last around 3.5 years).
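The same arithmetic as a quick shell sketch, using the raw values from the smartctl output above; it assumes bc is installed, and the attribute IDs differ between vendors:

$ hours=12005; writes_gib=11901; remaining=67
$ # average writes per day over the drive's life so far (~23.8 GiB/day)
$ echo "scale=2; $writes_gib / ($hours / 24)" | bc
$ # years of life left if wear continues at the same rate (~2.8 years)
$ echo "scale=2; ($hours / 24 / 365) * $remaining / (100 - $remaining)" | bc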
I'm seeing time-to-replacement of about 12 months on a high-load system where the SSDs are used for a RAID cache (ZFS, Intel RAID controllers, etc).
12 months? how much are you writing to those things each day? BTW, my OCZs are partitioned and used for OS and /home and ZFS L2ARC and ZFS ZIL. i would consider usage to be fairly light, not heavy. the heaviest usage it suffers would be compiling stuff and the regular upgrades of debian sid.
Not particularly relevant to the discussion at hand, but with suggestions of "put in SSD's and all your trouble will go away", it is something you need to consider.
The endurance issues that SSDs suffered in the past are basically gone now.

craig

--
craig sanders <cas@taz.net.au>

On Tue, Jan 19, 2016 at 11:11:22PM +0000, James Harper wrote:
(it seems that "reply-all" no longer includes luv-main (from ms outlook at least), so I have to include it manually... what's with that?)
who knows? outlook is weird.
for list replies, it's better to just reply to the list without CC-ing everyone anyway. i don't care much either way (i have procmail and i'm not afraid to use it :), but some people really dislike getting dupes.
As long as I remember to replace the To: with luv-main each time I reply, I guess it's workable.
Of course a RAID-1 of SSDs will massively outperform the RAID-5 you have.
If you use SSDs for any sort of intensive storage, do keep an eye on the SMART "media wearout" values, and replace them before the counter hits 0 (or 1).
the only related value i can find on 'smartctl -a' on my 256GB OCZ Vertex is:
233 Remaining_Lifetime_Perc 0x0000 067 067 000 Old_age Offline - 67
233 is reported as Media Wearout Indicator on the drives I just checked on a BSD box, so I guess it's the same thing but with a different description for whatever reason.
I assume that means I've used up about 1/3rd of its expected life. Not bad, considering i've been running it for 500 days total so far:
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 12005
12005 hours is 500 days. or 1.3 years.
I just checked the server that burned out the disks pretty quick last time (RAID1 zfs cache, so both went around the same time), and it has 60% remaining after a year or so. As a cache for a fairly large array, it gets a lot of data. I don't have the 198 and 199 values you mentioned so I can't tell. I do have a "total LBA" read/written, but those are ridiculously low, like a few hundred MB, so are probably 32 bit values that have wrapped a few times.
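(For what it's worth: the Total_LBAs raw values typically count 512-byte sectors, so a 32-bit counter wraps every 2^32 x 512 bytes = 2 TiB, which on a heavily written cache device would leave exactly that kind of small-looking residue.)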
and that's for an OCZ Vertex, one of the last decent drives OCZ made before they started producing crap and went bust (and subsequently got bought by Toshiba, who are now producing decent drives again under the OCZ brand name).....so relatively old technology compared to modern SSDs.
I've seen too many OCZ's fail within months of purchase recently, but not enough data points to draw conclusions from. Maybe a bad batch or something? They were all purchased within a month or so of each other, late last year. The failure mode was that the system just can't see the disk, except very occasionally, and then not for long enough to actually boot from.
according to http://www.anandtech.com/show/8239/update-on-samsung-850-pro-endurance-vnand-die-size
the 256GB Samsung 850 Pro has an expected lifespan of 70 years with 20GB/day writes or 14 years with 100GB/day writes.
The 512GB model doubles that and the 1TB quadruple it.
even if you distrust the published specs and regard them as marketing dept. lies, and discount them by 50% or even 75%, you're still looking at long lives for modern SSDs....more than long enough to last until the next upgrade cycle for your servers.
Yep. I just got a 500GB 850 EVO for my laptop and it doesn't have any of the wearout indicators that I can see, but I doubt I'll get anywhere near close to wearing it out before it becomes obsolete.
So, yes, keep an eye on the "Remaining_Lifetime_Percentage" or "Wear Level Count" or whatever the SMART attribute is called on your particular SSD, but there's no need to worry too much about it unless you're writing 1TB/day or so (and even then it should last around 3.5 years).
I'm seeing time-to-replacement of about 12 months on a high-load system where the SSDs are used for a RAID cache (ZFS, Intel RAID controllers, etc).
12 months? how much are you writing to those things each day?
Lots and lots, obviously :) These ones were cache on an Intel RAID controller, so they really got hammered. It's also possible that they weren't really the right model of SSD for what we used them for.

James

On Wed, Jan 20, 2016 at 07:28:38AM +0000, James Harper wrote:
As long as I remember to replace the To: with luv-main each time I reply, I guess it's workable.
that happens even on just plain Replies, too - not just Reply-All? that's weird because the list munges the From: address, so a reply should go to the list.
233 Remaining_Lifetime_Perc 0x0000 067 067 000 Old_age Offline - 67
233 is reported as Media Wearout Indicator on the drives I just checked on a BSD box, so I guess it's the same thing but with a different description for whatever reason.
i dunno if that name comes from the drive itself or from the smartctl software. that could be the difference.
I assume that means I've used up about 1/3rd of its expected life. Not bad, considering i've been running it for 500 days total so far:
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 12005
12005 hours is 500 days. or 1.3 years.
I just checked the server that burned out the disks pretty quick last time (RAID1 zfs cache, so both went around the same time), and it
i suppose read performance is doubled, but there's not really any point in RAIDing L2ARC. it's transient data that gets wiped on boot anyway. better to have two l2arc cache partitions and two ZIL partitions. and not raiding the l2arc should spread the write load over the 2 SSDs and probably increase longevity.

my pair of OCZ drives have mdadm RAID-1 (xfs) for the OS + /home and another 1GB RAID1 (ext4) for /boot, and just partitions for L2ARC and ZIL. zfs mirrors the ZIL (essential for safety, don't want to lose the ZIL if one drive dies!) if you give it two or more block devices anyway, and it uses two or more block devices as independent L2ARCs (so double the capacity).

$ zpool status export -v
  pool: export
 state: ONLINE
  scan: scrub repaired 0 in 4h50m with 0 errors on Sat Jan 16 06:03:30 2016
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
        logs
          sdh7      ONLINE       0     0     0
          sdj7      ONLINE       0     0     0
        cache
          sdh6      ONLINE       0     0     0
          sdj6      ONLINE       0     0     0

errors: No known data errors

this pool is 4 x 1TB. i'll probably replace them later this year with one or two mirrored pairs of 4TB drives. I've gone off RAID-5 and RAID-Z. even with ZIL and L2ARC, performance isn't great, nowhere near what RAID-10 (or two mirrored pairs in zfs-speak) is. like my backup pool.

$ zpool status backup -v
  pool: backup
 state: ONLINE
  scan: scrub repaired 0 in 4h2m with 0 errors on Sat Jan 16 05:15:20 2016
config:

        NAME        STATE     READ WRITE CKSUM
        backup      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors

this pool has the 4 x 4TB Seagate SSHDs i mentioned recently. it stores backups for all machines on my home network.
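As a sketch of how that layout gets attached to the pool, using the partition names from the output above and assuming the partitions already exist:

$ # mirrored ZIL (slog): zfs keeps the intent log redundant across both SSDs
$ zpool add export log mirror sdh7 sdj7
$ # two independent L2ARC devices: no mirroring, so capacity and write load are spread across both
$ zpool add export cache sdh6 sdj6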
and that's for an OCZ Vertex, one of the last decent drives OCZ made before they started producing crap and went bust (and subsequently got
sorry, my mistake. i meant OCZ Vector.

sdh  OCZ-VECTOR_OCZ-0974C023I4P2G1B8
sdj  OCZ-VECTOR_OCZ-8RL5XW08536INH7R
I've seen too many OCZ's fail within months of purchase recently, but not enough data points to draw conclusions from. Maybe a bad batch or something? They were all purchased within a month or so of each other, late last year. The failure mode was that the system just can't see the disk, except very occasionally, and then not for long enough to actually boot from.
i've read that the Toshiba-produced OCZs are pretty good now, so possibly a bad batch. or sounds like you abuse the poor things with too many writes. even so, my next SSD will probably be a Samsung.
Yep. I just got a 500GB 850 EVO for my laptop and it doesn't have any of the wearout indicators that I can see, but I doubt I'll get anywhere near close to wearing it out before it becomes obsolete.
that's not good. i wish disk vendors would stop crippling their SMART implementations and treat it seriously.

craig

--
craig sanders <cas@taz.net.au>

On Wed, Jan 20, 2016 at 07:28:38AM +0000, James Harper wrote:
As long as I remember to replace the To: with luv-main each time I reply, I guess it's workable.
that happens even on just plain Replies, too - not just Reply-All?
that's weird because the list munges the From: address, so a reply should go to the list.
Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for standards violation here...
233 Remaining_Lifetime_Perc 0x0000 067 067 000 Old_age Offline - 67

233 is reported as Media Wearout Indicator on the drives I just checked on a BSD box, so I guess it's the same thing but with a different description for whatever reason.
i dunno if that name comes from the drive itself or from the smartctl software. that could be the difference.
smartctl. It has a vendor database describing what each of the values means, so if the manufacturer of your drives says 233 = "Remaining Lifetime Percent", the manufacturer of my drive says 233 = "Media Wearout Indicator", and the authors of smartmontools were aware of this, then that's what goes in the database and that's what gets reported.
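If the database doesn't know a particular model, smartctl also lets you relabel an attribute from the command line, and the database itself can be refreshed with the update-smart-drivedb script shipped with smartmontools; the device path and the chosen name below are examples only:

$ # report attribute 233 as a 48-bit raw value under an explicit name
$ smartctl -A -v 233,raw48,Media_Wearout_Indicator /dev/sda
$ # fetch the latest drive/attribute definitions
$ update-smart-drivedb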
I assume that means I've used up about 1/3rd of its expected life. Not bad, considering i've been running it for 500 days total so far:
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 12005
12005 hours is 500 days. or 1.3 years.
I just checked the server that burned out the disks pretty quick last time (RAID1 zfs cache, so both went around the same time), and it
i suppose read performance is doubled, but there's not really any point in RAIDing L2ARC. it's transient data that gets wiped on boot anyway. better to have two l2arc cache partitions and two ZIL partitions.
and not raiding the l2arc should spread the write load over the 2 SSDs and probably increase longevity.
<snip>
I've seen too many OCZ's fail within months of purchase recently, but not enough data points to draw conclusions from. Maybe a bad batch or something? They were all purchased within a month or so of each other, late last year. The failure mode was that the system just can't see the disk, except very occasionally, and then not for long enough to actually boot from.
i've read that the Toshiba-produced OCZs are pretty good now, so possibly a bad batch. or sounds like you abuse the poor things with too many writes.
Nah, these particular ones were just in PCs, and were definitely not worn out (on the one occasion where I actually got one to read for a bit, the SMART values were all fine). Servers get SSDs with supercaps :)
even so, my next SSD will probably be a Samsung.
Despite initial reservations (funny how you can easily find bad reports on any brand!) I have been impressed with the performance and longevity of the Samsungs, but I still don't have enough data points.

James

James Harper via luv-main wrote:
On Wed, Jan 20, 2016 at 07:28:38AM +0000, James Harper wrote:
As long as I remember to replace the To: with luv-main each time I reply, I guess it's workable.

that happens even on just plain Replies, too - not just Reply-All?
that's weird because the list munges the From: address, so a reply should go to the list.
Yep. Reply and Reply-All from Outlook 2016. Not sure who to blame for standards violation here...
Well, from SeaMonkey-mail "Reply" would just go to the "Reply-To:" address (James Harper <james@ejbdigital.com.au> in this case). "Reply All" goes to the "Reply-To" address (as above) AND the "To" address (luv-main@luv.asn.au in this case). I am assuming James doesn't want two emails, so I am deleting his direct address!

As far as I can recall SeaMonkey-mail has always behaved this way, and nothing Russel has done has changed this! Does this add anything to the discussion?

regards
Rohan McLeod

On Wed, 20 Jan 2016 08:15:00 PM Craig Sanders via luv-main wrote:
that's weird because the list munges the From: address, so a reply should go to the list.
On the other hand Reply-To: is set back to the original poster. So I guess it depends whether your MUA prefers List-Post: over Reply-To: (which this version of Kmail seems to) or vice versa.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
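For reference, a list-munged message carries headers along these lines; the exact values here are assumed from the addresses quoted earlier in the thread:

From: James Harper via luv-main <luv-main@luv.asn.au>
Reply-To: James Harper <james@ejbdigital.com.au>
To: luv-main@luv.asn.au
List-Post: <mailto:luv-main@luv.asn.au>

An MUA that honours Reply-To: on a plain reply goes back to the author, while one that prefers List-Post: (or has a dedicated reply-to-list command) goes to the list, which matches the behaviour described above.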
participants (4):
- Chris Samuel
- Craig Sanders
- James Harper
- Rohan McLeod