
I've had a few disks fail with uncorrectable read errors just recently, and in the past my process has been that any disk with any sort of error gets discarded and replaced, especially in a server. I did some reading though (see previous emails about SMART vs actual disk failures) and read that simply writing back over those sectors is often enough to clear the error and allow them to be remapped, possibly extending the life of the disk, depending on the cause of the error.

In actual fact, after writing the entire failed disk with /dev/zero the other day, all the SMART attributes are showing a healthy disk - no pending reallocations and no reallocated sectors, yet - so maybe it wrote over the bad sector and determined it was good again without requiring a remap. I'm deliberately using some old hardware to test ceph to see how it behaves in various failure scenarios, and it has been pretty good so far despite 3 failed disks over the few weeks I've been testing.

What can cause these unrecoverable read errors? Is losing power mid-write enough to cause this to happen? Or maybe a knock while writing? I grabbed these 1TB disks out of a few old PC's and NAS's I had lying around the place so their history is entirely uncertain. I definitely can't tell if the errors were already present when I started using ceph on them.

Is Linux MD software smart enough to rewrite a bad sector with good data to clear this type of error (keeping track of error counts to know when to eject the disk from the array)? What about btrfs/zfs? Trickier with something like ceph, where ceph runs on top of a filesystem which isn't itself redundant...

Thanks

James
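PS: for anyone wanting to try the same thing, the zero-fill-and-recheck sequence is roughly the following sketch - sdX is a stand-in for whichever disk is suspect, and the exact SMART attribute names vary a bit between vendors:

# note the relevant SMART attributes beforehand
smartctl -A /dev/sdX | grep -iE 'realloc|pending|uncorrect'

# overwrite the whole disk, forcing the drive to verify (and if necessary remap) every sector
dd if=/dev/zero of=/dev/sdX bs=1M oflag=direct conv=fsync

# run a long self-test, then compare the attributes again
smartctl -t long /dev/sdX
smartctl -A /dev/sdX | grep -iE 'realloc|pending|uncorrect'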

On 14/05/13 16:35, James Harper wrote:
I've had a few disks fail with uncorrectable read errors just recently, and in the past my process is that any disk with any sort of error gets discarded and replaced, especially in a server. I did some reading though (see previous emails about SMART vs actual disk failures) and read that simply writing back over those sectors is often enough to clear the error and allow them to be remapped, possibly extending the life of the disk, depending on the cause of the error.
In actual fact after writing the entire failed disk with /dev/zero the other day, all the SMART attributes are showing a healthy disk - no pending reallocations and no reallocated sectors, yet, so maybe it wrote over the bad sector and determined it was good again without requiring a remap. I'm deliberately using some old hardware to test ceph to see how it behaves in various failure scenarios, and has been pretty good so far despite 3 failed disks over the few weeks I've been testing.
What can cause these unrecoverable read errors? Is losing power mid-write enough to cause this to happen? Or maybe a knock while writing? I grabbed these 1TB disks out of a few old PC's and NAS's I had lying around the place so their history is entirely uncertain. I definitely can't tell if they were already present when I started using ceph on them.
Is Linux MD software smart enough to rewrite a bad sector with good data to clear this type of error (keeping track of error counts to know when to eject the disk from the array)? What about btrfs/zfs? Trickier with something like ceph where ceph runs on top of a filesystem which isn't itself redundant...
A while back, when 4096-byte sectors went native, I had a disk with - I think it said a CRC error - on one sector. The interesting thing was that when I read the sector with "dd conv=noerror" I got 4096 bytes, 7/8 of which was clearly valid directory info (NTFS) and 512 bytes of which were garbage. Go figure. Writing this sector back cleared the read error, but there was a bit of damage to the file system from the 512 bytes of dud info.

Now to add to the strange error messages from drives, I'm getting this one:

[ 317.144766] EXT4-fs (sdb1): error count: 1
[ 317.144777] EXT4-fs (sdb1): initial error at 1345261136: ext4_find_entry:1209: inode 2
[ 317.144785] EXT4-fs (sdb1): last error at 1345261136: ext4_find_entry:1209: inode 2

sdb1 is mounted noatime, and this message turns up around the same time after boot. SMART tests and file system checks pass, so I guess I'll just have to read the entire 1TB+ disk to /dev/null to see if that trips anything useful.
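(For the record, the read-and-rewrite was roughly the following; LBA is a placeholder for the block number from the kernel log, and it has to be expressed in units of the block size passed to dd:)

# read the suspect 4096-byte sector, zero-filling whatever can't be read
dd if=/dev/sdb of=/tmp/sector.bin bs=4096 skip=LBA count=1 conv=noerror,sync

# write it back over the same spot so the drive can re-verify or remap it
dd if=/tmp/sector.bin of=/dev/sdb bs=4096 seek=LBA count=1 oflag=direct conv=fsync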

On Tue, 14 May 2013, James Harper <james.harper@bendigoit.com.au> wrote:
Is Linux MD software smart enough to rewrite a bad sector with good data to clear this type of error (keeping track of error counts to know when to eject the disk from the array)? What about btrfs/zfs? Trickier with something like ceph where ceph runs on top of a filesystem which isn't itself redundant...
I've seen Linux MD keep all disks in the array after a read error was reported on one disk, so presumably it did that as the error wasn't repeated. But you really want ZFS or BTRFS to cover the case where corrupt data is claimed to be good. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, May 14, 2013 at 06:35:26AM +0000, James Harper wrote:
What can cause these unrecoverable read errors? Is losing power mid-write enough to cause this to happen? Or maybe a knock while writing? I grabbed these 1TB disks out of a few old PC's and NAS's I had lying around the place so their history is entirely uncertain. I definitely can't tell if they were already present when I started using ceph on them.
it's best to think of disks as analogue devices pretending to be digital. often they can't read a marginal sector one day and then it's fine again the next day. some sectors come and go like this indefinitely, while others are bad enough that they're remapped and you never have an issue with them again. if the disk as a whole is bad enough then you run out of spare sectors to do remapping with, and the disk is dead. in my experience disks usually become unusable (slow, erratic, hangs drivers etc.) before they run out of spare sectors.

with today's disk capacities this is just what you have to expect, and software needs to be able to deal with it.

silent data corruption is a much, much rarer and nastier problem, and is the motivation behind the checksums in zfs, btrfs, xfs metadata etc.
Is Linux MD software smart enough to rewrite a bad sector with good data to clear this type of error (keeping track of error counts to know when to eject the disk from the array)?
yes.
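most of those rewrites happen during the regular scrub - a sketch, assuming the array is md0:

# trigger a scrub; on a read error md rebuilds the block from the
# other members and writes it back, letting the drive remap the sector if needed
echo check > /sys/block/md0/md/sync_action

# per-member count of read errors md has corrected so far
grep . /sys/block/md0/md/dev-*/errors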
What about btrfs/zfs? Trickier with something like ceph where ceph runs on top of a filesystem which isn't itself redundant...
all raid-like things need to deal with the expected 1-10% of real disk failures a year. depending on how they're implemented they could also turn these soft, recoverable, semi-failing disk scenarios into just more disk fails, or (like md does) try hard to recover the disk and data in-situ by smart re-writing and timeouts. the problem with kicking a disk out at the first simple error is that full rebuilds involve lots of i/o and so are asking for a second failure.

ideally it would be the call of the user to tell the raid-like layer to try hard or to just fail out straight away, depending on the seriousness of the error, current redundancy level, disk characteristics, how valuable the data is, whether i/o is latency sensitive, whether data is backed up, etc., but that does seem quite complicated :-)

as ceph is pitched as being for non-raid devices I would assume ceph must have 'filesystem gone read-only' detection (ie. the fs got a read error from a disk) as well as 'disk/node hung/stopped timeout' detection. these are coarse but probably effective techniques. hopefully they then have something automated to dd /dev/zero over disks (and rebuild fs's and re-add them to the active pool, but on probation), otherwise it'll be a lot of work to track down and do that to each disk manually.

cheers,
robin

Robin Humble writes:
it's best to think of disks as analogue devices pretending to be digital. often they can't read a marginal sector one day and then it's fine again the next day. some sectors come and go like this indefinitely, while others are bad enough that they're remapped and you never have an issue with them again. if the disk as a whole is bad enough then you run out of spare sectors to do remapping with, and the disk is dead. in my experience disks usually become unusable (slow, erratic, hangs drivers etc.) before they run out of spare sectors.
with todays disk capacities this is just what you have to expect and software needs to be able to deal with it.
Am I right in thinking they become slow/erratic/unusable because of the extra time spent seeking back and forth between the original track and the spare track -- or just repeatedly trying to read a not-quite-dead sector?

AIUI the justification for "enterprise" drives is that they're basically the same as normal drives, except their firmware gives up much faster. If they're in an array, that means mdadm can just get on with reading the sector from one of the other disks, reducing the overall latency.

Not that I've ever seen that myself -- I can't justify paying an order of magnitude more for what ought to be a simple sdparm tweak :-/
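(FWIW the knob usually meant here is SCT Error Recovery Control, which smartctl can poke on drives that honour it - many desktop drives simply refuse; sdX is a placeholder:)

# query the drive's current read/write error-recovery timeouts
smartctl -l scterc /dev/sdX

# ask it to give up after 7 seconds (the units are tenths of a second)
smartctl -l scterc,70,70 /dev/sdX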

Robin Humble writes:
it's best to think of disks as analogue devices pretending to be digital. often they can't read a marginal sector one day and then it's fine again the next day. some sectors come and go like this indefinitely, while others are bad enough that they're remapped and you never have an issue with them again. if the disk as a whole is bad enough then you run out of spare sectors to do remapping with, and the disk is dead. in my experience disks usually become unusable (slow, erratic, hangs drivers etc.) before they run out of spare sectors.
with todays disk capacities this is just what you have to expect and software needs to be able to deal with it.
Am I right in thinking they become slow/erratic/unusable because of the extra time sent seeking back and forth between the original track and the spare track -- or just repeatedly trying to read a not-quite-dead sector?
AIUI the justification for "enterprise" drives is they're basically the same as normal drives, except their firmware gives up much faster. If they're in an array, that means mdadm can just get on with reading the sector from one of the other disks, reducing the overall latency.
Not that I've ever seen that myself -- I can't justify paying an order more for what ought to be a simple sdparm tweak :-/
It's more complicated than that. Enterprise drives will be less likely to move the heads out of the way to reduce drag and reduce power consumption by a tiny bit. They are more inclined to automatically spin down when idle too. All those things typically increase wear and tear when the drives are used in an enterprise environment. But you're right in that it's probably mostly just a firmware difference... I wonder if anyone has ever attempted to force an enterprise firmware onto a "green" drive... James

On Tue, 21 May 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
Am I right in thinking they become slow/erratic/unusable because of the extra time sent seeking back and forth between the original track and the spare track -- or just repeatedly trying to read a not-quite-dead sector?
If you look at the contiguous IO performance of a brand new disk (which presumably has few remapped sectors) you will see a lot of variance in read times. The variance is so great that the occasional extra seek for a remapped sector is probably lost in the noise. Also I'd hope that the manufacturers do smart things about remapping. For example they could have reserved tracks at various parts of the disk instead of just reserving one spot and thus giving long seeks for remapped sectors.
AIUI the justification for "enterprise" drives is they're basically the same as normal drives, except their firmware gives up much faster. If they're in an array, that means mdadm can just get on with reading the sector from one of the other disks, reducing the overall latency.
One issue is the level of service that the users expect. If users are happy to accept some lack of performance when a disk is dying then there's less of a downside to "desktop" drives. The "desktop" drives, in addition to being cheaper, also tend to be a lot bigger. Both the capacity and the price make it feasible to use greater levels of redundancy. For example a RAID-Z3 array of "desktop" disks is likely to give greater capacity and lower price than a RAID-5 of "enterprise" disks. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, May 21, 2013 at 01:04:36PM +1000, Russell Coker wrote:
On Tue, 21 May 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
Am I right in thinking they become slow/erratic/unusable because of the extra time sent seeking back and forth between the original track and the spare track
nah, remapped sectors are fine (up until you run out of them). it's the half-there 'sometimes readable' sectors that are evil - I call them heisensectors.

best case for these is a very long delay (or a sequence of short delays from adjacent blocks, each under TLER) that causes a raid layer to kick out the drive. usually when it's kicked out smart says the drive looks pretty good, but it's not, it's insane.

a more usual case is just a lot of short delays, and no matter how many times you dd over the sector (or md rewrites it) it just keeps coming back. annoying but not fatal until it gets worse.

worst case is a scsi driver hang due to a disk that is only half responding, presumably because the drive firmware got confused, but not quite confused/hung enough to cause the firmware to watchdog and reset itself. this happened about once a month. by driver hang I mean at least one sas port was hung (24 disks) and sometimes a scsi host (48 disks) and sometimes all 96 disks.

anyway, every disk is different (the above is re: 1200 seagate 'enterprise' es1 1tb drives) but all disks are basically analogue and basically crazy.
-- or just repeatedly trying to read a not-quite-dead sector? If you look at the contiguous IO performance of a brand new disk (which presumably has few remapped sectors) you will see a lot of variance in read times. The variance is so great that the occasional extra seek for a remapped sector is probably lost in the noise.
ack. in my experience all the layers of kernel caching and readahead and firmware buffering will make the occasional big seek to a remapped sector basically free. I guess you could argue "what if I had one hot file with a remapped sector in it" but if you're always re-reading the same file off disk in a tight loop then you're probably doing something wrong :-)
Also I'd hope that the manufacturers do smart things about remapping. For example they could have reserved tracks at various parts of the disk instead of just reserving one spot and thus giving long seeks for remapped sectors.
IIRC that's one thing that enterprise drives claim to have that consumer drives don't - more spare sectors that are more distributed across the drive and smarter firmware to choose the closest spare.
cheaper also tend to be a lot bigger. Both the capacity and the price make it feasible to use greater levels of redundancy. For example a RAID-Z3 array of "desktop" disks is likely to give greater capacity and lower price than a RAID-5 of "enterprise" disks.
more drives == more watts, which is a growing concern. I'm back to using single large drives (with spun-down or offline backups) at home 'cos I don't like the power usage and noise of lots of raid drives. a few * 3tb is enough - I don't need 24tb always online.

essentially I'm doing raid1 but with very delayed and power-friendly mirroring, and am prepared to do a fair bit of work and/or lose some data when I get unreadable sectors.

cheers,
robin

Robin Humble <rjh+luv@cita.utoronto.ca> writes:
I'm back to using single large drives (with spun-down or offline backups) at home 'cos I don't like the power usage and noise of lots of raid drives. few *3tb is enough - I don't need 24tb always online.
essentially I'm doing raid1 but with very delayed and power-friendly mirroring, and am prepared to do a fair bit of work and/or lose some data when I get unreadable sectors.
Are you doing that manually (e.g. a daily cron job calling rsync or dd), or do you have something like mdadm --write-behind=BIGNUM?

On Thu, May 23, 2013 at 11:50:30AM +1000, Trent W. Buck wrote:
Robin Humble <rjh+luv@cita.utoronto.ca> writes:
I'm back to using single large drives (with spun-down or offline backups) at home 'cos I don't like the power usage and noise of lots of raid drives. few *3tb is enough - I don't need 24tb always online.
essentially I'm doing raid1 but with very delayed and power-friendly mirroring, and am prepared to do a fair bit of work and/or lose some data when I get unreadable sectors.
Are you doing that manually (e.g. a daily cron job calling rsync or dd), or do you have something like mdadm --write-behind=BIGNUM?
nothing clever, just manually. overly manually. every week or so I power on the backup box and run some rsyncs. it'd be possible to automate it (script that runs on power-up) but I haven't bothered.

I actually kinda like a >>1 day gap between backups as it gives me a window to retrieve files that I deleted by mistake. a fraction of a day is a bit short to realise that a file is gone and to look for it in the backups. incremental backups would work better, but so far I've been too lazy to work out a way to filter out the few-GB tv shows that I've watched and deleted and don't want in any incrementals.

BTW even a spun-down external USB disk uses about 1 to 1.5W (I guess there's a little arm chip @idle in there somewhere) so I use one of those for 'nearline', whereas the backups are truly off.

the next power-conserving project is to get tv recording and viewing working from one or 2 or 3 arm boxes (record, nas, view? ~1-5W each) instead of all being done by one big x86 (~70W). alternatively I might rip apart a cheap hdmi x86 laptop for its low-power motherboard (15-30W?), add bigger drives (boot off usb, internal 3tb & dvd, external 3tb usb?), and use that as an all-in-one. sadly, low-power x86 laptop chips in desktop motherboards don't seem to be common. arm is lower power, but intel makes the 1080p hdmi part of the htpc really easy (and I can still use mplayer) whereas it seems a mess and kinda marginal with limited choices of codec and players on all the linux arm trinkets I can find.

cheers,
robin
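ps. the power-on automation would only be a few lines anyway, something like this sketch (hostnames and paths made up):

#!/bin/sh
# run at power-on from rc.local or an init script on the backup box
set -e
mount /backup
rsync -aHAX --delete mainbox:/data/ /backup/data/
umount /backup
poweroff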

On Thu, 23 May 2013, Robin Humble <rjh+luv@cita.utoronto.ca> wrote:
nothing clever, just manually. overly manually. every week or so I power on the backup box and run some rsyncs. it'd be possible to automate it (script that runs on power-up) but I haven't bothered.
I actually kinda like a >>1day gap between backups as it gives me a window to retrieve files that I deleted by mistake. a fraction of a day is a bit short to realise that a file is gone and to look for it in the backups.
http://etbe.coker.com.au/2012/12/17/using-btrfs/

I'm currently using BTRFS snapshots for that sort of thing. On some of my systems I have 100 snapshots stored from 15-minute intervals and another 50 or so stored from daily intervals. The 15-minute snapshots capture the most likely possibilities for creating and accidentally deleting a file. The daily ones cover more long-term mistakes. To cover hardware failure or significant sysadmin mistakes I make backups to USB-attached SATA disks.
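The snapshot side is just a cron job along these lines - a sketch that assumes / is itself a BTRFS subvolume and /snapshots lives on the same filesystem; expiring old snapshots (btrfs subvolume delete) is a separate job:

# /etc/cron.d/btrfs-snap
*/15 * * * * root btrfs subvolume snapshot -r / /snapshots/root-$(date +\%Y\%m\%d-\%H\%M)
0 1 * * *    root btrfs subvolume snapshot -r / /snapshots/daily-$(date +\%Y\%m\%d)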
incremental backups would work better, but so far I've been too lazy to work out a way to filter out fewGB tv shows that I've watched and deleted and don't want in any incrementals.
On a BTRFS or ZFS system you would use a different subvolume/filesystem for the TV shows which doesn't get the snapshot backups. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

-----Original Message-----
From: luv-main-bounces@luv.asn.au [mailto:luv-main-bounces@luv.asn.au] On Behalf Of Russell Coker
Sent: Thursday, 23 May 2013 5:24 PM
To: luv-main@luv.asn.au
Subject: Re: backups & power [was Re: uncorrectable read errors]
On Thu, 23 May 2013, Robin Humble <rjh+luv@cita.utoronto.ca> wrote:
nothing clever, just manually. overly manually. every week or so I power on the backup box and run some rsyncs. it'd be possible to automate it (script that runs on power-up) but I haven't bothered.
I actually kinda like a >>1day gap between backups as it gives me a window to retrieve files that I deleted by mistake. a fraction of a day is a bit short to realise that a file is gone and to look for it in the backups.
http://etbe.coker.com.au/2012/12/17/using-btrfs/
I'm currently using BTRFS snapshots for that sort of thing. On some of my systems I have 100 snapshots stored from 15 minute intervals and another 50 or so stored from daily intervals. The 15 minute intervals capture the most likely possibilities for creating and accidentally deleting a file. The daily once cover more long-term mistakes.
That's pretty neat. I do the same with Windows, but it's nice to see that Linux supports this now too. Windows would not support a 15-minute snapshot interval though - the docs say no more than 1 an hour or something like that. Recovering data under Windows is as simple as right click, then show previous versions, and you select which snapshot you want to look at. Samba can do this too.

How does performance fare with lots of snapshots?

Windows goes with the concept that the snapshot holds the changed data, so a first write becomes read-from-original + write-original-data-to-snapshot-area + write-new-data-to-original[1]. This reduces first-write performance but subsequent writes suffer no penalty, and it means no fragmentation and throwing a snapshot away is instant. I think LVM actually writes the changed data into the snapshot area (it still may require a read from the original if the write isn't exactly the size of an extent) but I can't remember for sure. If so it means the first write is faster but subsequent writes are still redirected to another part of the disk, your data very quickly gets massively fragmented, and recovery in the event of a booboo is a bitch if the lvm metadata goes bad (from experience... I just gave up pretty much immediately and restored from backup when this happened to me[2]!).

How does btrfs do it internally?
incremental backups would work better, but so far I've been too lazy to work out a way to filter out fewGB tv shows that I've watched and deleted and don't want in any incrementals.
On a BTRFS or ZFS system you would use a different subvolume/filesystem for the TV shows which doesn't get the snapshot backups.
I'm getting more and more excited about btrfs. I was looking around at zfs but it didn't end up meeting my needs. I'm still testing ceph; xfs is currently recommended for the backend store, and btrfs is faster but has known issues with ceph (or at least did last time I read the docs) and so is not currently recommended.

James

[1] with the default snapshot provider. Windows can outsource this function to a SAN or whatever else (could even be a Xen backend running zfs/btrfs!!) so obviously ymmv.
[2] when the docs for clustered lvm say you can't use snapshots[3] then you can't use snapshots. Don't think you can. You will be wrong :)
[3] this was years ago. It has changed since then, with some restrictions

On Thu, 23 May 2013, James Harper <james.harper@bendigoit.com.au> wrote:
http://etbe.coker.com.au/2012/12/17/using-btrfs/
I'm currently using BTRFS snapshots for that sort of thing. On some of my systems I have 100 snapshots stored from 15 minute intervals and another 50 or so stored from daily intervals. The 15 minute intervals capture the most likely possibilities for creating and accidentally deleting a file. The daily once cover more long-term mistakes.
That's pretty neat. I do the same with Windows, but it's nice to see that Linux supports this now too. Windows would not support a 15 minutes snapshot interval though - docs say no more than 1 an hour or something like that. Recovering data under windows is as simple as right click then show previous versions and you select which snapshot you want to look at. Samba can do this too.
I believe that Samba can be integrated with various snapshot schemes. Last time I did Google searches for such things I saw some documentation about making Samba work with ZFS snapshots and I presume that BTRFS wouldn't be any more difficult (you could make BTRFS use the same directory names as ZFS for snapshots).
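The Samba side of it is the shadow_copy2 VFS module. A minimal sketch of a share - the path and the snapshot directory/naming here are assumptions and have to match whatever actually creates the snapshots:

[data]
    path = /tank/data
    vfs objects = shadow_copy2
    shadow:snapdir = .snapshots
    shadow:format = @GMT-%Y.%m.%d-%H.%M.%S
    shadow:sort = desc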
How does performance fare with lots of snapshots?
On BTRFS I haven't yet noticed any performance loss during operation. On older versions of BTRFS (such as is included with Debian/Wheezy) snapshot removal can be quite slow. I even once had a server become unusable because BTRFS snapshot removal took all the IO capacity of the system (after spending an hour prodding it I left it alone for 5 minutes and it came good).

On ZFS I also haven't had any problems in operation, although once due to a scripting mistake I ended up with about 1,000 snapshots of each of the two main filesystems. That caused massive performance problems in removing snapshots (operations such as listing snapshots taking many minutes) but came good when it was down below about 300 snapshots.
Windows goes with the concept that the snapshot holds the changed data, so first-write becomes a read-from-original + write-original-data-to-snapshot-area + write-new-data-to-original[1]. This reduces first-write performance but subsequent writes suffer no penalty, and means no fragmentation and throwing a snapshot away is instant. I think LVM actually writes the changed data into the snapshot area (still may require a read from original if the write isn't exactly the size of an extent) but I can't remember for sure. If so it means the first -write is faster but subsequent writes are still redirected to another part of the disk, and your data very quickly gets massively fragmented and recovery in the event of a booboo is a bitch if lvm metadata goes bad (from experience... I just gave up pretty much immediately and restored from backup when this happened to me[2]!).
How does btrfs do it internally?
BTRFS does all writes as copy-on-write. So every time you write the data goes to a different location. Keeping multiple versions just involves having pointers to different blocks on disk.
incremental backups would work better, but so far I've been too lazy to work out a way to filter out fewGB tv shows that I've watched and deleted and don't want in any incrementals.
On a BTRFS or ZFS system you would use a different subvolume/filesystem for the TV shows which doesn't get the snapshot backups.
I'm getting more and more excited about btrfs. I was looking around at zfs but it didn't end up meeting my needs. I'm still testing ceph and xfs is currently recommended for the backend store, btrfs is faster but has known issues with ceph, or at least did last time I read the docs and so is not currently recommended.
What issues would BTRFS have? XFS just provides a regular VFS interface, and BTRFS does that well too. I can imagine software supporting ZFS but not BTRFS if it uses special ZFS features. But I'm not aware of XFS having useful features for a file store that BTRFS lacks. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

How does btrfs do it internally?
BTRFS does all writes as copy-on-write. So every time you write the data goes to a different location. Keeping multiple versions just involves having pointers to different blocks on disk.
What measures are taken to avoid fragmentation?
I'm getting more and more excited about btrfs. I was looking around at zfs but it didn't end up meeting my needs. I'm still testing ceph and xfs is currently recommended for the backend store, btrfs is faster but has known issues with ceph, or at least did last time I read the docs and so is not currently recommended.
What issues would BTRFS have? XFS just provides a regular VFS interface which BTRFS does well. I can imagine software supporting ZFS but not BTRFS if it uses special ZFS features. But I'm not aware of XFS having useful features for a file store that BTRFS lacks.
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

It's not a feature thing, it's a maturity/stability thing. I think ceph encounters a few more corner cases than regular usage might. Or maybe that page is a bit out of date and nobody wants the liability of updating it ;)

James

On Thu, 23 May 2013, James Harper <james.harper@bendigoit.com.au> wrote:
BTRFS does all writes as copy-on-write. So every time you write the data goes to a different location. Keeping multiple versions just involves having pointers to different blocks on disk.
What measures are taken to avoid fragmentation?
There is some work being done on making it automatically defragment. But at the moment not much I think. Avi might be the best LUV member to ask. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
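PS: In the meantime the manual options are defragmentation and exempting particular files from copy-on-write - a sketch, with a made-up path; whether it helps depends heavily on the workload:

# defragment everything under a directory (older btrfs-progs lack a recursive flag)
find /var/lib/libvirt/images -xdev -type f -exec btrfs filesystem defragment {} +

# disable copy-on-write for files created in this directory from now on
chattr +C /var/lib/libvirt/images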

Russell Coker <russell@coker.com.au> writes:
I believe that Samba can be integrated with various snapshot schemes. Last time I did Google searches for such things I saw some documentation about making Samba work with ZFS snapshots and I presume that BTRFS wouldn't be any more difficult (you could make BTRFS use the same directory names as ZFS for snapshots).
It is nearly trivial to export rsnapshot backups as read-only snapshots over samba as "shadow copies", which Windows Explorer can then see natively in the context menu. I can dig up details if anyone's interested.

Russell Coker <russell@coker.com.au> writes:
I believe that Samba can be integrated with various snapshot schemes. Last time I did Google searches for such things I saw some documentation about making Samba work with ZFS snapshots and I presume that BTRFS wouldn't be any more difficult (you could make BTRFS use the same directory names as ZFS for snapshots).
It is nearly trivial to export rsnapshot backups as read-only snapshots over samba as "shadow copies", which Windows Explorer can then see natively in the context menu.
I can dig up details if anyone's interested.
I remember installing one of the Linux NAS distributions (openfiler maybe?) many many years ago and found that it did all of this automatically which impressed me greatly at the time! James

Russell Coker <russell@coker.com.au> wrote:
On Thu, 23 May 2013, Robin Humble <rjh+luv@cita.utoronto.ca> wrote:
nothing clever, just manually. overly manually. every week or so I power on the backup box and run some rsyncs. it'd be possible to automate it (script that runs on power-up) but I haven't bothered.
I have a similar strategy. The backup drive is actually a Btrfs file system created quite a long time ago, when I thought Btrfsck (with the ability to correct errors) was just around the next corner on the development roadmap - an unduly optimistic assumption, as it turned out. In my partial defence, I was interested in the checksums, and possibly also the snapshots, and I've always run btrfsck to check the integrity of the file system after unmounting it following a backup - so if there's a major error I should find out about it immediately after rsync and unmount have completed, not at restore time when it really matters.

It is in fact a rather old BTRFS file system now; I can't remember the creation date though. When was the last on-disk format change requiring re-creation of file systems?
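(The check step itself is just the following, run against the unmounted device - the device path is a stand-in for wherever the backup drive appears:)

umount /mnt/backup
btrfsck /dev/sdX1    # read-only check unless you pass --repair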
I'm currently using BTRFS snapshots for that sort of thing. On some of my systems I have 100 snapshots stored from 15 minute intervals and another 50 or so stored from daily intervals. The 15 minute intervals capture the most likely possibilities for creating and accidentally deleting a file. The daily once cover more long-term mistakes.
That's a convenient and thorough solution. In any directory containing work that I edit and want to keep, I maintain a Git repository. Running git init is easy enough and all that is required to maintain a reasonable history is a little discipline. I also use etckeeper for the same purpose (again with Git as the underlying version control tool). For especially important files (e.g., my PhD thesis and papers that I've written), I can push the repository to a remote machine owned by a trustworthy friend.

You say "One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all it's storage" - when you do this, what fs do you use on domU? (and is that the right "it's"?) Also, does btrfs have error detection or correction without using raid? I guess I should google a bit more :) James

James Harper:
"... for all it's storage" (and is that the right "it's"?)
No, it's wrong. People get confused because it is possessive but there is no apostrophe on a possessive pronoun. The only correct usage of "it's" is as an abbreviation of "it is".

Anders.

Quoting Anders Holmström (anders.sputnik@gmail.com):
James Harper:
"... for all it's storage" (and is that the right "it's"?)
No it's wrong. People get confused because it is possessive but there is no apostrophe on a possessive pronoun. The only correct usage of "it's" is as an abbreviation of "it is".
'See, it's handy to have a publication that knows its it's from its its, isn't it?'
http://linuxmafia.com/pub/humour/a-man-of-letters.html

P.S.: I'm hoping to write a short story called "Charles' Wife Camilla", the saga of a polyandrous family comprising two men named Charle and their wife Camilla.

See also: http://linuxmafia.com/~rick/lexicon.html#edwards

Rick Moen writes:
P.S>: I'm hoping to write a short story called "Charles' Wife Camilla", the saga of a polyandrous family comprising two men named Charle and their wife Camilla.
ISTR E. B. White (might've been Fowler) saying "Charles'" should be "Charles's" unless he was in the Bible. I'm afraid I can't quite be arsed walking across the office to check.

On Wed, Jun 12, 2013 at 03:08:42PM +1000, Trent W. Buck wrote:
Rick Moen writes:
P.S>: I'm hoping to write a short story called "Charles' Wife Camilla", the saga of a polyandrous family comprising two men named Charle and their wife Camilla.
ISTR E. B. White (might've been Fowler) saying "Charles'" should be "Charles's" unless he was in the Bible. I'm afraid I can't quite be arsed walking across the office to check.
you're right. the book of revelations contains very specific grammatical rules for avoiding god's wrath in the apostropocalypse(*). (*) that gloriou's day when all those who mi'suse apostrophe's will burn in hell forever. craig -- craig sanders <cas@taz.net.au>

Craig Sanders <cas@taz.net.au> writes:
you're right. the book of revelations contains very specific grammatical rules for avoiding god's wrath in the apostropocalypse(*).
Sigh, OK, because the alternative is filing timesheets:

POSSESSIVE PUZZLES [Fowler 1e, pp 451]:
| 1. Septimus's, Achilles'. It was
| formerly customary, when a word
| ended in -s, to write its possessive
| with an apostrophe but no addi-
| tional s, e.g. /Mars' hill/, /Venus' Bath/,
| /Achilles' thews/. In verse, & in
| poetic or reverential contexts, this
| custom is retained, & the number
| of syllables is the same as in the
| subjective case, e.g. /Achilles'/ has
| three, not four; /Jesus'/ or /of Jesus/,
| not /Jesus's/. But elsewhere we now
| add the s & the syllable, /Charles's
| Wain/, /St James's/ not /St James'/,
| /Jones's children/, /the Rev. Septimus's
| surplice/, /Pythagoras's doctrines/. For
| /goodness' sake/, /conscience' sake/, &c.,
| see SAKE.

I referred to "in reverential contexts".

Hi All,

In "Re: backups & power [was Re: uncorrectable read errors]", Rick Moen writes:
P.S>: I'm hoping to write a short story called "Charles' Wife Camilla", the saga of a polyandrous family comprising two men named Charle and their wife Camilla.
It's so good to see people understanding the difference between polygamous, polyandrous, and polygynous... and in an IT (rather than alternative lifestyle) forum at that.

Then again, I vaguely recall the appendix of "The Jargon File" (AKA "The Hackers' Dictionary") saying that hackers were more likely than the general public to be highly intelligent, logical, and into alternative lifestyles, psychedelics, leftist politics, etc. Then again, maybe those last 3 things are intelligent and logical?

Cool.

Carl

Please post discussions like this in luv-talk, not luv-main. The luv-main list is specifically for technical discussions of Linux. All the best, -- Lev Lafayette, BA (Hons), MBA, GCertPM mobile: 0432 255 208 RFC 1855 Netiquette Guidelines http://www.ietf.org/rfc/rfc1855.txt

On Thu, 23 May 2013, James Harper <james.harper@bendigoit.com.au> wrote:
You say "One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all it's storage" - when you do this, what fs do you use on domU? (and is that the right "it's"?)
Ext3, Ext4, any filesystem that you like.
Also, does btrfs have error detection or correction without using raid?
By default a BTRFS filesystem will use RAID-1 for metadata on a single device by writing the data to two blocks. So a failure that results in one metadata block becoming corrupt or unreadable will result in the other being read. But if a data block becomes corrupt then you lose. You can configure BTRFS to use RAID-1 for data on a single device in theory at least, but last time I tried it the mkfs program didn't want to do that. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
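PS: For reference, the relevant knobs are the -m (metadata) and -d (data) profile options to mkfs.btrfs; whether a single-device filesystem will accept "-d dup" depends on the version, so treat this as a sketch:

# duplicate metadata only (the usual single-device default on rotational disks)
mkfs.btrfs -m dup -d single /dev/sdX

# duplicate data blocks as well - older mkfs.btrfs versions refuse this on a single device
mkfs.btrfs -m dup -d dup /dev/sdX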

On Thu, May 23, 2013 at 11:01:47AM +0000, James Harper wrote:
You say "One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all it's storage"
note that if you don't have any other particular reasons to use btrfs rather than zfs, then zfs is a better choice for this job.

zfs allows you to create disk volumes (called "zvol") as well as filesystems from the pool, which are similar to a disk partition or an LVM logical volume but with better performance and all the other benefits of being part of a zpool.

http://zfsonlinux.org/example-zvol.html
http://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/

a zvol can also be exported via iscsi, so a VM on a compute node could use a zvol exported from a zfs file-server. could even use 2 or more zvols from different servers and raid them with mdadm (i haven't tried this myself but there's no reason why it shouldn't work - synchronised snapshotting may be problematic, you'd probably want to pause the VM briefly so you can snapshot the zvols on the file servers).

btrfs doesn't yet have a zvol-like feature (i have no idea when or even if it is planned), so the only option there for a KVM or Xen VM is to use a large file as a qcow2 or whatever disk image. the btrfs wiki mentions "btrvols" on the Project Ideas page but it looks like no-one's even working on it yet.

https://btrfs.wiki.kernel.org/index.php/Project_ideas#block_devices_.27btrvo...

and, of course, with container-style VMs, you could use a btrfs subvolume or zfs filesystem.

craig

--
craig sanders <cas@taz.net.au>
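ps. creating a zvol is a one-liner, e.g. (pool and volume names made up):

# sparse 20GB volume in the pool 'tank'; it appears as /dev/zvol/tank/vm1-disk0
zfs create -s -V 20G tank/vm1-disk0

# then treat it like any other block device - partition it, mkfs it, or hand it to a VM
mkfs.ext4 /dev/zvol/tank/vm1-disk0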

On Sat, 25 May 2013, Craig Sanders <cas@taz.net.au> wrote:
On Thu, May 23, 2013 at 11:01:47AM +0000, James Harper wrote:
You say "One is to use a single BTRFS filesystem with RAID-1 for all the storage and then have each VM use a file on that big BTRFS filesystem for all it's storage"
note that if you don't have any other particular reasons to use btrfs rather than zfs, then zfs is a better choice for this job.
I noted in the first paragraph that 4G of RAM for ZFS alone seems inadequate. I had ongoing problems with a server that was running nothing but Samba and NFS file serving with ZFS storage and 4G of RAM that gave repeated kernel errors about memory allocation. The recommended way of solving that problem didn't work, so I upgraded it to 12G of RAM (8G of RAM was a lot cheaper for the client than paying me to figure out the ZFS problem).
zfs allows you to create disk volumes (called "zvol") as well as filesystems from the pool, which are similar to a disk partition or an LVM logical volume but with better performance and all the other benefits of being part of a zpool.
I've got a Xen server that uses ZVols for the DomU block devices. I've been wondering if it really gives a benefit. If I had used a regular ZFS filesystem with large files to contain all the virtual block devices in question then I could have made a snapshot backup of all of them with one command. I've also been moving some of the larger storage (such as mailing list archives) from filesystems on block devices in the VM to NFS mounts of ZFS filesystems, this makes snapshotting a little more useful.
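For what it's worth, if the ZVols all sit under one parent dataset then a recursive snapshot still covers them in one command - a sketch with made-up dataset names:

# snapshot every filesystem and zvol under tank/xen in one atomic operation
zfs snapshot -r tank/xen@nightly

# roll an individual DomU's disk back later if needed
zfs rollback tank/xen/mailserver@nightly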
http://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/
a zvol can also be exported via iscsi, so a VM on a compute node could use a zvol exported from a zfs file-server. could even use 2 or more zvols from different servers and raid them with mdadm (i haven't tried this myself but there's no reason why it shouldn't work - synchronised snapshotting may be problematic, you'd probably want to pause the VM briefly so you can snapshot the zvols on the file servers).
Why would pausing the VM be necessary? -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, May 25, 2013 at 05:54:44PM +1000, Russell Coker wrote:
On Sat, 25 May 2013, Craig Sanders <cas@taz.net.au> wrote:
note that if you don't have any other particular reasons to use btrfs rather than zfs, then zfs is a better choice for this job.
I noted in the first paragraph that 4G of RAM for ZFS alone seems inadequate.
yeah, well, that counts as a particular reason for using btrfs... although tuning the zfs_arc_min and zfs_arc_max module options is worthwhile on a low-RAM zfs server.

e.g. i have the following /etc/modprobe.d/zfs.conf file on my 16GB system... it's a desktop workstation as well as a ZFS fileserver, so I need to limit how much RAM zfs takes:

# use minimum 1GB and maximum of 4GB RAM for ZFS ARC
options zfs zfs_arc_min=1073741824 zfs_arc_max=4294967296

does btrfs use significantly less RAM than zfs? i suppose it would, as it uses the linux cache whereas ZFS has its separate ARC.
[...problems with only 4GB RAM...] so I upgraded it to 12G of RAM (8G of RAM was a lot cheaper for the client than paying me to figure out the ZFS problem).
yep, adding RAM is a cheap and easy fix.
I've got a Xen server that uses ZVols for the DomU block devices. I've been wondering if it really gives a benefit.
in my experience, qcow2 files are slow. and especially slow over NFS. if shared storage for live migration isn't important, it would be worthwhile doing some benchmarking of zvol vs qcow on zfs vs qcow on btrfs.
a zvol can also be exported via iscsi, so a VM on a compute node could use a zvol exported from a zfs file-server. could even use 2 or more zvols from different servers and raid them with mdadm (i haven't tried this myself but there's no reason why it shouldn't work - synchronised snapshotting may be problematic, you'd probably want to pause the VM briefly so you can snapshot the zvols on the file servers).
Why would pausing the VM be necessary?
it's not, as a general rule.

i was speculating that in the case of an mdadm raid array of iscsi zvols, it's possible the snapshots of the zvols on different servers could be different - it would be almost impossible to guarantee that the snapshots would run at exactly the same time. whether that's actually important or not, I don't know - but it doesn't sound like a desirable thing to happen.

if the VM is paused briefly, that would prevent the VM from writing to the raid array while it was being snapshotted. e.g. 'virsh suspend <domain>', snapshot on the zfs servers, followed by 'virsh resume <domain>' - similar to what happens when you 'virsh migrate' a VM. zfs snapshots are fast, so the VM would only pause for a matter of seconds or perhaps even less than a second.

i really ought to setup iscsi on my home zfs servers and experiment with this... i'll put it on my TODO list.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #434: Please state the nature of the technical emergency

On Sat, May 25, 2013 at 06:27:22PM +1000, Craig Sanders wrote:
i really ought to setup iscsi on my home zfs servers and experiment with this...i'll put it on my TODO list.
anyone have any recommendations for iscsi? especially in a zfsonlinux context. or pointers to good HOWTOs?

AFAICT, there seem to be several alternatives available. these seem to be the main ones packaged for debian:

iscsitarget - iSCSI Enterprise Target userland tools
istgt - iSCSI userspace target daemon for Unix-like operating systems
lio-utils - configuration tool for LIO core target
open-iscsi - High performance, transport independent iSCSI implementation
tgt - Linux SCSI target user-space tools

and i've also read people recommending stgt and scst for performance reasons, but they don't seem to be packaged. if they're significantly better, I don't mind that but would prefer not to have to maintain local packages for them myself.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #187: Reformatting Page. Wait...

[...problems with only 4GB RAM...] so I upgraded it to 12G of RAM (8G of RAM was a lot cheaper for the client than paying me to figure out the ZFS problem).
yep, adding RAM is a cheap and easy fix.
Except that if you change all your VMs from 4G to 8G it means you can now run half as many... most of my VMs are IO and memory bound. James

On Sat, May 25, 2013 at 09:29:37AM +0000, James Harper wrote:
[...problems with only 4GB RAM...] so I upgraded it to 12G of RAM (8G of RAM was a lot cheaper for the client than paying me to figure out the ZFS problem).
yep, adding RAM is a cheap and easy fix.
Except that if you change all your VM's from 4G to 8G it means you can now run half as many... most of my VM's are IO and memory bound.
i thought Russell was talking about RAM on the zfs server, not on the VM? craig -- craig sanders <cas@taz.net.au> BOFH excuse #132: SCSI Chain overterminated

On Sat, 25 May 2013, Craig Sanders <cas@taz.net.au> wrote:
e.g. i have the following /etc/modprobe.d/zfs.conf file on my 16GB system....it's a desktop workstation as well as a ZFS fileserver, so I need to limit how much RAM zfs takes.
# use minimum 1GB and maxmum of 4GB RAM for ZFS ARC options zfs zfs_arc_min=1073741824 zfs_arc_max=4294967296
options zfs zfs_arc_max=536870912

I just checked the system in question, it still has the above in the modules configuration from my last tests. "free" reports that 9G of RAM is used as cache so things seem to be getting cached anyway.
does btrfs use significantly less RAM than zfs? i suppose it would, as it uses the linux cache whereas ZFS has its separate ARC.
Yes. On any sort of modern system you won't notice a memory use impact of it. One Xen DomU has 192M of RAM assigned to it and BTRFS memory use isn't a problem. The system in question doesn't have serious load (it's used as a box I can ssh to to test other systems and for occasional OpenVPN use) and it could be that it gives less performance because of BTRFS. But the fact that it works at all sets it apart from ZFS. Also note that dpkg calls sync() a lot and thus gives poor performance when installing packages on BTRFS. As an aside, I'm proud to have filed the bug report against dpkg which led to this.
I've got a Xen server that uses ZVols for the DomU block devices. I've been wondering if it really gives a benefit.
in my experience, qcow2 files are slow. and especially slow over NFS.
I'm using a "file:" target in Xen for swap on some DomUs. That hasn't been a problem but then I have enough RAM to not swap much.
if shared storage for live migration isn't important, it would be worthwhile doing some benchmarking of zvol vs qcow on zfs vs qcow on btrfs.
Yes, that sounds like a good idea. I recently got a quad-core system with 8G of RAM from e-waste so I should do some benchmarks on such things.
i was speculating that in the case of an mdadm raid array of iscsi zvols, it's possible the snapshots of the zvols on different servers could be different - it would be almost impossible to guarantee that the snapshots would run at exactly the same time.
whether that's actually important or not, I don't know - but it doesn't sound like a desirable thing to happen.
RAID arrays are designed to be able to handle a device dropping out. If you are going to have members of the RAID array on different systems then by design you have a much greater risk of this than usual. If snapshots HAVE to be run at the same time then you'll probably have other problems. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Sat, May 25, 2013 at 08:12:31PM +1000, Russell Coker wrote:
options zfs zfs_arc_max=536870912
I just checked the system in question, it still has the above in the modules configuration from my last tests. "free" reports that 9G of RAM is used as cache so things seem to be getting cached anyway.
yeah, but it'll be caching everything *except* for ZFS. zfsonlinux doesn't use linux's own caching, it uses ARC and L2ARC. it's not additional to linux's caching, it's instead of.

this is on the TODO list to change in the future, but I don't think anyone's even working on it right now. lower priority than lots of other tasks.
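you can see how much RAM the ARC is actually using from the spl kstats, e.g.:

# current ARC size and its configured maximum, in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats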
does btrfs use significantly less RAM than zfs? i suppose it would, as it uses the linux cache whereas ZFS has its separate ARC.
Yes. On any sort of modern system you won't notice a memory use impact of it.
One Xen DomU has 192M of RAM assigned to it and BTRFS memory use isn't a problem.
ok, cool.
The system in question doesn't have serious load (it's used as a box I can ssh to to test other systems and for occasional OpenVPN use) and it could be that it gives less performance because of BTRFS. But the fact that it works at all sets it apart from ZFS.
yeah, well, except in special circumstances (like a file-server where i'm going to be throwing lots of RAM at the VM anyway), i wouldn't use ZFS as a VM's filesystem. i'd use ext4 or xfs on a zvol.
Also note that dpkg calls sync() a lot and thus gives poor performance when installing packages on BTRFS. As an aside, I'm proud to have filed the bug report against dpkg which led to this.
dpkg's sync() fetish gives pretty poor performance on most filesystems.

for an exciting time:

# apt-get install eatmydata
# ln -s /usr/bin/eatmydata /usr/local/sbin/dpkg

(i used to do this on my home system with an xfs / before i got an SSD. dangerous, but *much* faster. YMMV but i never suffered any data loss or corruption of dpkg's status file etc from it - low risk of a problem but potentially catastrophic if power-failure/crash/whatever does occur)
If snapshots HAVE to be run at the same time then you'll probably have other problems.
true. as i said, it was speculation. of the 'what could possibly go wrong?' kind :) and i won't know whether it is a problem or not until i try it. craig -- craig sanders <cas@taz.net.au>

On Sat, 25 May 2013, Craig Sanders <cas@taz.net.au> wrote:
Except that if you change all your VM's from 4G to 8G it means you can now run half as many... most of my VM's are IO and memory bound.
i thought Russell was talking about RAM on the zfs server, not on the VM?
The Xen server that I run on ZFS now has a lot of RAM spare in the Dom0 to avoid ZFS problems. It's fortunate that RAM is cheap, that I was able to arrange to have more purchased than might otherwise be needed, and that the system doesn't need that much RAM for DomUs. But it is a bit of an annoyance.

On Sat, 25 May 2013, Craig Sanders <cas@taz.net.au> wrote:
On Sat, May 25, 2013 at 08:12:31PM +1000, Russell Coker wrote:
options zfs zfs_arc_max=536870912
I just checked the system in question, it still has the above in the modules configuration from my last tests. "free" reports that 9G of RAM is used as cache so things seem to be getting cached anyway.
yeah, but it'll be caching everything *except* for ZFS. zfsonlinux doesn't use linux's own caching, it uses ARC and L2ARC. it's not additional to linux's caching, it's instead of.
The system in question has 6.2G of files on the root Ext4 filesystem, a lot of storage mounted as NFS from a NAS, and a ZFS pool. I believe that NFS doesn't get cached for long so the fairly small amount of use of the NFS mount makes it seem that a good portion of the now 10.5G of RAM that is reported as "cached" by free would have to be used by ZFS. There's nothing else for it to be used for.
The system in question doesn't have serious load (it's used as a box I can ssh to to test other systems and for occasional OpenVPN use) and it could be that it gives less performance because of BTRFS. But the fact that it works at all sets it apart from ZFS.
yeah, well, execpt in special circumstances (like a file-server where i'm going to be throwing lots of RAM at the VM anyway), i wouldn't use ZFS as a VM's filesystem. i'd use ext4 or xfs on a zvol.
I'd use NFS root if I wasn't running SE Linux on the DomUs.
Also note that dpkg calls sync() a lot and thus gives poor performance when installing packages on BTRFS. As an aside, I'm proud to have filed the bug report against dpkg which led to this.
dpkg's sync() fetish gives pretty poor performance on most filesystems.
for an exciting time:
# apt-get install eatmydata # ln -s /usr/bin/eatmydata /usr/local/sbin/dpkg
(i used to do this on my home system with an xfs / before i got an SSD. dangerous, but *much* faster. YMMV but i never suffered any data loss or corruption of dpkg's status file etc from it - low risk of a problem but potentially catastrophic if power-failure/crash/whatever does occur)
dpkg was apparently working most of the time for almost everyone before I decided to reproduce a bug I found in the SLES version of RPM and file a bug report about it. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Craig Sanders <cas@taz.net.au> writes:
dpkg's sync() fetish gives pretty poor performance on most filesystems.
For maximum fun, combine with collectd running unbuffered (i.e. default config). The mass of random seeks combined with dpkg's syncs were enough to bring my server to its knees. The dpkg workarounds have been mentioned; the collectd one is this. Note that it doesn't actually kick in until collectd has been running for half an hour.

# On all collectd nodes (default is 10s)
Interval 60

# On the collectd server
LoadPlugin rrdtool
<Plugin rrdtool>
    DataDir "/var/lib/collectd/rrd"
    # Instead of reducing the poll interval, increase the *write*
    # interval. Write any given RRD once every 30min, and randomly
    # distribute those across the entire 30min window (30±15min).
    CacheTimeout 1800
    RandomTimeout 900
</Plugin>
(i used to do this on my home system with an xfs / before i got an SSD. dangerous, but *much* faster. YMMV but i never suffered any data loss or corruption of dpkg's status file etc from it - low risk of a problem but potentially catastrophic if power-failure/crash/whatever does occur)
dpkg also works much faster when the root filesystem is on a tmpfs :-) (I've been building a lot of live images lately; one buildhost has enough RAM I can build in /tmp instead of /var/tmp.)

Russell Coker <russell@coker.com.au> writes:
Also note that dpkg calls sync() a lot and thus gives poor performance when installing packages on BTRFS. As an aside, I'm proud to have filed the bug report against dpkg which led to this.
FTR, note that the way dpkg calls sync changed significantly between Debian 6 and 7 -- Ubuntu 10.04 was partway through that transition and uses a third method that was briefly tried, then abandoned. This happened because ext4's default write delay was a lot longer than ext3's default, which led to data loss in unusual cases, and the increased syncing dpkg added to deal with that ran into serious performance issues on btrfs.

In Debian 7 you can add force-unsafe-io to dpkg.cfg to turn off some of these syncs; using eatmydata (an LD_PRELOAD wrapper that noops sync syscalls) makes it slightly faster still, and works on Debian 6 and Ubuntu 10.04.

http://bugs.debian.org/430958 is the bug you're referring to?
http://bugs.debian.org/575891 is mine :-)
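Concretely - the fragment filename is arbitrary:

# Debian 7: skip dpkg's per-file fsync()s permanently
echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/force-unsafe-io

# Debian 6 / Ubuntu 10.04: wrap the operation in eatmydata instead
eatmydata apt-get -y dist-upgrade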

On Mon, 27 May 2013, "Trent W. Buck" <trentbuck@gmail.com> wrote:
This happened because ext4's default write delay was a lot longer than ext3's default, which led to data loss in unusual cases, and the increased syncing dpkg added to deal with that ran into serious performance issues on btrfs.
My bug report that you reference only mentions Ext4. But when I found the bug in rpm I was using XFS. My main concern at the time was with the way XFS works, but I didn't want people to get too focussed on one filesystem; I was afraid someone would say "just don't use XFS for the root fs then".
In Debian 7 you can add force-unsafe-io to dpkg.cfg to turn off some of these syncs; using eatmydata (an LD_PRELOAD wrapper that noops sync syscalls) makes it slightly faster still, and works on Debian 6 and Ubuntu 10.04.
http://bugs.debian.org/430958 is the bug you're referring to? http://bugs.debian.org/575891 is mine :-)
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

Craig Sanders <cas@taz.net.au> writes:
i was speculating that in the case of an mdadm raid array of iscsi zvols, it's possible the snapshots of the zvols on different servers could be different - it would be almost impossible to guarantee that the snapshots would run at exactly the same time.
Are you actually doing that? I once tried mdadm with remote drives, to see if it was a viable alternative to drbd. It fell over at the slightest transient network outage. Unless your network is invincible, I'm not sure this is a case worth thinking about.
if the VM is paused briefly, that would prevent the VM from writing to the raid array while it was being snapshotted.
Simply pausing the VM might not quiesce it -- it'd depend on whether a pause also implies a flush to disk. For example, lxc freeze basically just sends a SIGSTOP to all procs in the group.
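If a real flush is wanted rather than just a stop, one option (not something anyone in this thread is doing -- just a sketch, and note that freezing / can wedge the guest if anything it needs lives on the frozen fs) is to freeze the guest's filesystems from inside the guest around the snapshot:

# inside the VM, immediately before the host snapshots the zvols
fsfreeze -f /srv
# ... host takes its snapshots ...
fsfreeze -u /srv

fsfreeze flushes dirty data and blocks new writes until thawed, so the snapshot sees a consistent filesystem rather than whatever happened to still be in the page cache (/srv is just an example mountpoint).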

On Thu, May 23, 2013 at 01:27:50AM -0400, Robin Humble wrote:
the next power conserving project is to get tv recording and viewing working from one or 2 or 3 arm boxes (record, nas, view? ~1-5W each) instead of all being done by one big x86 (~70W). alternatively I might rip apart a cheap hdmi x86 laptop for its low power motherboard (15-30W?), add bigger drives (boot off usb, internal 3tb & dvd, external 3tb usb?), and use that as an all-in-one. sadly low power x86 laptop chips in desktop motherboards doesn't seem to be common.
you might want to look into Intel Atom or AMD Fusion on mini-ITX or micro-ITX motherboards - they do use more power than an ARM but there are some very nicely featured motherboards and cases for them. generally a PCI and/or PCI-e slot, built-in NIC and wifi (often on a mini-PCI-e slot), standard DDR-3 RAM, SATA, USB, etc. both support hibernate and suspend to ram and/or disk too.

IMO, the only reason to use a laptop for this kind of job is that the battery is effectively a nice UPS for it....but you'd lose that if you pulled it to pieces. laptop hardware tends to be weirder (as in non-standard) than ITX motherboards, some are seriously deranged....functional, but deranged.

craig

--
craig sanders <cas@taz.net.au>

BOFH excuse #405: Sysadmins unavailable because they are in a meeting talking about why they are unavailable so much.

Craig Sanders <cas@taz.net.au> writes:
On Thu, May 23, 2013 at 01:27:50AM -0400, Robin Humble wrote:
sadly low power x86 laptop chips in desktop motherboards doesn't seem to be common.
you might want to look into Intel Atom or AMD Fusion on mini-ITX or micro-ITX motherboards - they do use more power than an ARM but there are some very nicely featured motherboards and cases for them.
Might be worth waiting a bit -- the Haswell ULT stuff looks very sexy. OTOH, that's geared for tablets and ultrabooks, so it might not retail in a standard form factor BGA motherboard like current Atom.

| With its long term viability threatened, Haswell is the first step of
| a long term solution to the ARM problem.  While Atom was the first
| "fast-enough" x86 micro-architecture from Intel, Haswell takes a
| different approach to the problem.  Rather than working from the bottom
| up, Haswell is Intel's attempt to take its best micro-architecture and
| drive power as low as possible.
    -- http://www.anandtech.com/print/6355/intels-haswell-architecture

On 21/05/13 10:30, Trent W. Buck wrote:
Robin Humble writes:
it's best to think of disks as analogue devices pretending to be digital. often they can't read a marginal sector one day and then it's fine again the next day. some sectors come and go like this indefinitely, while others are bad enough that they're remapped and you never have an issue with them again. if the disk as a whole is bad enough then you run out of spare sectors to do remapping with, and the disk is dead. in my experience disks usually become unusable (slow, erratic, hangs drivers etc.) before they run out of spare sectors.
with todays disk capacities this is just what you have to expect and software needs to be able to deal with it.
Am I right in thinking they become slow/erratic/unusable because of the extra time spent seeking back and forth between the original track and the spare track -- or just repeatedly trying to read a not-quite-dead sector?
AIUI the justification for "enterprise" drives is they're basically the same as normal drives, except their firmware gives up much faster. If they're in an array, that means mdadm can just get on with reading the sector from one of the other disks, reducing the overall latency.
Not that I've ever seen that myself -- I can't justify paying an order of magnitude more for what ought to be a simple sdparm tweak :-/
Well, a decade or two ago there was a large price difference between ATA and SCSI drives - more than the cost of the interface board. An acquaintance of mine chanced upon some Seagate bods while waiting at the Hong Kong airport and queried them on this exact point. Their explanation was that they did vibration and bearing noise tests, and the units that topped the class became SCSI (or "enterprise" these days) and the rest we poor sods got. There is good engineering justification for this sorting strategy. Of course that says nothing about the level of lubrication of the platters - "white worms".
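For what it's worth, on drives that implement SCT Error Recovery Control the "gives up much faster" behaviour really is a software-settable knob -- a sketch using smartctl rather than sdparm, assuming the drive supports SCT ERC at all (many desktop drives don't, or silently ignore it; /dev/sda is just an example device):

# show current read/write recovery timeouts (units are tenths of a second)
smartctl -l scterc /dev/sda
# cap both at 7 seconds, roughly what TLER/"enterprise" firmware ships with
smartctl -l scterc,70,70 /dev/sda

The setting usually doesn't survive a power cycle, so it needs re-applying at boot.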

On Wed, 22 May 2013, Allan Duncan <amd2345@fastmail.com.au> wrote:
acquaintance of mine chanced upon some Seagate bods while waiting at the Hong Kong airport and queried them on this exact point. Their explanation was that they did vibration and bearing noise tests, and the units that topped the class became SCSI (or "enterprise" these days) and the rest we poor sods got. There is good engineering justification for this sorting strategy.
The difference in noise between rack mount server systems and desktop systems is significant. It's no big deal to spend all day in a room with a dozen desktop PCs, but being in a room with a single 1RU server for a few minutes is really unpleasant. The noise you hear reflects the vibration that the inside of the server experiences, so drives that don't cope well with vibration will fail in servers even though they could work perfectly in a desktop PC. I'm aware of one instance where some disks worked in one server but not in another server of the same make and model, due to slight differences in vibration from the cooling fans.

So it's a good idea to pay extra for disks that can handle vibration if they are going to run in a vibrating server, but there's no point paying extra for that if the disks in question are going to run in a nice quiet desktop system.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
participants (11):
- Allan Duncan
- Anders Holmström
- Carl Turney
- Craig Sanders
- James Harper
- Jason White
- Lev Lafayette
- Rick Moen
- Robin Humble
- Russell Coker
- trentbuck@gmail.com