
Wondering if anyone has any comments. A home system has N 1.5TB drives running in RAID5. At one point, 1.5TB drives stopped being available, so the last time I extended the array I used a 2TB drive. That drive is partitioned as:

/dev/sdg1 2048 976735943 488366948 8e Linux LVM
/dev/sdg2 976735944 3907029160 1465146608+ fd Linux RAID autodetect

The LVM system in question holds my root/home, etc. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remaining 500GB. My first thought is to set up a RAID1 array and put my PV on there. Is there a better plan? Are there going to be any issues, such as booting? I presume grub2 will be fine, and will just add an 'insmod raid' before the 'insmod lvm' line in my grub.cfg.

As for the procedure, my vague plan is:

sfdisk -d /dev/sdg | sfdisk /dev/sdh
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdh1 missing
pvcreate /dev/md1
vgextend system /dev/md1
pvmove -v /dev/sdg1
pvremove /dev/sdg1
mdadm --manage /dev/md1 --add /dev/sdg1
mount /, chroot into it, run grub-install

Can anyone see any flaws in logic, places to improve?

cheers, / Brett
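To flesh out that last step, a minimal sketch of the mount/chroot/grub-install dance on a Debian-style system, assuming the root LV is named system-root and that grub goes onto both 2TB drives so either can boot (the LV name, the mount points and the choice of target disks are assumptions, not taken from the plan above):

mount /dev/mapper/system-root /mnt
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt
grub-install /dev/sdg
grub-install /dev/sdh
update-grub    # regenerate grub.cfg and check the raid/lvm insmod lines come out as expected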

On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N 1.5TB drives running in RAID5. At one point, 1.5TB drives stopped being available, so the last time I extended the array I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remaining 500GB. Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old. Consider buying some more 2TB disks (at $125 a pop they're not dear), and then building a new array. This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files. -Toby

On Thu, Apr 5, 2012 at 3:17 PM, Toby Corkindale < toby.corkindale@strategicdata.com.au> wrote:
On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N 1.5TB drives running in RAID5. At one point, 1.5TB drives stopped being available, so the last time I extended the array I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remaining 500GB. Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old.
Actually only one disk has failed. The other 2TB drive in there is because when I last extended the array, 1.5TB drives weren't available.
Consider buying some more 2TB disks (at $125 a pop they're not dear), and then building a new array.
The issues are:

1) That doesn't provide enough of a jump over my 1.5TB drives, capacity-wise, to make it worth it.

2) No space in the machine, and not enough SATA ports, to do this. The array has just over 7TB of data, so to do this with 2TB drives I'd need at least 5. That would make a total of 12 drives in the machine: more than it can hold, and more than my 8 SATA ports will be happy with.

If anything, I'll contemplate doing this with 3TB drives, once they drop in price enough.
This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files.
I've been pretty happy with XFS on mdadm RAID5. Not sure if I'd feel safe moving to ZFS yet. cheers, / Brett

On Thu, Apr 05, 2012 at 03:26:55PM +1000, Brett Pemberton wrote:
1) That doesn't provide enough of a jump over my 1.5TB drives, capacity wise, to make it worth it.
yeah, as you say 3TB drives are too expensive at the moment.
2) No space in the machine/enough sata ports to do this.
8-port IBM M1015 SAS cards (a rebadged LSI 9220-8i, with 8 SAS/SATA 6Gbps ports) can be had for $US70 on ebay (plus $30 postage per order). that's about $10/port. The cards are also easily flashed to IT (Initiator Target) mode for improved software raid compatibility.

http://www.servethehome.com/ibm-m1015-part-1-started-lsi-92208i/
http://www.ebay.com.au/itm/IBM-ServeRAID-M1015-SAS-RAID-Controller-FRU-46M08...

that ebay item page implies that the $30 shipping is per card, but when i asked them about it, they said they would combine shipping for multiple cards. the cards are also missing the back-panel bracket, and need to be fitted with either a full-height or low-profile bracket (a couple of bucks each, or scavenged from an old or dead card).
The array has just over 7TB of data, so to do this with 2TB drives, I'd need 5 at the least.
hang the new drives outside the case during the transfer (propping them up on individual cardboard boxes or similar for airflow and pointing a big fan at them is probably not a bad idea), move them into the case afterwards and re-purpose the old drives. fs-level compression would probably bring that down to 4 or 5TB unless it's mostly non-compressible data like videos.
Which would make a total of 12 drives in the machine. More than it can hold, and more than my 8 sata ports will be happy with. If anything, I'll contemplate doing this with 3TB drives, once they drop in price enough.
yeah, i'm waiting for 3TB drives to get around the $100 mark before i upgrade my zpools. they were slowly heading in that direction before the thailand floods last year but have now stabilised at around $200. maybe in a year or so.

MSY has WD Green 3TB for $195, and Seagate 3TB (barracuda, i think) for $219. WD Green drives are OK, but be wary of TLER issues with a raid card in JBOD mode rather than IT mode.

4x$219 = $876 for 9TB raid5/raidz-1 storage vs 5x$120 = $600 for 8TB r5/rz1 storage. might be worth considering when 3TB drives get down to $150... but i think i'll still wait for $100.

hmmm... with an 8-port IBM M1015 as above, 8x$120 = $960 for either 14TB r5/rz1 or 12TB raid-6/raid-z2 storage.
This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files.
+1

iirc i started using btrfs early last year, and then switched to zfs in the last half of the year. love it - it's exactly what i've always wanted for disk and filesystem management. i know i built my backup pool in Sep. dunno for sure when i first built my 'export' pool but it was a few weeks before then (i destroyed and recreated it after creating the backup pool).

# zpool history backup | head
2011-09-27.12:35:13 zpool create -f -o ashift=12 backup raidz scsi-SATA_ST31000528AS_6VP3FWAG scsi-SATA_ST31000528AS_9VP4RPXK scsi-SATA_ST31000528AS_9VP509T5 scsi-SATA_ST31000528AS_9VP4P4LN
2011-09-27.12:37:41 zfs receive -v backup/asterisk
2011-09-27.12:37:53 zfs set mountpoint=/backup/hosts/asterisk backup/asterisk
2011-09-27.12:39:49 zfs receive -v backup/hanuman
2011-09-27.12:40:21 zfs set mountpoint=/backup/hosts/hanuman backup/hanuman
2011-09-27.12:55:15 zfs receive -v backup/kali
2011-09-27.12:59:57 zfs set mountpoint=/backup/hosts/kali backup/kali
2011-09-27.13:41:36 zfs receive -v backup/indra
2011-09-27.13:43:04 zfs set mountpoint=/backup/hosts/indra backup/indra

# zpool history export | head
History for 'export':
2011-10-01.09:26:43 zpool create -o ashift=12 -f export -m /exp raidz scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2114122 scsi-SATA_WDC_WD10EACS-00_WD-WCASJ2195141 scsi-SATA_WDC_WD10EARS-00_WD-WMAV50817803 scsi-SATA_WDC_WD10EARS-00_WD-WMAV50933036
2011-10-01.09:27:21 zfs create export/home
2011-10-01.09:27:49 zfs set compression=on export
2011-10-02.09:55:31 zpool add export cache scsi-SATA_Patriot_Torqx_278BF0715010800025492-part7
2011-10-02.09:55:45 zpool add export log scsi-SATA_Patriot_Torqx_278BF0715010800025492-part6
2011-10-02.22:55:47 zfs create export/src
2011-10-02.23:03:44 zfs create export/ftp
2011-10-02.23:03:57 zfs set compression=off export/ftp
2011-10-02.23:04:24 zfs set atime=off export/ftp
I've been pretty happy with XFS on mdadm RAID5. Not sure if I'd feel safe moving to ZFS yet.
zfsonlinux is quite stable. I've only had a few minor problems in the six+ months i've been using it(*), and nothing even remotely resembling data loss. i trust it a LOT more than i ever trusted btrfs.

according to my zpool history, i've had one WD 1TB drive die in my "export" pool, easily replaced with a seagate 1TB. and then that seagate died about three weeks later and i replaced it with another one. no disk deaths since then.

2011-11-28.19:26:00 zpool replace export scsi-SATA_WDC_WD10EARS-00_WD-WMAV50933036 scsi-SATA_ST31000528AS_9VP16X03
[...]
2011-12-19.09:10:12 zpool replace export scsi-SATA_ST31000528AS_9VP16X03 scsi-SATA_ST31000528AS_9VP18CCV

(neat, i just realised i finally have real data on how often drives die on me and how often i have to replace them, rather than vague recollections)

(hmmm... given what i've learnt about JBOD and TLER and my LSI 8-port card since then, it's possible that the WD and the 1st seagate aren't actually dead, they just got booted by the LSI card. i haven't got around to flashing the card to IT mode because the dos flash program doesn't like my fancy modern motherboard; i'll have to pull the card from the system and flash it in an older machine)

some of ZFS's nicer features:

* disk, pool, volume, filesystem and snapshot management
* much simpler management tools than lvm + mdadm + mkfs
* extremely lightweight fs & subvolume creation. ditto for snapshots.
* optional compression of individual filesystems and zvols
* size limits on created filesystems and volumes are more like an easily-changed quota than, say, increasing the size of an lv on LVM. e.g.

  zfs create -V 5G poolname/volname

  oops, i meant to make that 10G:

  zfs set volsize=10G poolname/volname

  BTW, both of those commands are effectively instant, a second or so. i can't recall if a VM running off that volume would recognise the size change immediately (or with partprobe) or if i would have to reboot it before i could repartition it and resize the VM's fs.

* can use an SSD as L2ARC (read cache) and/or for the ZIL, the ZFS Intent Log (a random-write cache; better than a battery-backed nv cache for a hw raid card)
* error detection and correction
* 'zfs send snapshotname | ssh remotehost zfs receive ...' - zfs knows which blocks have changed in the snapshot, so an incremental zfs send | zfs recv is faster and less load than rsync.
* audit trail / history log of actions. useful to know when you did something, and also as a reminder of HOW to do some uncommon task.

ZFS can also do de-duping(**), but vast quantities of RAM & L2ARC are required, on the order of 4-6GB RAM or more per TB of storage.

(*) on two raidz-1 (similar to raid-5) pools with 4x1TB drives each in my home server, and another two raidz-1 pools with 4x2TB each in a zfs rsync backup server i built at work. and several experimental zfs VMs (running in kvm on a zfs zvol, with numerous additional zvols added to build their own zpools with).

(**) on the whole, de-duping is one of those things that sounds like a great feature but isn't all that compelling in practice. it's cheaper and far more effective to add more disks than to add the extra RAM required - even with the current price of 8GB sticks.

craig

-- craig sanders <cas@taz.net.au>

BOFH excuse #435: Internet shut down due to maintenance
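To make the send/receive point concrete, a minimal sketch of an incremental replication run (the pool, dataset, host and snapshot names here are invented for the example):

zfs snapshot export/home@monday
zfs send export/home@monday | ssh backuphost zfs receive -v backup/home
# ...next day, only the blocks changed since @monday cross the wire...
zfs snapshot export/home@tuesday
zfs send -i export/home@monday export/home@tuesday | ssh backuphost zfs receive -v backup/home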

On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
ZFS can also do de-duping(**) but vast quantities of RAM & L2ARC required, on the order of 4-6GB RAM or more per TB of storage
http://configure.ap.dell.com/dellstore/config.aspx?oc=u421102au&c=au&l=en&s=... t110-2

If you have a need for 15k rpm SAS disks then you are limited to a maximum size of about 600G per disk. If you get a server like a Dell PowerEdge T110 then you have a maximum of 4 disks (1.8TB of RAID-5) and 32G of RAM. So using 7-11G of RAM to save some of that 1.8T of storage could be useful.

There are a bunch of options from Dell for servers (and similar options from other companies), and the ratios of RAM to storage for most of the systems with internal disks are such that if you want RAID-5 or RAID-6 of SAS disks then you can use 4-6G of RAM per TB of storage and still have plenty spare.

Of course if you want to use large SATA disks with SSD and other forms of cache then things are totally different.

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Thu, Apr 05, 2012 at 09:54:45PM +1000, Russell Coker wrote:
On Thu, 5 Apr 2012, Craig Sanders <cas@taz.net.au> wrote:
ZFS can also do de-duping(**) but vast quantities of RAM & L2ARC required, on the order of 4-6GB RAM or more per TB of storage
http://configure.ap.dell.com/dellstore/config.aspx?oc=u421102au&c=au&l=en&s=...
If you have a need for 15k rpm SAS disks then you are limited to a maximum size of about 600G per disk. If you get a server like a Dell PowerEdge T110 then you have a maximum of 4 disks (1.8TB of RAID-5) and 32G of RAM. So using 7-11G of RAM to save some of that 1.8T of storage could be useful.
yeah, well, i'm totally unconvinced that 15k RPM SAS drives actually provide any noticeable performance benefit that's even close to being worth the price/GB, even compared to the price of enterprise-quality 1, 2, or 3TB 7200rpm SATA drives... and even more so when compared to consumer-quality SATAs(*).

especially when you can use SSDs for read and write caching of your drives (buffer your random writes through an SSD, as with the ZFS ZIL, and they end up being mostly sequential writes). sure, you can use those SSDs to cache 15k drives too... but is the result significantly better than using the same SSDs to cache 7200rpm drives?

sometimes price doesn't matter and the highest possible I/O performance at any price is required. those times are very rare. far more rare than the salesmen with the slick $150K+ SAN brochures would have you believe.

and even then, IMO, you'd be better off using, say, 240GB SSDs rather than 300GB SAS drives - 20% less storage but many times the IOPS, for roughly the same price. even 500-ish GB SSDs aren't that much more than 600GB 15k SAS disks: about $860 for a 480GB Intel 520 vs about $575 for an IBM 15k 600GB SAS... 80,000 IOPS vs what, maybe 1000? and an Intel 520 is far from the best performing SSD around.

if you've got thousands to spend then a TB or two of SSD on a PCI-e card (no SATA or SAS bottleneck) beats any currently available SAS or SATA disk or SSD by a huge margin. you can also fit more 2.5" SSDs in a 1 or 2RU server than 3.5" SAS drives.

if top performance is your requirement then, IMO, 15k drives are the wrong answer.

(*) a large part of the point of RAID is that it is a Redundant Array of *Inexpensive* Disks. enterprise drives fail on that particular point. the disks are meant to be cheap and replaceable commodity parts.
Of course if you want to use large SATA disks with SSD and other forms of cache then things are totally different.
true. and for de-duping, ZFS will use your L2ARC (e.g. SSD) as well as your ARC (RAM) for the dupe hash tables. I still don't think it's worthwhile in the general case.

8GB sticks are cheap enough now that I could upgrade my home server from 16GB to 32GB for not too much money... but even though one of my zpools has a LOT of duplicate data (rsync backups of linux systems) I still don't think it's worth the bother. i'd rather use that extra RAM for disk caching or for VMs. and upgrade the backup zpool from 4x1TB to 4x2TB. or just save the money and wait for the inevitable improvements :)

craig

-- craig sanders <cas@taz.net.au>
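For anyone who wants to see what they'd actually gain before spending the RAM, a rough sketch (pool and dataset names are made up for the example, and zdb -S can itself be slow and memory-hungry on a large pool):

zdb -S export                     # simulate dedup on the existing data; prints a DDT histogram and an estimated ratio
zfs set dedup=on export/backups   # if it looks worthwhile, enable it for new writes to one dataset only
zpool get dedupratio export       # the ratio actually being achieved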

and even then, IMO, you'd be better off using, say, 240GB SSDs rather than 300GB SAS drives - 20% less storage but many times the IOPS, for roughly the same price. even 500-ish GB SSDs aren't that much more than 600GB 15k SAS disks, about $860 for a 480GB Intel 520 vs about $575 for an IBM 15k 600GB SAS....80,000 IOPS vs what, maybe 1000?
According to the HP Configureaider:

300GB 2.5" 10KRPM SAS - $422
300GB 2.5" 15KRPM SAS - $772
200GB 2.5" MLC SSD SAS - $4696

I'd be careful about sticking your $860 SSD into a server if you require any sort of write performance or durability, you might find you get just what you paid for. OTOH, if your workload is read-only (or read-mostly), a cheap(er) SSD may be well worth the investment vs 15KRPM disks.
(*) a large part of the point of RAID is that it is a Redundant Array of *Inexpensive* Disks. enterprise drives fail on that particular point. the disks are meant to be cheap and replacable commodity parts.
... or Redundant Array of *Independent* Disks, as if the name tells you what sort of disks you should be using anyway.

At 15KRPM you can read a single track in half the time, and therefore at twice the speed. You would also get additional gyroscopic stability, although I don't know if that makes a difference in reality... someone (on LUV I think) mentioned that consumer grade disks' performance suffered much more when placed in an environment with vibration (eg adjacent to other seeking disks in a server).

James

On Fri, 6 Apr 2012, James Harper <james.harper@bendigoit.com.au> wrote:
and even then, IMO, you'd be better off using, say, 240GB SSDs rather than 300GB SAS drives - 20% less storage but many times the IOPS, for roughly the same price. even 500-ish GB SSDs aren't that much more than 600GB 15k SAS disks, about $860 for a 480GB Intel 520 vs about $575 for an IBM 15k 600GB SAS....80,000 IOPS vs what, maybe 1000?
According to the HP Configureaider:
300GB 2.5" 10KRPM SAS - $422 300GB 2.5" 15KRPM SAS - $772 200GB 2.5" MLC SSD SAS - $4696
I'd be careful about sticking your $860 SSD into a server if you require any sort of write performance or durability, you might find you get just what you paid for.
Even if the $4696 SSD doesn't happen to be better than the $860 SSD, there's the issue that many companies want to buy everything from the same place. Sometimes if HP (or whoever the preferred provider is) doesn't sell it then it's not going in the server and we just have to work with that.
OTOH, if your workload is read-only (or read-mostly), a cheap(er) SSD may be well worth the investment vs 15KRPM disks.
(*) a large part of the point of RAID is that it is a Redundant Array of *Inexpensive* Disks. enterprise drives fail on that particular point. the disks are meant to be cheap and replacable commodity parts.
... or Redundant Array of *Independent* Disks, as if the name tells you what sort of disks you should be using anyway.
http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf The original paper on RAID says that the I stands for "Inexpensive" and contrasts it to "Single Large Expensive Disks (SLED)".
At 15KRPM you can read a single track in half the time, and therefore twice the speed.
You can read a track in half the time and on average read one sector in half the time once the heads are on the correct track. The head movement time won't be changed due to a different rotational speed. However the SCSI/SAS disks have bigger magnets for moving the heads (open some drives and inspect them) so the head movement time is probably lower in such disks. I will have to bring some dead disks to a LUV meeting so interested people can look inside them. Would anyone be interested in this before the next meeting?
You would also get additional gyroscopic stability although I don't know if that makes a difference in reality...
Such additional gyroscope action would mean more forces on the spindle and more possibility for things to wear out.
someone (on LUV I think) mentioned that consumer grade disks performance suffered much more when placed in an environment with vibration (eg adjacent to other seeking disks in a server).
The only case I'm directly aware of where disks suffered badly in production from this concerned enterprise grade disks, and the cause was vibration from system cooling fans - other systems of the same make and model didn't cause the same performance problems due to manufacturing differences in the fans. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Thu, Apr 05, 2012 at 11:57:05PM +0000, James Harper wrote:
According to the HP Configureaider:
300GB 2.5" 10KRPM SAS - $422 300GB 2.5" 15KRPM SAS - $772 200GB 2.5" MLC SSD SAS - $4696
that SSD price alone is *precisely* why i'm dubious about the claims of enterprise vendors. there is just no valid or justifiable reason why an SAS MLC SSD should cost anywhere near that much. I can partially buy the argument that higher quality magnetic disks have substantially increased manufacturing costs, but not to the point of believing that they're worth it. SSDs have no such manufacturing-cost excuse (higher quality parts may justify a few percent extra, perhaps as much as 5%, but the manufacturing processes and tolerances would be the same).

i think it's also subverting the point of raid (to have an array of cheap disks so that one or more disk failures won't lose your data) for nothing but commercial gain for the vendor.

more to the point, i think it's shameless profiteering from people's fairly natural CYA motivation, which leads to decisions to buy the 2 or 5 or 10 or more times as expensive "enterprise"-labelled product even though it's only a few percent better or a few percent less likely to fail. this is compounded when consultants get involved - their percentage markup on overpriced goods is far fatter than the same markup on reasonably priced goods. thus they are motivated to keep the gravy-train rolling.

(OK, this is only an issue for me personally at work, where i'm sometimes only allowed to use such absurdly overpriced gear. it's not my money, so i suppose i shouldn't care... but it offends me to see so much money wasted on so little, when the money could be better used for other projects or other things. or, worse, to see a project completely derailed because someone higher up wants to enterprisify the project and it dies or gets lost in a budget approval committee - same thing - because it has changed from costing a few hundred or a few thousand dollars to tens of thousands).

IMO there are two key words required for successful raid: 1. "cheap", 2. "lots". "expensive" contradicts both of them. sure, if the price difference was reasonable (say 10, 20, or even 30% more), i'd choose the "enterprise" drives over the cheaper ones. but the difference is not reasonable. not even close.

consider, for example, the 300GB drive for $422 above. to get a raid6 array of 3TB you'd need at least 12 of them: $5064. To get a raid array of 3TB in consumer SATA disks, you could have 2 x 3TB @ $200 in RAID-1 ($400), or 5 x 1TB @ $95 in RAID6 ($475). Both options cost about the same as a *single* 300GB 10K SAS drive.

you can also add more identical drives in RAID-1 for more speed and redundancy (e.g. 12x$200 = $2400, half the price of the 12-disk array of 300GB SAS disks), or add them for more speed, capacity, and redundancy in RAID6.

in the 12x3TB raid-1 example, it doesn't matter if 11 of those cheaper drives die. you've still got a complete copy of all your data on the remaining drive. you also get the speed of all 12 disks for reads (vs the speed of only 10 for raid6, as 2 are lost to parity). for raid6, well, you can only afford to lose 2 drives, so buy some extra hot and cold spares.

so, comparing 12x3TB SATA raid1 vs 12x300GB raid6 - is it really worth paying more than twice as much for 2/11ths of the redundancy and 5/6ths of the read performance (and significantly impaired write performance)?

or if you were using 12 x 240GB mid-range consumer-grade SSDs, you'd get 80% of the capacity and *thousands* of times the performance for half the price of the 10k 300GB SAS drives.

and that's comparing against the price of only 10K rpm drives. The figures are much worse for 15K rpm drives.
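For what it's worth, the two layouts being compared would be built something like this with mdadm (device names are made up, and these are alternatives rather than steps to run together; a 12-way mirror is unusual but mdadm will happily build one):

# option 1: twelve-way RAID-1 - any 11 drives can fail, reads are spread across all members
mdadm --create /dev/md0 --level=1 --raid-devices=12 /dev/sd[b-m]1

# option 2: the same twelve drives as RAID-6 - two-drive redundancy, ten drives' worth of capacity
mdadm --create /dev/md0 --level=6 --raid-devices=12 /dev/sd[b-m]1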
I'd be careful about sticking your $860 SSD into a server if you require any sort of write performance or durability, you might find you get just what you paid for.
yep, pretty much identical performance and reliability without the "rip me off, i'm an idiot" price tag. also note comments about raid and redundancy above. i can afford to buy a lot more redundancy if the unit prices are cheaper.

btw, resyncing a raid array of SSDs is far less likely to overstress and kill the remaining drives than resyncing an array of magnetic drives - no moving parts, random access, not as heat sensitive, much smaller time window until the array is fully synced, etc.
(*) a large part of the point of RAID is that it is a Redundant Array of *Inexpensive* Disks. enterprise drives fail on that particular point. the disks are meant to be cheap and replacable commodity parts.
... or Redundant Array of *Independent* Disks, as if the name tells you what sort of disks you should be using anyway.
Russell's already addressed that. It IS supposed to be Inexpensive, but enterprise vendors have distorted that for their own benefit...apparently with some success.
At 15KRPM you can read a single track in half the time, and therefore twice the speed.
with an SSD you don't need to wait for the head to get around to the right part of the disk to start reading again. read as many blocks as you want from wherever you want without rotational or head-movement delays. and without the heat caused by the platters spinning at such high speeds, or the extra power consumption of same.
You would also get additional gyroscopic stability although I don't know if that makes a difference in reality... someone (on LUV I think) mentioned that consumer grade disks performance suffered much more when placed in an environment with vibration (eg adjacent to other seeking disks in a server).
*ALL* mechanical disks can and will suffer from vibration problems if the environment they're in or their mounting is sufficiently bad. SSDs won't, of course. craig -- craig sanders <cas@taz.net.au> BOFH excuse #106: The electrician didn't know what the yellow cable was so he yanked the ethernet out.

On Thu, 5 Apr 2012, Craig Sanders wrote:
Of course if you want to use large SATA disks with SSD and other forms of cache then things are totally different.
true. and for de-duping, ZFS will use your L2ARC (e.g. SSD) as well as your ARC (RAM) for the dupe hash tables. I still don't think it's worthwhile in the general case.
8GB sticks are cheap enough now that I could upgrade my home server from 16GB to 32GB for not too much money...but even though one of my zpools has a LOT of duplicate data (rsync backups of linux systems) I still don't think it's worth the bother. i'd rather use that extra RAM for disk caching or for VMs. and upgrade the backup zpool from 4x1TB to 4x2TB. or just save the money and wait for the inevitable improvements :)
Or use the proper software for the job. backuppc with rsync already dedups, and since it knows the use-case it operates on (it knows files are going to be identical between backups, and it knows that individual blocks are not likely to be duplicates and hence are irrelevant to test against), it can do it with far fewer resources (my backuppc server is a laptop with an esata disk, and the retention time is far greater than the time I've had it in operation. My oldest backups are from December 2008) -- Tim Connors

On Fri, Apr 06, 2012 at 06:03:16PM +1000, Tim Connors wrote:
On Thu, 5 Apr 2012, Craig Sanders wrote:
[...deduping uses lots of RAM and/or L2ARC on zfs. not worth it...]
Or use the proper software for the job. [...] backuppc with rsync already dedups,
i've tried backuppc. i didn't like it. performance was abysmal, and the backup was inaccessible via normal filesystem tools, which, to me, defeats the purpose of online disk-based backups. see prev. thread on this list (with misspelt "backkuppc" in subject, with two "k"s or something like that, because i mistyped it).

also, IMO, zfs + rsync + snapshots do a MUCH better job. or 'zfs send | zfs receive' when backing up a zfs mount. IMO, backuppc's hard-link farm and the loss of normal filesystem level access is a far worse price to pay for de-duping than lots of RAM... so it's even less worthwhile than doing it on ZFS.

disks are huge and cheap these days. the problem that backuppc's deduping feature was written to circumvent is not actually a problem any more.

craig

-- craig sanders <cas@taz.net.au>
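A minimal sketch of that zfs + rsync + snapshots approach, reusing the dataset layout from the zpool history earlier in the thread (the rsync flags and daily snapshot naming are illustrative, not prescriptive):

rsync -aHx --delete root@kali:/ /backup/hosts/kali/   # pull the host into its own dataset, browsable with normal tools
zfs snapshot backup/kali@$(date +%Y-%m-%d)            # freeze today's state; snapshots are cheap
zfs list -t snapshot -r backup/kali                   # every previous run is kept as its own snapshot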

On Thu, 5 Apr 2012, Craig Sanders wrote:
Which would make a total of 12 drives in the machine. More than it can hold, and more than my 8 sata ports will be happy with. If anything, I'll contemplate doing this with 3TB drives, once they drop in price enough.
yeah, i'm waiting for 3TB drives to get around the $100 mark before i upgrade my zpools. they were slowly heading in that direction before the thailand floods last year but have now stabilised at around $200. maybe in a year or so.
MSY has WD Green 3TB for $195, and Seagate 3TB (barracuda, i think) for $219 - WD Green drives are OK but be wary of TLER issues with a raid-card in JBOD mode rather then IT mode.
Should be sooner than that. ozbargains the other day listed an *external* 3TB drive for $160 or so, at either dick smith or jb hifi of all places. How they can be cheaper with a plastic case, power supply and usb/esata electronics, I can't fathom. -- Tim Connors

On Fri, Apr 06, 2012 at 05:46:29PM +1000, Tim Connors wrote:
Should be sooner than that. ozbargains the other day listed either dick smith or jb hifi of all places, an *external* 3TB drive for $160 or so. How they can be cheaper with plastic case, powersupply and usb/esata electronics, I can't fathom.
it might not be a 3TB drive. might be something like 2 x 1.5TB from some disk manufacturers old, excess stock with a dodgy raid0 adapter between the usb interface and the drives. craig -- craig sanders <cas@taz.net.au> BOFH excuse #240: Too many little pins on CPU confusing it, bend back and forth until 10-20% are neatly removed. Do _not_ leave metal bits visible!

On 05/04/12 19:52, Craig Sanders wrote:
MSY has WD Green 3TB for $195, and Seagate 3TB (barracuda, i think) for $219 - WD Green drives are OK but be wary of TLER issues with a raid-card in JBOD mode rather then IT mode.
Avoid the WD "Green" drives -- not only does the performance suck, but WD have messed around with the power saving options. You can't adjust them from the OS via smartctl or hdparm any more, and by default the drive will spin down *every eight seconds* if idle. Even on a mostly-idle Linux system, this results in a spin-up at least once per minute. Yeah. Expect your spindle motor to burn out in a few months. You can adjust it via a special WD utility, but it only works if you reboot into a DOS shell. And, as I said, the performance sucks. Seagate have some 5400 energy-efficient disks too, but reports on the internet suggest their firmware is both buggy AND has terrible performance. (It tries to realign sector writes to match the 4k format, but doesn't seem to do it well) The Samsung 2TB 5400 rpm drives seem like the best option at the moment. I currently have one of those, and two WD greens, and even though they're all correctly aligned for the 4k sectors and stuff, the Samsung is at least 50% faster! Do watch out for that sector alignment gotcha on Linux.. New tools tend to get it right, old ones won't. -Toby

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
Avoid the WD "Green" drives -- not only does the performance suck, but WD have messed around with the power saving options. You can't adjust them from the OS via smartctl or hdparm any more, and by default the drive will spin down every eight seconds if idle.
Even on a mostly-idle Linux system, this results in a spin-up at least once per minute. Yeah. Expect your spindle motor to burn out in a few months.
When did WD do that? I've got a 1TB WD green drive that doesn't have such problems. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On 10/04/12 11:33, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale<toby.corkindale@strategicdata.com.au> wrote:
Avoid the WD "Green" drives -- not only does the performance suck, but WD have messed around with the power saving options. You can't adjust them from the OS via smartctl or hdparm any more, and by default the drive will spin down every eight seconds if idle.
Even on a mostly-idle Linux system, this results in a spin-up at least once per minute. Yeah. Expect your spindle motor to burn out in a few months.
When did WD do that? I've got a 1TB WD green drive that doesn't have such problems.
All of them, I think -- I've seen it on 750G, 1500G and 2000G drives at least. What does the output of this look like on your drives?

smartctl -a /dev/sdb | grep Load_Cycle_Count

(replace /dev/sdb with appropriate drive(s))

Toby

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384

So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.

I've just tried "hdparm -S0 /dev/sdb" but that hasn't stopped the Load Cycle Count from increasing. So it seems that your claims are correct. I guess that my WD Green drive is just quiet enough that it can spin down and up without me noticing.

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
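For anyone else wanting to check their own drives, one crude way to measure the rate is to sample the counter twice, a few minutes apart (sdb here is just an example device):

smartctl -A /dev/sdb | grep Load_Cycle_Count
sleep 600
smartctl -A /dev/sdb | grep Load_Cycle_Count
# the difference between the two raw values, times 6, is roughly the load cycles per hour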

On 10/04/12 12:21, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale<toby.corkindale@strategicdata.com.au> wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384
So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.
Nope, it's not lying -- your drive really has spun down and up again 1.2 million times. The drives are specced as being good for 300k load cycles, so you've done well! But you're also living on borrowed time now.. Also consider that every time it spins up, you'll get a delay on that read or write, which *might* cause a RAID system to kick the drive out of the array, too.
I've just tried "hdparm -S0 /dev/sdb" but that hasn't stopped the Load Cycle Count from increasing. So it seems that your claims are correct.
I guess that my WD Green drive is just quiet enough that it can spin down and up without me noticing.
Yeah, they're good on the acoustic side of things, and they don't get too hot either, compared to 7200 rpm drives. If you pull the drive out and put your ear to it, you can clearly hear it spinning up and down though. Toby

On 10/04/12 12:21, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale<toby.corkindale@strategicdata.com.au> wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384
So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.
Nope, it's not lying -- your drive really has spun down and up again 1.2 million times. The drives are specced as being good for 300k load cycles, so you've done well! But you're also living on borrowed time now..
Further to my previous email about Load_Cycle_Count, Wikipedia says this: " Count of load/unload cycles into head landing zone position.[19] The typical lifetime rating for laptop (2.5-in) hard drives is 300,000 to 600,000 load cycles.[20] Some laptop drives are programmed to unload the heads whenever there has not been any activity for about five seconds.[21] Many Linux installations write to the file system a few times a minute in the background.[22] As a result, there may be 100 or more load cycles per hour, and the load cycle rating may be exceeded in less than a year.[23] " Maybe you are thinking of Start_Stop_Count for spindle spin up and down cycles?? James

On 10/04/12 12:38, James Harper wrote:
On 10/04/12 12:21, Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale<toby.corkindale@strategicdata.com.au> wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384
So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.
Nope, it's not lying -- your drive really has spun down and up again 1.2 million times. The drives are specced as being good for 300k load cycles, so you've done well! But you're also living on borrowed time now..
Further to my previous email about Load_Cycle_Count, Wikipedia says this:
" Count of load/unload cycles into head landing zone position.[19]
The typical lifetime rating for laptop (2.5-in) hard drives is 300,000 to 600,000 load cycles.[20] Some laptop drives are programmed to unload the heads whenever there has not been any activity for about five seconds.[21] Many Linux installations write to the file system a few times a minute in the background.[22] As a result, there may be 100 or more load cycles per hour, and the load cycle rating may be exceeded in less than a year.[23] "
Maybe you are thinking of Start_Stop_Count for spindle spin up and down cycles??
Yep, I was thinking of Start_Stop_Count in terms of what it did. Sorry.

Still, the WD 3.5" drives are supposedly only rated for 300k *load cycles*, so it's still a problem that the drives rack them up so amazingly quickly.

It looks like someone has reverse-engineered the DOS utility to disable/adjust the feature, though, so you can do it from Linux now. http://idle3-tools.sourceforge.net/

-Toby
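For reference, using it looks roughly like this (the device name is an example, and the option letters are as I understand the tool - check its help output; as noted further down the thread, the drive apparently needs a full power cycle rather than just a reboot before the new setting takes effect):

idle3ctl -g /dev/sdb    # read the current idle3 (head-parking) timer
idle3ctl -d /dev/sdb    # disable the timer entirely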

On Tue, Apr 10, 2012 at 12:46:56PM +1000, Toby Corkindale wrote:
It looks like someone has reverse-engineered the DOS utility to disable/adjust the feature, though, so you can do it from Linux now. http://idle3-tools.sourceforge.net/
thanks for that. compiled and installed fine on all my systems with WD drives. have disabled the idle timeout. now to power-cycle the machines (apparently a reboot isn't enough). might stop my WD drives being occasionally booted from the zpool by my SAS card. craig -- craig sanders <cas@taz.net.au> BOFH excuse #333: A plumber is needed, the network drain is clogged

On Tuesday 10 April 2012 12:46:56 Toby Corkindale wrote:
It looks like someone has reverse-engineered the DOS utility to disable/adjust the feature, though, so you can do it from Linux now. http://idle3-tools.sourceforge.net/
Nice find, though it looks like it doesn't work with USB connected WD Green drives (6400AAV)... :-(

idle3-tools-0.9.1$ sudo ./idle3ctl -g /dev/sdb
sg16(VSC_ENABLE) failed: Invalid exchange

idle3-tools-0.9.1$ sudo smartctl -A /dev/sdb | grep "^193"
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2571

-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

Chris Samuel wrote:
On Tuesday 10 April 2012 12:46:56 Toby Corkindale wrote:
It looks like someone has reverse-engineered the DOS utility to disable/adjust the feature, though, so you can do it from Linux now. http://idle3-tools.sourceforge.net/
Nice find, though it looks like it doesn't work with USB connected WD Green drives (6400AAV)... :-(
idle3-tools-0.9.1$ sudo ./idle3ctl -g /dev/sdb
sg16(VSC_ENABLE) failed: Invalid exchange

idle3-tools-0.9.1$ sudo smartctl -A /dev/sdb | grep "^193"
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2571
What USB-SATA bridge are you using? IME most/all don't pass SMART correctly.

On Friday 20 April 2012 22:01:55 Trent W. Buck wrote:
What USB-SATA bridge are you using?
The one built into the drive - I just plug a USB cable into it.. :-)
IME most/all don't pass SMART correctly.
Yeah, I suspect that's quite likely what's happening here. Damn. :-( cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP

On 20/04/12 16:25, Chris Samuel wrote:
On Tuesday 10 April 2012 12:46:56 Toby Corkindale wrote:
It looks like someone has reverse-engineered the DOS utility to disable/adjust the feature, though, so you can do it from Linux now. http://idle3-tools.sourceforge.net/
Nice find, though it looks like it doesn't work with USB connected WD Green drives (6400AAV)... :-(
If it's any consolation, neither does the original utility from WD, since it only works in DOS, which didn't support USB drives at all.
idle3-tools-0.9.1$ sudo ./idle3ctl -g /dev/sdb
sg16(VSC_ENABLE) failed: Invalid exchange

idle3-tools-0.9.1$ sudo smartctl -A /dev/sdb | grep "^193"
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2571
2571 isn't that high, really.. The drives are rated to 300k load cycles*, and as we've seen from Russell, they don't spontaneously combust even with >1M cycles. Maybe the USB drive is set up differently? * The 300k rating is quoted all around the internet, but I never did manage to find an authoritative source from WD themselves.

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
Nope, it's not lying -- your drive really has spun down and up again 1.2 million times. The drives are specced as being good for 300k load cycles, so you've done well! But you're also living on borrowed time now..
Also consider that every time it spins up, you'll get a delay on that read or write, which might cause a RAID system to kick the drive out of the array, too.
So far in 4 years I haven't had the drive kicked out of the array so it seems unlikely to happen. It could be that some latency problems on that system are related to drive spin-up. I've set the drive in question to write-mostly which may alleviate that. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384
So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.
I'm pretty sure that Load_Cycle_Count is not the platters spinning up and down, it's the heads parking away from the platters which I seem to recall someone said was to reduce drag and therefore save a few precious electrons, or maybe for some other reason. James

On Tue, 10 Apr 2012 12:21:30 pm Russell Coker wrote:
On Tue, 10 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au>
wrote:
What does the output of this look like on your drives?
smartctl -a /dev/sdb | grep Load_Cycle_Count
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1284384
So if that's not lying (and SMART data is often false) then the drive would have spun down an average of about 30 times per hour during the life of the system.
I've just tried "hdparm -S0 /dev/sdb" but that hasn't stopped the Load Cycle Count from increasing. So it seems that your claims are correct.
I guess that my WD Green drive is just quiet enough that it can spin down and up without me noticing.
See also https://wiki.archlinux.org/index.php/Advanced_Format#Special_Consideration_f... -- Anthony Shipman Mamas don't let your babies als@iinet.net.au grow up to be outsourced.

Toby Corkindale wrote:
Do watch out for that sector alignment gotcha on Linux.. New tools tend to get it right, old ones won't.
That is the case as far as partition alignment at 4k blocks (parted /dev/sda align-check optimal 1), but Ted Ts'o had a blog post indicating it was non-trivial to also align the mdadm and lvm layers. That was for the much larger SSD write-erase blocks, but IIRC it would apply equally to 4k blocks. Can't find the article now, though... :-/
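For anyone wanting to audit their own stack, the usual checks look roughly like this (device names are examples; the idea is that each layer's starting offset should be a multiple of the physical sector size):

parted /dev/sda align-check optimal 1          # is partition 1 aligned to the drive's optimal I/O size?
cat /sys/block/sda/queue/physical_block_size   # 4096 on an advanced-format drive
mdadm --examine /dev/sda1 | grep -i offset     # 1.x metadata reports the data offset in 512-byte sectors
pvs -o +pe_start                               # where LVM starts allocating extents on each PV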

On Tue, 10 Apr 2012, Trent W. Buck wrote:
Toby Corkindale wrote:
Do watch out for that sector alignment gotcha on Linux.. New tools tend to get it right, old ones won't.
That is the case as far as partition alignment at 4k blocks (parted /dev/sda align-check optimal 1), but Ted Ts'o had a blog post indicating it was non-trivial to also align the mdadm and lvm layers. That was for the much larger SSD write-erase blocks, but IIRC it would apply equally to 4k blocks. Can't find the article now, though... :-/
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase... But poofully, right now that's a "connection refused". So interweb archive it is then! http://web.archive.org/web/20101224211915/http://thunk.org/tytso/blog/2009/0... -- Tim Connors

On Tue, 10 Apr 2012, Tim Connors wrote:
On Tue, 10 Apr 2012, Trent W. Buck wrote:
Toby Corkindale wrote:
Do watch out for that sector alignment gotcha on Linux.. New tools tend to get it right, old ones won't.
That is the case as far as partition alignment at 4k blocks (parted /dev/sda align-check optimal 1), but Ted Ts'o had a blog post indicating it was non-trivial to also align the mdadm and lvm layers. That was for the much larger SSD write-erase blocks, but IIRC it would apply equally to 4k blocks. Can't find the article now, though... :-/
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase...
But poofully, right now that's a "connection refused".
So interweb archive it is then!
http://web.archive.org/web/20101224211915/http://thunk.org/tytso/blog/2009/0...
Ho Ho Ho Ho. "And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write to an erase block even if it isn’t completely filled." (if you can't tell, I have a pet hate for people that insist fsync() is the only true way to ensure data integrity. Use a reliable filesystem instead!) -- Tim Connors

On 10/04/12 16:44, Tim Connors wrote:
(if you can't tell, I have a pet hate for people that insist fsync() is the only true way to ensure data integrity. Use a reliable filesystem instead!)
Soooo.... if you aren't using fsync(), then what happens if you write() to a file, say your database of orders, then write to the HTTP socket to tell the user their order went through, and then you have a power failure ten seconds later, before the operating system has got around to flushing that data to disk?

On Tue, 10 Apr 2012, Toby Corkindale wrote:
On 10/04/12 16:44, Tim Connors wrote:
(if you can't tell, I have a pet hate for people that insist fsync() is the only true way to ensure data integrity. Use a reliable filesystem instead!)
Soooo.... if you aren't using fsync(), then what happens if you write() to a file, say your database of orders, then write to the HTTP socket to tell the user their order went through, and then you have a power failure ten seconds later, before the operating system has got around to flushing that data to disk?
Sure. You're doing transactions. Do transactions properly. Probably outside of a filesystem altogether - put your database on a raw block device.

What if my desktop or mozilla have just updated a config file and then the power fails? Should I have to wait 40 seconds in which the entire system activity freezes, and I have no guarantee which version I get, or should I have to wait no time at all and have no guarantee which version I get? As long as it is atomically renamed either way, I'll happily go with the version that doesn't unnecessarily spin up my spun-down disks, and doesn't cause me to wait for something unimportant to happen.

-- Tim Connors
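For reference, the pattern being argued about - write the new version to a temporary file and atomically rename it over the old one, with or without forcing it to disk first - looks roughly like this as a sketch (the file names and the $new_settings variable are invented; dd's conv=fsync stands in for an explicit fsync() call):

# write the new contents to a temp file on the same filesystem as the target
printf '%s\n' "$new_settings" | dd of=/home/user/.app/prefs.tmp conv=fsync 2>/dev/null
# rename() is atomic: a reader (or a crash) sees either the complete old file or the complete new one
mv /home/user/.app/prefs.tmp /home/user/.app/prefs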

On Tue, 10 Apr 2012, Toby Corkindale wrote:
On 10/04/12 16:44, Tim Connors wrote:
(if you can't tell, I have a pet hate for people that insist fsync() is the only true way to ensure data integrity. Use a reliable filesystem instead!)
Soooo.... if you aren't using fsync(), then what happens if you write() to a file, say your database of orders, then write to the HTTP socket to tell the user their order went through, and then you have a power failure ten seconds later, before the operating system has got around to flushing that data to disk?
Sure. You're doing transactions. Do transactions properly. Probably outside of a filesystem altogether - put your database on a raw block device.

What if my desktop or mozilla have just updated a config file and then the power fails? Should I have to wait 40 seconds in which the entire system activity freezes, and I have no guarantee which version I get, or should I have to wait no time at all and have no guarantee which version I get? As long as it is atomically renamed either way, I'll happily go with the version that doesn't unnecessarily spin up my spun-down disks, and doesn't cause me to wait for something unimportant to happen.
I have a SATA disk in my laptop with 4G of SSD onboard as a cache. It can stay spun down for a while even if writes are done - no need to spin it up again until the cache is nearly full. I'd be focussing more on your storage medium than your filesystem.

The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick. v1.0 of AmigaOS was a classic example of this - it would indicate that a disk operation was complete and tell the user they could eject the disk while it was still writing the final few sectors - someone forgot the (AmigaOS equivalent of) fsync().

OTOH, sticking fsyncs everywhere in your code just because it seems like a good idea is to be frowned upon...

James

James Harper <james.harper@bendigoit.com.au> wrote:
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case. If the memory stick is removed without being unmounted properly then I expect the file system to be corrupt and do not expect recently written data to be intact.

On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 write(4, "test\n\n", 6) = 6 fsync(4) = 0 stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0 close(4) I've just run strace on vim and it calls fsync. If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;) -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Russell Coker wrote:
On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 write(4, "test\n\n", 6) = 6 fsync(4) = 0 stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0 close(4)
I've just run strace on vim and it calls fsync.
If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;)
FYI,

execve("/usr/bin/emacs", ["emacs", "-Q", "/tmp/y"], [/* 46 vars */]) = 0
[...]
access("/tmp/", W_OK) = 0
lstat64("/tmp/.#y", {st_mode=S_IFLNK|0777, st_size=25, ...}) = 0
stat64("/tmp/y", 0xbe9b7c48) = -1 ENOENT (No such file or directory)
symlink("twb@elba.12820:1333989873", "/tmp/.#y") = -1 EEXIST (File exists)
readlink("/tmp/.#y", "twb@elba.12820:1333989873"..., 100) = 25
open("/tmp/y", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4
write(4, "fnord.", 6) = 6
fsync(4) = 0
close(4) = 0
stat64("/tmp/y", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
lstat64("/tmp/.#y", {st_mode=S_IFLNK|0777, st_size=25, ...}) = 0
readlink("/tmp/.#y", "twb@elba.12820:1333989873", 100) = 25
unlink("/tmp/.#y") = 0
write(3, "\33[49;1H\33[?25lWrote /tmp/y\33[K\33[2;"..., 50) = 50
lstat64("/tmp/y", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
[...]

Trent W. Buck <trentbuck@gmail.com> wrote:
FYI,
execve("/usr/bin/emacs", ["emacs", "-Q", "/tmp/y"], [/* 46 vars */]) = 0 [...] access("/tmp/", W_OK) = 0 lstat64("/tmp/.#y", {st_mode=S_IFLNK|0777, st_size=25, ...}) = 0 stat64("/tmp/y", 0xbe9b7c48) = -1 ENOENT (No such file or directory) symlink("twb@elba.12820:1333989873", "/tmp/.#y") = -1 EEXIST (File exists) readlink("/tmp/.#y", "twb@elba.12820:1333989873"..., 100) = 25 open("/tmp/y", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4 write(4, "fnord.", 6) = 6 fsync(4) = 0 close(4) = 0
Excellent. The two most important editors are covered.

On Wed, Apr 11, 2012 at 10:22:02AM +1000, Jason White wrote:
Excellent. The two most important editors are covered.
there are other important editors? craig -- craig sanders <cas@taz.net.au> BOFH excuse #69: knot in cables caused data stream to become twisted and kinked

On Wed, 11 Apr 2012, Trent W. Buck wrote:
Russell Coker wrote:
On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 write(4, "test\n\n", 6) = 6 fsync(4) = 0 stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0 close(4)
I've just run strace on vim and it calls fsync.
If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;)
FYI,
execve("/usr/bin/emacs", ["emacs", "-Q", "/tmp/y"], [/* 46 vars */]) = 0 [...] access("/tmp/", W_OK) = 0 lstat64("/tmp/.#y", {st_mode=S_IFLNK|0777, st_size=25, ...}) = 0 stat64("/tmp/y", 0xbe9b7c48) = -1 ENOENT (No such file or directory) symlink("twb@elba.12820:1333989873", "/tmp/.#y") = -1 EEXIST (File exists) readlink("/tmp/.#y", "twb@elba.12820:1333989873"..., 100) = 25 open("/tmp/y", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4 write(4, "fnord.", 6) = 6 fsync(4) = 0 close(4) = 0
Gah! When did they fix that "bug"? Maybe I have managed to turn that mode off with one of the many possible backup behaviours of emacs. Or maybe it explains why emacs has been so slow at autosaving lately. -- Tim Connors

On Wed, 11 Apr 2012, Trent W. Buck wrote:
Russell Coker wrote:
On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4 write(4, "test\n\n", 6) = 6 fsync(4) = 0 stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0 close(4)
I've just run strace on vim and it calls fsync.
If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;)
FYI,
execve("/usr/bin/emacs", ["emacs", "-Q", "/tmp/y"], [/* 46 vars */]) = 0 [...] access("/tmp/", W_OK) = 0 lstat64("/tmp/.#y", {st_mode=S_IFLNK|0777, st_size=25, ...}) = 0 stat64("/tmp/y", 0xbe9b7c48) = -1 ENOENT (No such file or directory) symlink("twb@elba.12820:1333989873", "/tmp/.#y") = -1 EEXIST (File exists) readlink("/tmp/.#y", "twb@elba.12820:1333989873"..., 100) = 25 open("/tmp/y", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 4 write(4, "fnord.", 6) = 6 fsync(4) = 0 close(4) = 0
Gah! When did they fix that "bug"?
Maybe I have managed to turn that mode off with one of the many possible backup behaviours of emacs. Or maybe it explains why emacs has been so slow at autosaving lately.
You're not the first one to have this thought:
From http://www.gnu.org/software/emacs/manual/html_node/emacs/Customize-Save.html
" When Emacs saves a file, it invokes the fsync system call to force the data immediately out to disk. This is important for safety if the system crashes or in case of power outage. However, it can be disruptive on laptops using power saving, because it requires the disk to spin up each time you save a file. Setting write-region-inhibit-fsync to a non-nil value disables this synchronization. Be careful-this means increased risk of data loss. " James

Tim Connors wrote:
On Wed, 11 Apr 2012, Trent W. Buck wrote:
Russell Coker wrote:
On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
write(4, "test\n\n", 6) = 6
fsync(4) = 0
stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
close(4)
I've just run strace on vim and it calls fsync.
If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;)
FYI,
execve("/usr/bin/emacs", ["emacs", "-Q", "/tmp/y"], [/* 46 vars */]) = 0
[...]
fsync(4) = 0
Gah! When did they fix that "bug"?
I don't know. It surprised me as well. That test was from GNU Emacs 23.3.1 (arm-unknown-linux-gnueabi) of 2011-08-15 on cushaw, modified by Debian. Since emacs switched to bzr, it now takes 750MB of RAM and about three hours to clone the repo, and I don't have the resources to do that. (There is a git mirror, but ICBF fetching it just to check for you.)
Maybe I have managed to turn that mode off with one of the many possible backup behaviours of emacs. Or maybe it explains why emacs has been so slow at autosaving lately.
Dunno; note I was testing with -Q.
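For anyone who wants to repeat the test themselves, something along these lines should reproduce the traces quoted above (a sketch only; the exact syscall names in the filter may need tweaking for your architecture and editor):

  strace -f -e trace=open,write,fsync,fdatasync,rename,close -o /tmp/editor.trace emacs -Q /tmp/y
  grep -E 'fsync|fdatasync' /tmp/editor.trace

If the grep comes back empty after you save and quit, the editor never synced the file.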

On Tue, 10 Apr 2012, Russell Coker wrote:
On Tue, 10 Apr 2012, Jason White <jason@jasonjgw.net> wrote:
It is my (perhaps faulty) understanding that most editors omit the fsync() and that the file will only be on your memory stick when you unmount it in the typical case.
open("test", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
write(4, "test\n\n", 6) = 6
fsync(4) = 0
stat("test", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
close(4)
I've just run strace on vim and it calls fsync.
If most editors omit the fsync() as you claim then please run strace and find us an example of one - and file a bug report while you are at it. ;)
Please don't (I had to set vim to turn off fsync. I don't have to override emacs, fortunately). Some of us like our drives to remain spun down, regardless of any autosave behaviour of your editor (I could maybe tolerate fsync for hard save and not using fsync for autosave, but I still don't like my editor to delay for 5 seconds while it waits for a drive to spin up (or delay for 2 minutes while it waits for all the disk contention caused by bloatware such as mozilla) just because my fingers typed C-x C-s before asking my brain for permission).

Don't forget that sometimes (or often, in my case), we're editing over fuse filesystems to laggy ssh connections on the other side of the country. I *really* want filesystem operations to be asynchronous in such cases. As long as the file is either the new or the old version, and not zeroed out because of braindead filesystem behaviour, it makes no difference if it was committed to disk now or 5 seconds in the future.

You've still lost the new version of the file (but keep the old version) even if fsync() is called, if the power fails before the fsync() is called and returns. Not using fsync() just makes that window 5 seconds longer in normal operation. If you're saving once every 10 minutes, who cares about 5 seconds? You'll still end up with the old version of the file if you're using a sane filesystem (or nulls if you're using crap like XFS). -- Tim Connors

On Wed, 11 Apr 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
Please don't (I had to set vim to turn off fsync. I don't have to override emacs, fortunately). Some of us like our drives to remain spun down, regardless of any autosave behaviour of your editor (I could maybe tolerate fsync for hard save and not using fsync for autosave, but I still
IMHO a correctly operating editor would do autosaves under different file names (which AFAIK they ALL do) and would not call fsync() on the autosave file. As you didn't explicitly request the autosave you can't complain if it doesn't make it to disk.
don't like my editor to delay for 5 seconds while it waits for a drive to spin up (or delay for 2 minutes while it waits for all the disk contention caused by bloatware such as mozilla) just because my fingers typed C-x C-s before asking my brain for permission). Don't forget that sometimes (or often, in my case), we're editing over fuse filesystems to laggy ssh connections on the other side of the country. I *really* want filesystem operations to be asyncronous in such cases.
As has been noted there are ways of making this asynchronous. But the default should be that when an editor says it's written then it really is written.
On Wed, 11 Apr 2012, Brian May <brian@microcomaustralia.com.au> wrote:
The days you can assume that Unix servers are stored in machine rooms with guaranteed 100% reliable power feeds (e.g. using UPS) are over.
I think that apart from the very early days it's always been a minority of systems that had a UPS. When I was at university in the early 90's most of the Unix systems were workstations or small servers that appeared to be plugged in to regular power points. It is possible that one of the rooms may have had power points connected to a big UPS, but that seems unlikely. Not that a UPS is 100% reliable. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Wed, Apr 11, 2012 at 11:08:41AM +1000, Tim Connors wrote:
As long as the file is either the new or the old version, and not zeroed out because of braindead filesystem behaviour,
you've made this claim a few times about XFS. sometimes naming XFS explicitly, and sometimes not as here. i suspect it's because you don't know or don't understand how XFS works.

tl;dr version: XFS does not zero sectors due to brain-dead filesystem behaviour.

more detailed version: (dredging up half-remembered stuff from years ago when it was actually relevant or worth knowing): it gives you zeroed sectors if and only if the system crashes when the metadata has been written but the actual data has not, resulting in the file's metadata pointing to some completely arbitrary section of disk containing completely arbitrary data - it might be zeroes, it's likely to be sectors that were previously in use by some other file. it could be an old copy of /etc/shadow, or a confidential file. when xfs detects that this has happened after a crash, then it zeroes the sectors to prevent leakage of potentially confidential or security-sensitive information.

when this happens (e.g. due to power-failure or kernel lockup) choosing between the old version and the new version IS NOT AN OPTION and there is no possibility that it could even be an option. The metadata pointing to the old version has already been overwritten, and the new version never got synced to disk.

the xfs developers aren't idiots. if it was at all possible to give you either the new version or the old version in this particular situation, then they'd give it to you. they can't, so they err on the side of security and privacy.

AFAIK, this situation can only occur today if you override the defaults and force an xfs filesystem to be mounted without barriers. If you do this, you are asking for trouble and have no legitimate cause for complaint. more to the point, ext3/4 will have exactly the same problem in the same situation (hard crash, no write barriers). I'm unsure of whether they bother to zero the sectors or if they just give you whatever data happened to be in the sectors. hopefully the former.

newer filesystems, like btrfs and zfs, avoid this problem entirely because they are copy-on-write.

craig -- craig sanders <cas@taz.net.au> BOFH excuse #447: According to Microsoft, it's by design
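As a quick sanity check (a rough sketch only - the messages and option names assume a 2012-era kernel), you can confirm that barriers haven't been disabled on an XFS mount with something like:

  grep xfs /proc/mounts            # 'nobarrier' in the option list means flushes are off
  dmesg | grep -i barrier          # the kernel logs a warning if the device can't honour flushes

Barriers are the default for XFS, so unless someone has put nobarrier in fstab the zero-sector scenario described above shouldn't arise.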

On Sat, 21 Apr 2012, Craig Sanders wrote:
On Wed, Apr 11, 2012 at 11:08:41AM +1000, Tim Connors wrote:
As long as the file is either the new or the old version, and not zeroed out because of braindead filesystem behaviour,
you've made this claim a few times about XFS. sometimes naming XFS explicitly, and sometimes not as here.
i suspect it's because you don't know or don't understand how XFS works.
tl;dr version: XFS does not zero sectors due to brain-dead filesystem behaviour.
Not taking rename() as being an implicit barrier is braindead. ext4 fixed that. I don't believe XFS has, for ideological reasons. Instead of putting a small workaround that causes bugger-all impact on performance in kernel code, they insist that decades of userspace should change its behaviour instead. ...
AFAIK, this situation can only occur today if you override the defaults and force an xfs filesystem to be mounted without barriers. If you do this, you are asking for trouble and have no legitimate cause for complaint.
more to the point, ext3/4 will have exactly the same problem in the same situation (hard crash, no write barriers). I'm unsure of whether they bother to zero the sectors or if they just give you whatever data happened to be in the sectors. hopefully the former.
Except that ext3 on old defaults didn't write metadata out of order with data. -- Tim Connors

On Sat, 21 Apr 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
Not taking rename() as being an implicit barrier is braindead. ext4 fixed that. I don't believe XFS has, for ideological reasons. Instead of putting a small workaround that causes bugger-all impact on performance in kernel code, they insist that decades of userspace should change its behaviour instead.
Is relying on rename() as a barrier without using fsync(), fdatasync(), or sync() something that many applications do? The case of writing to a new file and then renaming it over the old one isn't such a common application usage pattern. It's used for updates to /etc/shadow etc (I'm sure those programs are solid), by rsync (Tridge is a great coder), and some editors (we have already established that most editors call fdatasync() etc). Where are all these broken applications? -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
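For reference, the pattern being argued about looks roughly like this in shell (a sketch; dd's conv=fsync is GNU-specific, the target path is just an example, and a fully paranoid version would also sync the containing directory, which plain shell can't easily do):

  target=/etc/motd
  tmp=$(mktemp "$target.XXXXXX")                                  # temp file on the same filesystem as the target
  printf 'new contents\n' | dd of="$tmp" conv=fsync 2>/dev/null   # data is on disk before we rename
  mv "$tmp" "$target"                                             # rename() atomically replaces the old file

Drop the conv=fsync step and you have the rename-without-fsync pattern that ext4's workaround and XFS's refusal are about.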

On Sat, 21 Apr 2012, Russell Coker wrote:
On Sat, 21 Apr 2012, Tim Connors <tconnors@rather.puzzling.org> wrote:
Not taking rename() as being an implicit barrier is braindead. ext4 fixed that. I don't believe XFS has, for ideological reasons. Instead of putting a small workaround that causes bugger-all impact on performance in kernel code, they insist that decades of userspace should change its behaviour instead.
Is relying on rename() as a barrier without using fsync(), fdatasync(), or sync() something that many applications do?
Um. Yes. http://mail.opensolaris.org/pipermail/zfs-discuss/2009-March/027379.html

Believing that, somehow, "metadata" is more important than "other data" should have been put to rest with UFS. Yes, it's easier to "fsck" the filesystem when the metadata is correct and that gets you a valid filesystem but that doesn't mean that you get a filesystem with valid contents. ... As long as POSIX believes that systems don't crash, then clearly there is nothing in the standard which would help the argument on either side. It is a "quality of implementation" property.

Apparently, Ts'o feels that reordering filesystem operations is fine. http://mail.opensolaris.org/pipermail/zfs-discuss/2009-March/027389.html

Pragmatically, it is much easier to change the file system once, than to test or change the zillions of applications that might be broken.

Fortunately, the transaction groups in zfs seem to take care of write ordering: "AFAIUI, the ZFS transaction group maintains write ordering, at least as far as write()s to the file would be in the ZIL ahead of the rename() metadata updates."

I haven't read the rest of the thread yet, but that thread makes for interesting reading. I trust solaris guys when it comes to application and operating system reliability more than I trust some linux people.
The case of writing to a new file and then renaming it over the old one isn't such a common application usage pattern.
Um. OK.
It's used for updates to /etc/shadow etc (I'm sure those programs are solid), by rsync (Tridge is a great coder),
rsync uses fsync? You might want to run your strace again, because I've got news for you. I'm glad it doesn't, because that sure as hell would have slowed down the sync I did of my photos yesterday after running gpscorrelate. -- Tim Connors

On Sat, 21 Apr 2012, Tim Connors wrote:
On Sat, 21 Apr 2012, Craig Sanders wrote:
On Wed, Apr 11, 2012 at 11:08:41AM +1000, Tim Connors wrote:
As long as the file is either the new or the old version, and not zeroed out because of braindead filesystem behaviour,
you've made this claim a few times about XFS. sometimes naming XFS explicitly, and sometimes not as here.
i suspect it's because you don't know or don't understand how XFS works.
tl;dr version: XFS does not zero sectors due to brain-dead filesystem behaviour.
Not taking rename() as being an implicit barrier is braindead. ext4 fixed that. I don't believe XFS has, for ideological reasons. Instead of putting a small workaround that causes bugger-all impact on performance in kernel code, they insist that decades of userspace should change its behaviour instead.
Speaking of braindead filesystems, a few hours ago, I had a nasty power outage that my UPS didn't catch (batteries appear dead despite monitoring telling me they are all A-OK. But it has been 3 years). External disk hosting the backups lost power, but laptop remained alive. Disk was inactive at the time. That's ok, reattach drive, perform a dance with kill and mount, and restart the backup job.

And then just moments ago, I got this in my kernel logs:

Internal error xfs_btree_check_sblock at line 119 of file /home/blank/debian/kernel/release/linux-2.6/linux-2.6-3.2.4/debian/build/source_amd64_none/fs/xfs/xfs_btree.c

Awesome! I hadn't yet gotten around to syncing the FS to my new ZFS installation on the NAS that also lost power but would probably cope with it better. But that's ok, I should be able to run fsck.xfs right? Who writes this shit? -- Tim Connors

Speaking of braindead filesystems, a few hours ago, I had a nasty power outage that my UPS didn't catch (batteries appear dead despite monitoring telling me they are all A-OK. But it has been 3 years).
I've seen UPS's at around 80% of maximum load suddenly decide that they are overloaded the instant the power goes off. Apart from the above, someone decided once upon a time that with redundant PSU's on a server it would be a good idea to plug one PSU into the UPS and one directly into a mains outlet in case the UPS failed. This is stupid and also results in the above situation. James

On Sun, 22 Apr 2012, Tim Connors wrote:
And then just moments ago, I got this in my kernel logs: Internal error xfs_btree_check_sblock at line 119 of file /home/blank/debian/kernel/release/linux-2.6/linux-2.6-3.2.4/debian/build/source_amd64_none/fs/xfs/xfs_btree.c
Awesome! I hadn't yet gotten around to syncing the FS to my new ZFS installation on the NAS that also lost power but would probably cope with it better.
Awesome shit. So xfs_repair wanted it mounted first to replay the log. But mounting it caused an internal error, so it suggested blowing away the log and running xfs_repair on the unmounted filesystem. Did that, a few minor things in lost+found. 24 hours later:

Apr 23 03:48:22 dirac kernel: [29301.792677] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 341 of file /home/blank/debian/kernel/release/linux-2.6/linux-2.6-3.2.4/debian/build/source_amd64_none/fs/xfs/xfs_alloc.c. Caller 0xffffffffa0a6b072

What a fragile filesystem! Do people actually trust their data to it? I've been running xfs for 2 months now. I had been running ext2/ext3/4 trouble free since they bloody well came out. Ext4 before ext4 was marked stable! -- Tim Connors
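For the record, the sequence xfs_repair pushes you through in that situation is roughly this (a sketch; the device name is made up):

  umount /dev/sdX1
  xfs_repair /dev/sdX1                   # refuses to run if the log is dirty, tells you to mount first
  mount /dev/sdX1 /mnt && umount /mnt    # replay the log via a mount, if the mount succeeds
  xfs_repair -L /dev/sdX1                # last resort: zero the log; recent transactions are lost

which matches what happened here: the mount blew up, so -L it was.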

On 23/04/12 08:18, Tim Connors wrote:
What a fragile filesystem! Do people actually trust their data to it?
a) report problems to the XFS folks please, I've never seen anything like this before with it.
b) you might want to check your hardware out, I'm wondering if you're picking up odd hardware errors that are causing things to go wonky.
cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

On 23.04.12 08:18, Tim Connors wrote:
What a fragile filesystem! Do people actually trust their data to it? I've been running xfs for 2 months now. I had been running ext2/ext3/4 trouble free since they bloody well came out. Ext4 before ext4 was marked stable!
This may not be a fair question, Tim, but has XFS delivered on whatever promised attribute made you switch from ext[234]? Erik -- "355/113 -- Not the famous irrational number Pi, but an incredible simulation!"

On Mon, 23 Apr 2012, Erik Christiansen wrote:
On 23.04.12 08:18, Tim Connors wrote:
What a fragile filesystem! Do people actually trust their data to it? I've been running xfs for 2 months now. I had been running ext2/ext3/4 trouble free since they bloody well came out. Ext4 before ext4 was marked stable!
This may not be a fair question, Tim, but has XFS delivered on whatever promised attribute made you switch from ext[234]?
Was it faster at making hardlinks and other metadata changes like I thought it would be? No, alas. That's why I have mounted the filesystem ro, and am currently rsyncing it across to something with a better proven track record. Hopefully it will finish rsyncing by mid-May!

Incidentally, since xfs seems to insist that the smallest error in read-write mode (even though it appeared to be a read error rather than a write error) immediately aborts the filesystem (whereas ext4 gives me the option of errors={continue,remount-ro,panic}), is there a way to convince it not to error out the entire filesystem for a read-only mounted filesystem so I can carry on and retrieve most of it? I assume it will error out in readonly mode when it stumbles across that particular error in the B-tree. -- Tim Connors
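One thing that may help with pulling the data off (a sketch; assumes the filesystem is damaged enough that log replay itself is the problem, and the device name is an example):

  mount -o ro,norecovery /dev/sdX1 /mnt    # XFS: mount read-only without replaying the log
  rsync -aHAX /mnt/ /somewhere/safe/

norecovery avoids touching the log at all, although files involved in the unreplayed transactions may still come across missing or stale.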

\begin{rant}
James Harper wrote:
I have a SATA disk in my laptop with 4G of SSD onboard as a cache. It can stay spun down for a while even if writes are done - no need to spin it up again until the cache is nearly full. I'd be focussing more on your storage medium than your filesystem.
"My hardware doesn't exhibit this issue" is not a solution. Due to the... idiosyncratic nature of one of my sites, their only nonvolatile storage is (still) 3.5" floppies. "Everybody has SSD-backed HDDs these days" is not going to help them when an application starts fsyncing at random in the hope of forcing data onto the disks ASAP.
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non-volatile storage. Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
OTOH, sticking fsyncs everywhere in your code just because it seems like a good idea is to be frowned upon...
No argument there. \end{rant}

On 11 April 2012 10:00, Trent W. Buck <trentbuck@gmail.com> wrote:
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Same thing can happen, if, say, the power unexpectedly failed. The days you can assume that Unix servers are stored in machine rooms with guaranteed 100% reliable power feeds (e.g. using UPS) are over. Even in the city - have had a number of short power "glitches" lately. -- Brian May <brian@microcomaustralia.com.au>

Brian May <brian@microcomaustralia.com.au> wrote:
On 11 April 2012 10:00, Trent W. Buck <trentbuck@gmail.com> wrote:
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Same thing can happen, if, say, the power unexpectedly failed.
Yes, but then one can't reasonably expect everything to be intact after the power returns either. At least with fsync(), anything saved shortly before the power failure should be on the medium, along with any associated metadata changes, but one can expect a corrupt file system nevertheless.
The days you can assume that Unix servers are stored in machine rooms with guaranteed 100% reliable power feeds (e.g. using UPS) are over.
Even in the city - have had a number of short power "glitches" lately.
As have I out in a Melbourne suburb, but the UPS took care of it. I recently left home with my laptop, logged into my home system via the laptop with ssh and received a message stating that power had just returned. The messages from apcupsd and the system logs showed that a half-hour power outage had taken place shortly after I left.

The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non- volatile storage.
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Windows users have that expectation and have that expectation met. When Windows tells me my copy is done, my copy is done and I can yank the memory stick. Yanking it in the middle of the copy will obviously cause a problem but that's not what we're talking about here. James

On Wed, 11 Apr 2012, James Harper wrote:
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non- volatile storage.
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Windows users have that expectation and have that expectation met. When Windows tells me my copy is done, my copy is done and I can yank the memory stick. Yanking it in the middle of the copy will obviously cause a problem but that's not what we're talking about here.
As far as I'm aware, Windows turns off all write caching on all removable media to achieve this. It doesn't do the equivalent for internal disks, so you end up with interesting corruption when power fails. Also, you'll note that windows write performance to external media is woeful because of this. I'll stick with unix behaviour, TYVM. -- Tim Connors

On 11/04/12 11:38, Tim Connors wrote:
On Wed, 11 Apr 2012, James Harper wrote:
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non- volatile storage.
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Windows users have that expectation and have that expectation met. When Windows tells me my copy is done, my copy is done and I can yank the memory stick. Yanking it in the middle of the copy will obviously cause a problem but that's not what we're talking about here.
As far as I'm aware, Windows turns off all write caching on all removable media to achieve this. It doesn't do the equivalent for internal disks, so you end up with interesting corruption when power fails.
So do user-friendly distributions of Linux. Insert a FAT-formatted USB stick into a modern Ubuntu box, and it'll automount it for you, with the "flush" option set. According to the man page:

flush  If set, the filesystem will try to flush to disk more early than normal. Not set by default.

So, they're still trying to meet users' expectations that if a "copy" command has finished, then it means their data has been copied. Which, really, is quite fair. Although it seems the Linux approach is a compromise -- it's not totally synchronous behaviour, so performance is probably OK, but the window for data loss if a user yanks the stick or kills the power is reduced.

I don't understand the protestations against fsync though! Anything else is totally unfair. If you say "Your data is saved/copied/whatever!" when it isn't on disk, then *the operating system is lying to the user*. Why would that be OK?? -Toby
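For reference, the same behaviour can be had from a manual mount (a sketch - the device and mountpoint are examples):

  mount -o flush,uid=$(id -u) /dev/sdb1 /media/usb

flush is only implemented for the FAT family, which conveniently is what most sticks ship with.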

Toby Corkindale wrote:
On 11/04/12 11:38, Tim Connors wrote:
On Wed, 11 Apr 2012, James Harper wrote:
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non- volatile storage.
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Windows users have that expectation and have that expectation met. When Windows tells me my copy is done, my copy is done and I can yank the memory stick. Yanking it in the middle of the copy will obviously cause a problem but that's not what we're talking about here.
As far as I'm aware, Windows turns off all write caching on all removable media to achieve this. It doesn't do the equivalent for internal disks, so you end up with interesting corruption when power fails.
So do user-friendly distributions of Linux.
FSVO user-friendly = idiot / Windows refugee friendly.
I don't understand the protestations against fsync though! Anything else is totally unfair. If you say "Your data is saved/copied/whatever!" when it isn't on disk, then *the operating system is lying to the user*. Why would that be OK??
We have registered post and normal post, and registered post has guaranteed delivery. Obviously we should make ALL post registered, because otherwise you never really know if the letter arrived at all!

On Wed, 11 Apr 2012, Trent W. Buck wrote:
I don't understand the protestations against fsync though! Anything else is totally unfair. If you say "Your data is saved/copied/whatever!" when it isn't on disk, then *the operating system is lying to the user*. Why would that be OK??
We have registered post and normal post, and registered post has guaranteed delivery. Obviously we should make ALL post registered, because otherwise you never really know if the letter arrived at all!
Let's ban UDP! ;P -- Tim Connors

Toby Corkindale wrote:
On 11/04/12 11:38, Tim Connors wrote:
On Wed, 11 Apr 2012, James Harper wrote:
The problem with omitting the fsync in the case of an application is that you are violating your contract with the user. If I save a document then when the application says that the save is complete my document had damn well better be on my memory stick.
That is why there is a manual umount, instead of just "unplug and hope". It is not the application's responsibility to attempt to ensure bits hit non- volatile storage.
Users that expect to be able to just unplug a stick and walk away at any time DESERVE to have that expectation violated.
Windows users have that expectation and have that expectation met. When Windows tells me my copy is done, my copy is done and I can yank the memory stick. Yanking it in the middle of the copy will obviously cause a problem but that's not what we're talking about here.
As far as I'm aware, Windows turns off all write caching on all removable media to achieve this. It doesn't do the equivalent for internal disks, so you end up with interesting corruption when power fails.
So do user-friendly distributions of Linux.
FSVO user-friendly = idiot / Windows refugee friendly.
With attitudes like that, the lack of Linux desktop market penetration continues to not surprise me. The average computer user doesn't care about your ideals of fsync and unmounting disks before removing them. They just want something that works. Calling them idiots says more about you than it does about them. James

James Harper wrote:
So do user-friendly distributions of Linux.
FSVO user-friendly = idiot / Windows refugee friendly.
With attitudes like that, the lack of Linux desktop market penetration continues to not surprise me.
If you are under the impression that I give a flying fuck about linux on the desktop, you are wrong. Except insofar as developers use it as an excuse to make my life harder by introducing horrible shit that seeps its way into my servers, like say dbus or polkit or upstart. I am not an advocate, I am not interested in advocacy and AFAIK we haven't been discussing it.
The average computer user doesn't care about your ideals of fsync and unmounting disks before removing them. They just want something that works. Calling them idiots says more about you than it does about them.
It sounds like what you call "I want it to just work" is what I would call "I want it to just work even though I didn't read the manual". If that's your target market, I would recommend you include a hardware interlock, such that the user *can't* remove a mounted drive, rather than trying to solve it purely in software by permanently crippling performance. This approach (hardware interlocks) can be found on Macintosh 3.5" floppy drives and (I suppose) on slot-loading optical drives.

James Harper <james.harper@bendigoit.com.au> wrote:
With attitudes like that, the lack of Linux desktop market penetration continues to not surprise me.
Consider the following pertinent quote from http://linux.oneandoneis2.org/LNW.htm "The point is to make Linux the best OS that the community is capable of making. Not for other people: For itself. The oh-so-common threats of "Linux will never take over the desktop unless it does such-and-such" are simply irrelevant: The Linux community isn't trying to take over the desktop. They really don't care if it gets good enough to make it onto your desktop, so long as it stays good enough to remain on theirs. The highly-vocal MS-haters, pro-Linux zealots, and money-making FOSS purveyors might be loud, but they're still minorities." The entire article is worth reading and the above statements should be taken in context. Undoubtedly it's controversial, but I think it offers real insight into the nature of the Linux community, then and now.

Tim Connors wrote:
As far as I'm aware, Windows turns off all write caching on all removable media to achieve this. It doesn't do the equivalent for internal disks, so you end up with interesting corruption when power fails.
Would those be the "internal disks" in hot-swappable SATA bays? ;-)

Tim Connors wrote:
"And avoiding fsync() in applications will also be helpful, since a cache flush operation will force the SSD to write to an erase block even if it isn’t completely filled."
(if you can't tell, I have a pet hate for people that insist fsync() is the only true way to ensure data integrity. Use a reliable filesystem instead!)
Have you met my good friend libeatmydata? It's an LD_PRELOAD wrapper that turns fsync, sync into noops.
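Usage is about as simple as it gets (a sketch; the Debian package installs a wrapper script so you rarely need to set LD_PRELOAD by hand, and the exact library path can vary):

  eatmydata dpkg -i some-huge-package.deb
  # or, explicitly:
  LD_PRELOAD=libeatmydata.so dpkg -i some-huge-package.deb

Every fsync()/sync() the wrapped process makes becomes a no-op, which is exactly as dangerous as it sounds.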

On 2012-04-05 15:17, Toby Corkindale wrote:
On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N number of 1.5TB drives, running in RAID5. At one point, these drives stopped becoming available, so the last time I extended the array, I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old.
Consider buying some more 2TB disks (at $125 a pop they're not dear), and then building a new array.
This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files.
Also note that mdadm(1) states:

Grow   Grow (or shrink) an array, or otherwise reshape it in some way. Currently supported growth options including changing the active size of component devices and changing the number of active devices in RAID levels 1/4/5/6, changing the RAID level between 1, 5, and 6, changing the chunk size and layout for RAID5 and RAID6, as well as adding or removing a write-intent bitmap.

I've never tried the above, so don't know how reliable it is, but you may be game enough to give it a go. -- Regards, Matthew Cengia
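The add-a-disk case from that paragraph looks something like this (a sketch - the device names and device count are examples, and you'd grow the filesystem or PV on top afterwards):

  mdadm --add /dev/md0 /dev/sdh1
  mdadm --grow /dev/md0 --raid-devices=8
  cat /proc/mdstat                 # watch the reshape grind along

The reshape runs with the array online, but it's slow and a good time to have backups.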

On Thu, Apr 5, 2012 at 3:27 PM, Matthew Cengia <mattcen@gmail.com> wrote:
On 2012-04-05 15:17, Toby Corkindale wrote:
On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N number of 1.5TB drives, running in RAID5. At one point, these drives stopped becoming available, so the last time I extended the array, I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old.
Consider buying some more 2TB disks (at $125 a pop they're not dear), and then building a new array.
This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files.
Also note that mdadm(1) states:
Grow   Grow (or shrink) an array, or otherwise reshape it in some way. Currently supported growth options including changing the active size of component devices and changing the number of active devices in RAID levels 1/4/5/6, changing the RAID level between 1, 5, and 6, changing the chunk size and layout for RAID5 and RAID6, as well as adding or removing a write-intent bitmap.
I've never tried the above, so don't know how reliable it is, but you may be game enough to give it a go.
Works fine, I've grown this array from being 3x1.5TB drives to its current state of 7x, incrementing by one each time. Never an issue. However, I believe Toby was stressing the point of being able to do this with drives of varying size, which would help out in this current situation, where to do this with mdadm, I need to carve off a partition to suit the existing array, rather than just being able to throw the entire 2TB drive at the array, and have it magically deal with it. / Brett

On 05/04/12 15:30, Brett Pemberton wrote:
On Thu, Apr 5, 2012 at 3:27 PM, Matthew Cengia <mattcen@gmail.com <mailto:mattcen@gmail.com>> wrote:
On 2012-04-05 15:17, Toby Corkindale wrote:
On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N number of 1.5TB drives, running in RAID5. At one point, these drives stopped becoming available, so the last time I extended the array, I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old.
Consider buying some more 2TB disks (at $125 a pop they're not dear), and then building a new array.
This time, build it with ZFS (or maaaaaybe btrfs if you dare), as with those you can add more disks (of variable size) later and rebalance files.
Also note that mdadm(1) states:
Grow   Grow (or shrink) an array, or otherwise reshape it in some way. Currently supported growth options including changing the active size of component devices and changing the number of active devices in RAID levels 1/4/5/6, changing the RAID level between 1, 5, and 6, changing the chunk size and layout for RAID5 and RAID6, as well as adding or removing a write-intent bitmap.
I've never tried the above, so don't know how reliable it is, but you may be game enough to give it a go.
Works fine, I've grown this array from being 3x1.5TB drives to its current state of 7x, incrementing by one each time. Never an issue.
However, I believe Toby was stressing the point of being able to do this with drives of varying size, which would help out in this current situation, where to do this with mdadm, I need to carve off a partition to suit the existing array, rather than just being able to throw the entire 2TB drive at the array, and have it magically deal with it.
Exactly. I think you can only get away with that if the system has filesystem-level information - which is why btrfs and zfs can do it, but it doesn't work at a block-device level with mdadm. Toby

On Thu, Apr 05, 2012 at 03:38:55PM +1000, Toby Corkindale wrote:
Exactly. I think you can only get away with that if the system has filesystem-level information - which is why btrfs and zfs can do it, but it doesn't work at a block-device level with mdadm.
and even they expect disks of the same size. e.g. if you have 4x1TB drives in a raidz-1 vdev(*) and replace them one-by-one with 2TB drives then ZFS won't increase the capacity of the pool until *ALL* four have been replaced.

OTOH if you have just individual drives in your pool (similar to raid-0), then it recognises the increased capacity as soon as you add the new drive or replace an old one with it. or if you are replacing a mirror vdev (similar to raid-1) it recognises the increased size as soon as both mirrors are replaced. you can also add a single drive to an existing pool.

(*) zfs pools ("zpools") are made up of one or more vdevs ("virtual devices") which consist of one or more physical devices. multiple physical devices in a vdev can be configured as mirrors, or as raid-z{1,2,3} (the final digit refers to the number of parity drives in the vdev, so raidz-1 is like raid5 and raidz-2 is like raid6). multiple vdevs in a pool are handled like raid-0 layered over the vdevs. so, e.g., two raidz-1 vdevs in a pool are like raid-50. and two vdevs with a single drive each are like raid-0.

craig -- craig sanders <cas@taz.net.au> BOFH excuse #93: Feature not yet implemented
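In command form, that layering looks roughly like this (a sketch - the pool and device names are made up):

  zpool create tank raidz1 sdb sdc sdd sde     # one raidz-1 vdev, like raid5
  zpool add tank raidz1 sdf sdg sdh sdi        # second vdev; the pool now stripes across both, like raid-50
  zpool set autoexpand=on tank                 # pick up extra space once every disk in a vdev has been replaced with a bigger one
  zpool status tank

autoexpand (or a manual 'zpool online -e') is what claims the extra capacity after the last small disk in a vdev has been swapped out.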

On 05/04/12 15:30, Brett Pemberton wrote:
However, I believe Toby was stressing the point of being able to do this with drives of varying size
I don't believe you can do this with btrfs: if you define a RAID1 array (for instance) with different-size drives you will get ENOSPC when you fill the smaller one as it can no longer distribute to both. I think this happens with both RAID0 and RAID10 also (there is no RAID5/6 support merged yet). cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
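For completeness, the multi-device setup being described is just (a sketch - btrfs-progs syntax of the time, device names invented):

  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd
  mount /dev/sdb /mnt
  btrfs filesystem df /mnt        # shows how data and metadata are spread

and it's when the smallest member fills up that the ENOSPC Chris describes turns up.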

On Thu, Apr 5, 2012 at 5:42 PM, Trent W. Buck <trentbuck@gmail.com> wrote:
Brett Pemberton wrote:
Works fine, I've grown this array from being 3x1.5TB drives to its current state of 7x, incrementing by one each time.
7 disks with only 1 parity? I hope you keep good backups.
Yep. The drives from my previous array (500GB models) are in a similar array on an old system, which backs up essential data from this array, and documents non-essential data, so it can be re-created if necessary. If I lost two drives on this array, there would be no tears. / Brett

On Thu, 5 Apr 2012, Toby Corkindale <toby.corkindale@strategicdata.com.au> wrote:
On 05/04/12 13:46, Brett Pemberton wrote:
A home system has N number of 1.5TB drives, running in RAID5. At one point, these drives stopped becoming available, so the last time I extended the array, I used a 2TB drive. Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB Is there a better plan?
If two disks have failed and aren't available commercially any more, I'd say it's likely the rest will go sooner rather than later because they're all getting too old.
I've got a bunch of servers with RAID-1 arrays as small as 20G. There's one server that I'd really like to reinstall but it has been running on a pair of 20G disks since 2006 without a break.

http://etbe.coker.com.au/2008/10/14/some-raid-issues/
http://etbe.coker.com.au/2012/02/06/reliability-raid/

That said, I wouldn't be using RAID-5 for large storage that's important. Even RAID-1 has issues. It's best to use something like BTRFS or ZFS for reliability.

On Thu, 5 Apr 2012, Chris Samuel <chris@csamuel.org> wrote:
I don't believe you can do this with btrfs, if you define a RAID1 array (for instance) with different size drives you will get ENOSPC when you fill the smaller one as it can no longer distribute to both.
Surely if you tell it to use RAID-1 and give it more than 2 disks then it can somehow work things out? -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

Brett Pemberton wrote:
Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB
IMO, best practice is to leave it unused until *all* 1.5s are replaced with 2s, at which point you just tell mdadm to grow to use the extra 500 on each disk. (If you have a partition table, you probably will have to fiddle that first.)
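When that point is reached, the growing itself is only a couple of commands (a sketch - assumes the partitions have already been enlarged and the array is /dev/md0 with LVM on top):

  mdadm --grow /dev/md0 --size=max     # use all of the space on every member
  pvresize /dev/md0                    # let LVM see the new size

or resize2fs/xfs_growfs instead of pvresize if the filesystem sits directly on the md device.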

On Thu, Apr 5, 2012 at 5:38 PM, Trent W. Buck <trentbuck@gmail.com> wrote:
Brett Pemberton wrote:
Now that a 1.5TB drive has failed, I'm replacing it with another 2TB drive, and wondering the best way to use the remainder 500GB
IMO, best practice is to leave it unused until *all* 1.5s are replaced with 2s, at which point you just tell mdadm to grow to use the extra 500 on each disk. (If you have a partition table, you probably will have to fiddle that first.)
That's one way of doing it, yes. However I do need somewhere to store my OS, which currently isn't on this array (historical reasons). And as I said, I'm not really interested in a long-term solution of replacing 1.5TB drives with 2TB drives. There just aren't any big gains in it. My previous run of replacing drives was replacing 500GB ones with 1.5TB ones. So by that pattern, even going to 3TB drives is a step down. / Brett

For the public record ...
On Thu, Apr 5, 2012 at 1:46 PM, Brett Pemberton <brett.pemberton@gmail.com> wrote:
sfdisk -d /dev/sdg | sfdisk /dev/sdh
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdh1 missing
pvcreate /dev/md1
vgextend system /dev/md1
pvmove -v /dev/sdg1
Missed a step here, vgreduce -v /dev/sdg1
pvremove /dev/sdg1
mdadm --manage /dev/md1 --add /dev/sdg1
mount /, chroot into it, run grub-install
For some stupid reason I had done the above from a debian i686 rescue flash drive. So naturally when I went to chroot, I couldn't, since this is an amd64 system. So instead I hand edited my grub.cfg and inserted 'insmod raid' lines above all 'insmod lvm' lines for kernel images. Worked absolutely fine, all back up and running. And thanks for all the suggestions. / Brett
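For anyone wanting to script the same hand-edit rather than retype it (a sketch; this relies on GNU sed's one-line 'i' form, and update-grub will overwrite the change next time it runs):

  sed -i '/insmod lvm/i insmod raid' /boot/grub/grub.cfg

Re-running grub-install/update-grub from the real system afterwards is still the proper fix.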

On Sat, 7 Apr 2012, Brett Pemberton <brett.pemberton@gmail.com> wrote:
For some stupid reason I had done the above from a debian i686 rescue flash drive. So naturally when I went to chroot, I couldn't, since this is an amd64 system.
For future reference, if you create your i686 rescue disk with an amd64 kernel as a boot option then things will all work as you desire. An amd64 kernel works fine with i686 user-space (I've done that many times) and you can then chroot to an amd64 environment. The advantage of doing this is that if your flash device is too small to contain all the utilities etc for a second installation of Linux (both i686 and amd64) but big enough to have an extra kernel then you can have everything work on i686 and amd64 systems. Brett, you probably already know this, but I think it'll be useful to some other readers. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
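The chroot dance itself, for reference (a sketch - the device and LV names are examples based on this thread, not exact paths):

  mount /dev/mapper/system-root /mnt
  for d in dev proc sys; do mount --bind /$d /mnt/$d; done
  chroot /mnt grub-install /dev/sdg
  chroot /mnt update-grub

and it's the chroot step that fails when an i686 rescue userspace tries to exec amd64 binaries inside the target system.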
participants (13)
- Anthony Shipman
- Brett Pemberton
- Brian May
- Chris Samuel
- Craig Sanders
- Erik Christiansen
- James Harper
- Jason White
- Matthew Cengia
- Russell Coker
- Tim Connors
- Toby Corkindale
- Trent W. Buck