
On Wed, 2 Jul 2014 12:34:48 Noah O'Donoghue wrote:
On 1 July 2014 12:29, Peter Ross <Petros.Listig@fdrive.com.au> wrote:
The development seems to be independent from Sun/Oracle these days. I am not aware of active contributions from Oracle but I am not 100% sure.
I think this rules out ZFS for me... The main thing in ZFS's favor was having it backed by Sun, but if it's been forked and is going off in its own direction then that kinda puts it on equal footing with btrfs from my perspective.
I disagree. Working code doesn't suddenly stop working when support disappears. The current ZFS code will continue working just as well as it currently does for the foreseeable future, and it's currently working better than any other filesystem by most objective measures. ZFS doesn't seem suitable for a Linux root filesystem but that doesn't have much impact on its utility.
To address Russell's comment on ECC RAM, I think I'm going to take the position that it's probably taking it a bit too far, at least until I see some research on non-ECC memory causing bit-rot on checksummed file systems. I tend to think faulty RAM is going to become obvious rather than hide beneath the surface, and result in symptoms like the kernel panics that Russell experienced.
Memory errors can cause corruption anywhere. While they can cause checksum failures or metadata inconsistency (which is what I've seen) they can also cause data corruption.
Also, if I am going to error-check memory then why stop at the file server? It means I have to have ECC memory in all clients that touch the data, including mobile devices, to cater for data corruption in RAM being written to disk.
For a memory corruption to corrupt stored data it has to miss corrupting anything that would cause a system crash or an application SEGV. While it is possible for a memory corruption to affect kernel data structures and make an application write to the wrong file, it is less likely to corrupt kernel data structures without crashing something. The most likely case when a client has memory corruption is that it will only affect files that you are deliberately writing to. For example, while reading mail via IMAP it's conceivable that an error might cause the deletion of a recent message you wanted to keep (just an index error on which message to delete), but it's very unlikely that your archive of mail from last year will be corrupted. Filesystem corruption, on the other hand, could affect entire sub-trees.

On Wed, 2 Jul 2014 13:08:42 Brian May wrote:
I have seen at least one computer where memory errors were resulting in silent corruption of files. This was going to be the new file server, but fortunately I noticed random seg faults occurring before deploying it. Didn't get any kernel panics. In fact I already had Samba up and running, and it seemed fine. Didn't initially realize files were silently being corrupted until later on in the debugging process (from memory).
A couple of years ago I was given a bunch of old AMD64 computers for free. I installed Linux on a PentiumD system from that batch and got lots of SEGVs from applications for no good reason (e.g. "gzip < /dev/urandom | gzip -d | gzip | gzip -d" would get a SEGV fairly quickly). Then I ran debsums and discovered that about 1% of the installed files had checksum mismatches (real errors, verified by putting the disk in another PC). That was a fairly extreme case and I'm sure that there are lots of other systems with similar errors that occur less frequently. As an aside, the RAM from that system worked perfectly in another system, so it would have been a CPU or motherboard problem. But it does show that electronic problems can cause data loss.
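For anyone who wants to run the same sort of check, debsums compares installed files against the checksums shipped with each package. A minimal run (assuming the debsums package is installed) looks something like this:

  # report only files whose checksums differ from the package records
  debsums -c

  # or check a single package, e.g. coreutils
  debsums coreutils

It won't tell you whether the cause was RAM, CPU, disk or something else, but it's a quick way to find out that installed files have been silently damaged.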
On Wed, 2 Jul 2014 14:14:29 Noah O'Donoghue wrote:

In any case I'm limited to non-ECC RAM by the form factor of my bookshelf...
There's not much you can do about that right now. But small servers with ECC RAM that would fit your shelf should appear soon enough.
I wonder if better hardware tests would be an area worth looking into, for example, monthly online memory/CPU tests, etc?
Debian has a package named "memtester" that might suit your requirements in that regard. One problem with it is that memory errors aren't always random; memtester runs from userspace so it can't test RAM that's in use by the kernel, which means an error that happens to always hit a bit of RAM holding kernel buffers wouldn't be found.
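As a rough sketch of how it might be run (the amount of memory here is just an example, you'd pick something below what's actually free on the box):

  # test 2048MB of RAM for 3 passes; could be run monthly from cron
  memtester 2048M 3

It locks the memory it's given and hammers it with various bit patterns, so running it on a live server will take that RAM away from everything else for the duration of the test.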
I wonder also if there are deterministic tests we can proactively do to catch corruption at a higher level, for example scanning any file type that includes a checksum (e.g. .zip) for corruption and comparing to previous runs.
You could do that. You could keep a list of checksums of your files and verify them, somewhat like tripwire does. But BTRFS and ZFS already do enough checks internally for this.
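If you did want to do it at the file level, a minimal sketch with sha256sum manifests (the paths here are just examples) would be something like:

  # build a manifest of checksums for everything under /data
  find /data -type f -print0 | xargs -0 sha256sum > /var/tmp/manifest.new
  # compare against the previous run to spot files that changed unexpectedly
  diff /var/tmp/manifest.old /var/tmp/manifest.new
  mv /var/tmp/manifest.new /var/tmp/manifest.old

On BTRFS or ZFS the built-in equivalent is a scrub, which re-reads all the data and verifies the internal checksums; running it regularly (e.g. from cron) keeps the window in which a failure goes unnoticed small:

  # mount point and pool name are just examples
  btrfs scrub start /data
  zpool scrub tank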
It seems the problem is uncaught hardware failure. If we minimize the window in which the failure is unknown then we increase the chance of being able to compare the source to backups and recover information.
True.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/