Re: How to make systemd more reliable

On Tuesday, 11 October 2016 10:30:01 PM AEDT Craig Sanders via luv-main wrote:
I was rebooting anyway in order to replace a failed SSD on one machine and convert both of them to root on ZFS. It booted up OK on both, so I made it the default. If it refrains from sucking badly enough to really piss me off for a decent length of time, i'll leave it as the default.
That's a bold move. While ZFS has been totally reliable in preserving data, I have had ongoing problems with filesystems not mounting when they should. I don't trust ZFS to be reliable as a root filesystem; I want my ZFS systems to allow me to log in and run "zfs mount -a" if necessary.
this is, of course, one of the reasons I dislike journald. logs should be plain text, so you can access them without specialised tools. and rsyslogd wasn't running in the semi-broken recovery mode that systemd dumped me in after making me wait 1m30s until it finished doing mysterious things (AFAICT, it was doing nothing except twiddling some stars on screen).
I agree that those things need to be improved. There should be a way to get to a root login while the 90 second wait is happening. There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system). -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Tue, Oct 11, 2016 at 11:13:34PM +1100, russell@coker.com.au wrote:
On Tuesday, 11 October 2016 10:30:01 PM AEDT Craig Sanders via luv-main wrote:
I was rebooting anyway in order to replace a failed SSD on one machine and convert both of them to root on ZFS. It booted up OK on both, so I made it the default. If it refrains from sucking badly enough to really piss me off for a decent length of time, i'll leave it as the default.
That's a bold move.
switching my main system to systemd? yes, i know. very bold. very risky. i'll probably regret it at some point. :)
While ZFS has been totally reliable in preserving data, I have had ongoing problems with filesystems not mounting when they should.
i know other people have reported problems like that, but it's never happened on any of my zfs machines...and i've got most of my pools plugged into LSI cards (IBM M1015 reflashed to IT mode) using the mpt2sas driver - which is supposed to exacerbate the problem due to the staggered drive spinup it does.

the only time i've ever seen something similar was my own stupid fault, i rebooted and just pulled out the old SSD forgetting that I had ZIL and L2ARC for the pools on that SSD. I had to plug the old SSD back in before I could import the pool, so i could remove them from the pool (and add partitions from my shiny new SSDs to replace them).
I don't trust ZFS to be reliable as a root filesystem; I want my ZFS systems to allow me to log in and run "zfs mount -a" if necessary.
not so bold these days, it works quite well and reliably. and i really want to be able to snapshot my rootfs and back up with zfs send rather than rsync. anyway, i've left myself an ext4 mdadm raid-1 /boot partition (with memdisk and a rescue ISO) in case of emergency.

the zfs root on my main system is two mirrored pairs (raid-10) of crucial mx300 275G SSDs(*). slightly more expensive than a pair of 500-ish GB but much better performance....read speeds roughly 4 x SATA SSD read (approximating pci-e SSD speeds), write speeds about 2 x SATA SSD. i haven't run bonnie++ on it yet. it's on my todo list. (there's a sketch of the pool layout below.)

http://blog.taz.net.au/2016/10/09/converting-to-a-zfs-rootfs/

the other machine got a pair of the same SSDs, so raid-1 rather than raid-10. still quite fast (although i'm having weirdly slow scrub performance on that machine. haven't figured out why yet. performance during actual usage is good, noticeably better than the single aging SSD I replaced).

(*) 275 marketing GB. SI units. 256 GiB in real terms. they're good value for money anyway....i got mine for $108 each. I've since seen them for $97 (itspot.com.au). MSY doesn't stock them for some reason (maybe they want to clear their stock of MX200 models first).

we're just on the leading edge of some massive drops in price/GB. a bit earlier than I was predicting, i thought we'd start seeing it next year. won't be long before 2 or 4TB SSDs are affordable for home users (you can get 2TB SSDs for around $800 now). and then I can replace some of my HDD pools.
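btw, for reference, the raid-10 layout is basically the equivalent of the following (device names and pool name are made up here, the blog post above has the real procedure and options):

    zpool create rpool \
        mirror /dev/disk/by-id/ata-Crucial_MX300_SSD-1 /dev/disk/by-id/ata-Crucial_MX300_SSD-2 \
        mirror /dev/disk/by-id/ata-Crucial_MX300_SSD-3 /dev/disk/by-id/ata-Crucial_MX300_SSD-4

i.e. two mirror vdevs, with writes striped across them.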
I agree that those things need to be improved. There should be a way to get to a root login while the 90 second wait is happening.
so there really is no way to do that? i was hoping it was just some trivially-obvious-in-hindsight thing that i didn't know. it's really annoying to have to wait and watch those damn stars when you just want to get a shell and start investigating & fixing whatever's gone wrong.
There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system).
yep. can you even access journald logs if you're booted up with a rescue disk? (genuine question, i don't know the answer but figure it's one of the things i need to know) craig -- craig sanders <cas@taz.net.au>

There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system).
After scp'ing the logs to the remote machine: journalctl --file copied.log works, assuming it hasn't been encrypted etc.
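e.g. something along these lines (hostnames and paths are just an example, and assume the source machine has a persistent journal):

    # copy the journal file(s) from the machine being investigated...
    scp root@server:/var/log/journal/*/system.journal /tmp/
    # ...and read them locally:
    journalctl --file /tmp/system.journal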
yep. can you even access journald logs if you're booted up with a rescue disk? (genuine question, i don't know the answer but figure it's one of the things i need to know)
If the journal is being kept on disk, then yes, journalctl --directory /var/log/journal/ will work. I'm not terribly happy with the current implementation of the journal. Having a structured, fast, searchable, all-in-one journal is a lovely goal however. -- Clinton Roy, Software Engineer with Bloomberg L.P.

On 12/10/16 01:31, Craig Sanders via luv-main wrote: ...
it's really annoying to have to wait and watch those damn stars when you just want to get a shell and start investigating & fixing whatever's gone wrong.
There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system).
On my todo list. It happens when I boot after failing to alter fstab to match the actual disks connected.
yep. can you even access journald logs if you're booted up with a rescue disk? (genuine question, i don't know the answer but figure it's one of the things i need to know)
Good question - went and looked: systemctl -b -1 will pull up the log from the boot before the rescue boot. --list-boots will give you the history in the database.

On 12/10/16 10:42, Allan Duncan via luv-main wrote:
On 12/10/16 01:31, Craig Sanders via luv-main wrote: ...
it's really annoying to have to wait and watch those damn stars when you just want to get a shell and start investigating & fixing whatever's gone wrong.
There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system).
On my todo list. It happens when I boot after failing to alter fstab to match the actual disks connected.
Look at man systemd-udevd. The default 180 sec timeout would work, but it seems to be retriggered in practice. There are some kernel params to tweak as well, but they seem to be intended to prolong the wait, not shorten it.

On Wed, Oct 12, 2016 at 10:42:03AM +1100, Allan Duncan wrote:
On my todo list. It happens when I boot after failing to alter fstab to match the actual disks connected.
it seems to happen at the slightest excuse, whether the machine ends up booting or not. just what everyone needs, one or more 90 second delays during the boot process. maximising downtime by putting long delays in the way of whoever's trying to fix the problem is not a good idea. is this delay configurable to something more reasonable, like 30 or 15 or even 5 seconds? can it be disabled? can i at least have a --stop-wasting-my-fucking-time option? (that would be more useful than their deprecated --kill-my-kernel-please, aka --debug)
yep. can you even access journald logs if you're booted up with a rescue disk?
Good question - went and looked: systemctl -b -1 will pull up the log from the boot before the rescue boot.
--list-boots will give you the history in the database.
neither of those work. that would be because they're journalctl options, not systemctl. also, the default for the Storage setting in /etc/systemd/journald.conf (at least in debian) is "auto", so no persistent storage of journal unless you manually create /var/log/journal (which, of course, you wouldn't do until *after* you realise you need to and only after you've figured out what needs to be done. bad default) craig -- craig sanders <cas@taz.net.au>

On 12/10/16 13:48, Craig Sanders via luv-main wrote:
On Wed, Oct 12, 2016 at 10:42:03AM +1100, Allan Duncan wrote:
On my todo list. It happens when I boot after failing to alter fstab to match the actual disks connected.
it seems to happen at the slightest excuse, whether the machine ends up booting or not. just what everyone needs, one or more 90 second delays during the boot process.
maximising downtime by putting long delays in the way of whoever's trying to fix the problem is not a good idea.
is this delay configurable to something more reasonable, like 30 or 15 or even 5 seconds? can it be disabled?
can i at least have a --stop-wasting-my-fucking-time option?
(that would be more useful than their deprecated --kill-my-kernel-please, aka --debug)
yep. can you even access journald logs if you're booted up with a rescue disk?
Good question - went and looked: systemctl -b -1 will pull up the log from the boot before the rescue boot.
--list-boots will give you the history in the database.
neither of those work. that would be because they're journalctl options, not systemctl.
Indeed - I am always mixing the two.
also, the default for the Storage setting in /etc/systemd/journald.conf (at least in debian) is "auto", so no persistent storage of journal unless you manually create /var/log/journal (which, of course, you wouldn't do until *after* you realise you need to and only after you've figured out what needs to be done. bad default)
Fedora has it commented out, so I guess that means the default is "persistent", because I have journal entries going waaay back.

After some online searching I found the fstab options nofail and x-systemd.device-timeout= given in the systemd.mount man page.

Another tidbit: https://gryzli.info/2016/06/18/systemd-systemctl-list-unit-files-timeouts/ [refers to https://github.com/systemd/systemd/issues/1961 ]
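For a Debian box the following should (untested by me) turn on persistent journald storage, and the fstab entry is a made-up example of the nofail/x-systemd.device-timeout= options from that man page:

    # journald: Debian defaults to Storage=auto with no /var/log/journal,
    # so create the directory and let journald start using it
    mkdir -p /var/log/journal
    systemd-tmpfiles --create --prefix /var/log/journal
    systemctl restart systemd-journald
    # (or set Storage=persistent in /etc/systemd/journald.conf)

    # /etc/fstab: don't hang the boot on this disk, and only wait 10s for it
    # instead of the default 90s (UUID is a placeholder)
    UUID=xxxx-xxxx  /data  ext4  defaults,nofail,x-systemd.device-timeout=10s  0  2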

On Wednesday, 12 October 2016 1:31:33 AM AEDT Craig Sanders via luv-main wrote:
the only time i've ever seen something similar was my own stupid fault, i rebooted and just pulled out the old SSD forgetting that I had ZIL and L2ARC for the pools on that SSD. I had to plug the old SSD back in before I could import the pool, so i could remove them from the pool (and add partitions from my shiny new SSDs to replace them).
Did you have to run "zfs import" on it or was it recognised automatically? If the former how did you do it? Is the initramfs configured to be able to run zfs import?
not so bold these days, it works quite well and reliably. and i really want to be able to snapshot my rootfs and backup with zfs send rather than rsync.
BTRFS snapshots are working well on the root filesystems of many systems I run. The only systems I run without BTRFS as root are systems where getting console access in the event of problems is too difficult.
the zfs root on my main system is two mirrored pairs (raid-10) of crucial mx300 275G SSDs(*). slightly more expensive than a pair of 500-ish GB but much better performance....read speeds roughly 4 x SATA SSD read (approximating pci-e SSD speeds), write speeds about 2 x SATA SSD.
i haven't run bonnie++ on it yet. it's on my todo list.
If you had 2*NVMe devices it would probably give better performance than 4*SATA and might be cheaper. That would also leave more SATA slots free.
we're just on the leading edge of some massive drops in price/GB. a bit earlier than I was predicting, i thought we'd start seeing it next year. won't be long before 2 or 4TB SSDs are affordable for home users (you can get 2TB SSDs for around $800 now). and then I can replace some of my HDD pools.
It's really changing things. For most users 2TB is more than enough storage even for torrenting movies. I think that spinning media is going to be mostly obsolete for home use soon.
I agree that those things need to be improved. There should be a way to get to a root login while the 90 second wait is happening.
so there really is no way to do that? i was hoping it was just some trivially-obvious-in-hindsight thing that i didn't know.
https://wiki.debian.org/systemd#Debugging

According to the above Wiki page you can give a kernel command-line parameter to get an early root shell. However, that shell uses tty9 (which isn't available on serial consoles), has no password (so you can't leave it enabled all the time if there's hostile console access), and has to be switched on in advance. I don't know if there is a way to do exactly what you and I want, but there is a way to do something that will get us by in most situations.
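If I'm reading that page correctly it amounts to something like the following (check the wiki page for the exact details before relying on it):

    # one-off: append to the kernel command line at the boot loader to get
    # a passwordless root shell on tty9 early in boot
    systemd.debug-shell=1
    # or enable it permanently (a bad idea if console access is hostile):
    systemctl enable debug-shell.service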
it's really annoying to have to wait and watch those damn stars when you just want to get a shell and start investigating & fixing whatever's gone wrong.
Absolutely.
There should be an easy and obvious way to display those binary logs from the system when it's not running systemd or from another system (IE logs copied from another system).
yep. can you even access journald logs if you're booted up with a rescue disk? (genuine question, i don't know the answer but figure it's one of the things i need to know)
From a quick read of the man page it appears that the -D option to journalctl might do what we want. It appears that Debian has moved to not having the binary journals so I don't have a convenient source of test data.
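From a rescue disk something like this ought to work, assuming the broken system had a persistent journal (the device and mount point are placeholders, and I haven't tested it):

    mount /dev/sdXN /mnt
    journalctl -D /mnt/var/log/journal --list-boots
    journalctl -D /mnt/var/log/journal -b -1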
-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Wed, Oct 12, 2016 at 11:18:40AM +1100, russell@coker.com.au wrote:
On Wednesday, 12 October 2016 1:31:33 AM AEDT Craig Sanders via luv-main wrote:
the only time i've ever seen something similar was my own stupid fault, i rebooted and just pulled out the old SSD forgetting that I had ZIL and L2ARC for the pools on that SSD. I had to plug the old SSD back in before I could import the pool, so i could remove them from the pool (and add partitions from my shiny new SSDs to replace them).
Did you have to run "zfs import" on it or was it recognised automatically? If the former how did you do it?
after plugging the old SSD back in? can't remember for sure, but i think so....it wasn't imported before i rebooted again so wouldn't have been automatically imported after reboot. I probably did something like: zpool import -d /dev/disk/by-id/ <poolname>
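and then, from memory, the cleanup was something like this (pool and partition names are placeholders, i didn't keep notes):

    # remove the ZIL (log) and L2ARC (cache) devices that lived on the old SSD
    zpool remove <poolname> <old-ssd-zil-partition>
    zpool remove <poolname> <old-ssd-l2arc-partition>
    # then add partitions from the shiny new SSDs to replace them
    zpool add <poolname> log mirror <new-ssd1-zil> <new-ssd2-zil>
    zpool add <poolname> cache <new-ssd1-l2arc> <new-ssd2-l2arc>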
Is the initramfs configured to be able to run zfs import?
yes, i have zfs-initramfs installed.
BTRFS snapshots are working well on the root filesystems of many systems I run. The only systems I run without BTRFS as root are systems where getting console access in the event of problems is too difficult.
yes, but you can't pipe `btrfs send` to `zfs recv` and expect to get anything useful. my backup pool is zfs. and so far, i've had 100% success rate (2/2) with zfs rootfs. Disclaimer: not a statistically significant sample size. contents may settle during transport. void where prohibited by law. serving suggestion only. batteries not included.
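the plan for backups is roughly this (pool and dataset names made up, and the snapshot names are just dates):

    # snapshot the rootfs pool recursively
    zfs snapshot -r rpool@2016-10-12
    # first backup is a full replication stream
    zfs send -R rpool@2016-10-12 | zfs recv -d backuppool/myhost
    # after that, incremental sends from the previous snapshot
    zfs send -R -i @2016-10-11 rpool@2016-10-12 | zfs recv -d backuppool/myhost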
crucial mx300 275G SSDs(*). slightly more expensive than a pair of 500-ish GB but much better performance....read speeds roughly 4 x SATA SSD read (approximating pci-e SSD speeds), write speeds about 2 x SATA SSD.
i haven't run bonnie++ on it yet. it's on my todo list.
If you had 2*NVMe devices it would probably give better performance than 4*SATA and might be cheaper. That would also leave more SATA slots free.
yes, that would certainly be a LOT faster. can't see any way it could be cheaper. i'd have to get a more expensive brand of ssd plus i'd need an nvme pci-e card or two.

However, I have SATA ports in abundance. On the motherboard, I have 6 x SATA III (4 used for the new SSDs, two previously used for the old SSDs but now spare) plus another 2 x 1.5Gbps SATA, and some e-sata which i've never used. In PCI-e slots, I have 16 x SAS/SATA3 on two IBM M1015 LSI cards (8 ports in use, 4 spare and connected to hot-swap bays, 4 spare and unconnected).

PCI-e slots are in very short supply, and my m/b doesn't have any nvme sockets. If I could find a reasonably priced PCI-e 8x NVMe card that actually supported two PCI-e NVMe drives (instead of 1 x pci-e nvme + 1 x sata m.2), i'd probably have swapped out the spare/unused M1015 cards for it. i don't have any spare 4x slots. so i did what I could to maximise performance with the hardware I have. everything I do on the machine is noticeably faster, including compiles and docker builds etc.

but yeah, eventually I'll move to PCI-e NVMe drives. sometime after my next motherboard & cpu upgrade. I'm waiting to see real-world reviews and benchmarks on the upcoming AMD Zen CPU.

Intel has some very nice (and expensive) high-end CPUs, but their low-end and mid-range CPUs are more expensive than old AMD CPUs without offering much improvement....might make sense for a new system, but not as an upgrade. Every time I look into switching to Intel, it turns out I'll have to spend around $1000 to get roughly similar performance to what I have now with a 6 year old AMD CPU. I'm not going to spend that kind of money without a really significant benefit.

I could get an AMD FX-8320 or FX-8350 CPU for under $250 but I'd rather wait for Zen and get a new motherboard with PCI-e 3.0 and other new stuff too. Just going on past history, I'm quite confident that will be significantly cheaper and better than switching to Intel...i expect around $400-$500 rather than $800-$1000.
we're just on the leading edge of some massive drops in price/GB. a bit earlier than I was predicting, i thought we'd start seeing it next year. won't be long before 2 or 4TB SSDs are affordable for home users (you can get 2TB SSDs for around $800 now). and then I can replace some of my HDD pools.
It's really changing things. For most users 2TB is more than enough storage even for torrenting movies.
btw, for torrenting on ZFS, you need to create a separate dataset with recordsize=16K (instead of the default 128K) to avoid COW fragmentation. configure deluge or whatever to download to that and then move the finished torrent to another filesystem. probably same or similar for btrfs.
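e.g. (pool/dataset names made up, and deluge is just the example client):

    zfs create -o recordsize=16K tank/torrents
    # point the torrent client's download/incomplete directory at /tank/torrents,
    # then have it move completed torrents to a normal recordsize=128K dataset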
I think that spinning media is going to be mostly obsolete for home use soon.
yep. and good riddance. i'd still want to buy them in pairs, for RAID-1/RAID-10 (actually, ZFS mirrored pairs)
i'll read through that (and the fedora one that it links to) before rebooting my myth box with systemd again. it was an unintentional reboot anyway. i'd used grub-set-default intending to reboot to systemd next time, but the thunderstorm caused a few-second power outage and the UPS for that machine died ages ago (haven't replaced it yet). was busy with other stuff and didn't even notice it was down for a few hours.
From a quick read of the man page it appears that the -D option to journalctl might do what we want. It appears that Debian has moved to not having the binary journals so I don't have a convenient source of test data.
looks like Storage=auto in /etc/systemd/journald.conf, but /var/log/journal isn't created by default. so no persistent journal by default. bad default. debian configuring it to use rsyslogd by default is a good thing, but that doesn't help when the system won't boot far enough to get rsyslogd running. persistent journal storage should be on by default. maybe even automatically turn off journald's persistence as soon as rsyslogd (or whatever external logger) successfully starts up. craig -- craig sanders <cas@taz.net.au>

On Wednesday, 12 October 2016 2:46:01 PM AEDT Craig Sanders via luv-main wrote:
yes, but you can't pipe `btrfs send` to `zfs recv` and expect to get anything useful. my backup pool is zfs.
In the early days the plan was to have btrfs receive not rely on BTRFS, so you could send a snapshot to a non-BTRFS filesystem. I don't know if this is a feature they continued with.
If you had 2*NVMe devices it would probably give better performance than 4*SATA and might be cheaper. That would also leave more SATA slots free.
yes, that would certainly be a LOT faster. can't see any way it could be cheaper. i'd have to get a more expensive brand of ssd plus i'd need an nvme pci-e card or two.
Last time I was buying there wasn't much price difference between SATA and NVMe devices. Usually buying 2 medium size devices is cheaper than 4 small devices.
PCI-e slots are in very short supply. and my m/b doesn't have any nvme sockets.
That's a problem for you then. Also, most motherboards don't support booting from NVMe at the moment; there are many ways of working around this (chain-booting from CD, USB, etc) but it's an annoyance.
Intel has some very nice (and expensive) high-end CPUs, but their low-end and mid-range CPUs are more expensive than old AMD CPUs without offering much improvement....might make sense for a new system, but not as an upgrade. Every time I look into switching to Intel, it turns out I'll have to spend around $1000 to get roughly similar performance to what I have now with a 6 year old AMD CPU. I'm not going to spend that kind of money without a really significant benefit.
I keep getting such great systems from e-waste. Mostly Intel CPUs because there seems to be a correlation between people who buy Intel CPUs and people who dispose of systems after a few years. I'll try and get some more for LUV members.
It's really changing things. For most users 2TB is more than enough storage even for torrenting movies.
btw, for torrenting on ZFS, you need to create a separate dataset with recordsize=16K (instead of the default 128K) to avoid COW fragmentation. configure deluge or whatever to download to that and then move the finished torrent to another filesystem.
probably same or similar for btrfs.
The early versions of BTRFS used a 4K allocation block size. Recent versions (the version in Debian/Jessie and maybe before) use a 16K allocation block size. There is no way for BTRFS to use the kind of large block allocation that ZFS does in that regard. This sucks for many use cases but means you don't lose anything for torrenting in this example.
i'd still want to buy them in pairs, for RAID-1/RAID-10 (actually, ZFS mirrored pairs)
The failure modes of SSD are quite different to the failure modes of spinning media. I expect it will be some years before there is adequate research into how SSDs fail and some more years before filesystems develop to work around them. ZFS and WAFL do some interesting things to work around known failure modes of spinning media, they won't be as reliable on SSD as they might be because of the spinning media optimisation. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/

On Wed, Oct 12, 2016 at 03:35:34PM +1100, russell@coker.com.au wrote:
On Wednesday, 12 October 2016 2:46:01 PM AEDT Craig Sanders via luv-main wrote:
yes, but you can't pipe `btrfs send` to `zfs recv` and expect to get anything useful. my backup pool is zfs.
In the early days the plan was to have btrfs receive not rely on BTRFS, so you could send a snapshot to a non-BTRFS filesystem. I don't know if this is a feature they continued with.
nope. from what i've read, they originally intended to make it tar compatible but tar couldn't do what they needed, so they dropped that idea.
Last time I was buying there wasn't much price difference between SATA and NVMe devices. Usually buying 2 medium size devices is cheaper than 4 small devices.
right, but there's a difference between the price of Crucial SSDs and Samsung or Intel. There's also a price difference between having to buy a pci-e nvme card and not having to buy one.
PCI-e slots are in very short supply. and my m/b doesn't have any nvme sockets.
That's a problem for you then.
well, yes, of course it is. we're talking about my system here, and why I chose 4 cheap SATA SSDs rather than two pci-e SSDs.
i'd still want to buy them in pairs, for RAID-1/RAID-10 (actually, ZFS mirrored pairs)
The failure modes of SSD are quite different to the failure modes of spinning media. I expect it will be some years before there is adequate research into how SSDs fail and some more years before filesystems develop to work around them. ZFS and WAFL do some interesting things to work around known failure modes of spinning media, they won't be as reliable on SSD as they might be because of the spinning media optimisation.
I'd still use some kind of raid-1/mirroring anyway, no matter what kind of drives I had. raid isn't a substitute for backups, but it does reduce the risk that you'll need to restore from backup (and the downtime and PITA-factor that goes along with restoring) also, there's no way for ZFS to correct any detected errors if there's no redundancy. i don't mind paying double for storage. it's a bit painful at purchase time, but that's quickly forgotten. and a lot less painful than the time and hassle required to restore from backup, and losing everything new or modified since the previous backup (nightly, but that's still up to a full day's worth of stuff that could be lost. Now that i've got rootfs on ZFS, I can snapshot frequently and backup more often with zfs send) craig -- craig sanders <cas@taz.net.au>

On Wednesday, 12 October 2016 4:03:27 PM AEDT Craig Sanders via luv-main wrote:
The failure modes of SSD are quite different to the failure modes of spinning media. I expect it will be some years before there is adequate research into how SSDs fail and some more years before filesystems develop to work around them. ZFS and WAFL do some interesting things to work around known failure modes of spinning media, they won't be as reliable on SSD as they might be because of the spinning media optimisation.
I'd still use some kind of raid-1/mirroring anyway, no matter what kind of drives I had. raid isn't a substitute for backups, but it does reduce the risk that you'll need to restore from backup (and the downtime and PITA-factor that goes along with restoring)
also, there's no way for ZFS to correct any detected errors if there's no redundancy.
The ZFS copies= feature is designed to operate in single-disk mode. The ZFS metadata has 1 more copy than the data, so even without using copies=2 you will have multiple copies of metadata on a single disk. But whether that does any good on SSD is anyone's guess at the moment.
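For reference, copies= is just a per-dataset property (the dataset name below is an example), and it only affects data written after it's set:

    zfs set copies=2 tank/important
    zfs get copies tank/important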
i don't mind paying double for storage. it's a bit painful at purchase time, but that's quickly forgotten. and a lot less painful than the time and hassle required to restore from backup, and losing everything new or modified since the previous backup (nightly, but that's still up to a full day's worth of stuff that could be lost. Now that i've got rootfs on ZFS, I can snapshot frequently and backup more often with zfs send)
It's a pity that the options for RAID on laptops are so poor. Thinkpads used to have the option of buying a disk bay to replace the CD/DVD drive but I don't know if that is still available and it was always quite expensive. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
participants (4)
- Allan Duncan
- Clinton Roy
- Craig Sanders
- Russell Coker