Linux Clusterfuck: Or, How To Configure Swaylock To Run on ACPI Events

12 February 2023

in which i scream

note: skip to here if you just want the solution without all the drama

I got back from a not-so-relaxing trip early this past Wednesday morning. I obviously went right tf to bed, got up around noon and puttered around for a bit, then finally opened up my laptop (running Arch) and ran a paru -Syu. After a few days of having Other Priorities there were quite a few packages to install, but no worries; I monitored the output as it ran and everything looked fine. And it was! Until. . . .

it was not fine

My suspend was broken :) Or rather, swaylock was broken, likely due to a bug in a recent update to my Wayland compositor, Hyprland. If I closed my laptop, waited for it to suspend, and opened it back up, I was greeted with a swaylock screen that was completely frozen. If I was lucky, I could tty into restarting my display manager and logging back in. If I wasn't, Hyprland itself crashed and I could only force reboot.

Now, I hear what you're saying, Linux oldhats. First of all, I should have backed my system up before updating so many packages all at once. Preferably, I should also have been using a backup utility built for Arch, rather than (yikes) logging into an X11 Plasma session just to use a very janky and unreliable Timeshift (built for Ubuntu). And to you I say: you're absolutely correct full stop I've learned my lesson and am taking recommendations for Arch-friendly backup software (preferably ones with a GUI or very helpful man pages). Unfortunately, in this case, due to entirely my own hubris and laziness, I was stuck without a recent (< 2 days) backup to rely upon.

The second obvious response to "it was a Hyprland bug screwing with swaylock" is "how on earth did you determine that?" and that question has a much more hellish interesting answer.

Mostly, this post is for my own benefit 1 month - 3 years from now when I have to go spelunking in the systemd and /etc/modprobe caverns again to fix some other basic functionality. But if you're here reading with me, maybe you'll learn a bit about what ~not~ to do when troubleshooting.

it's always NVIDIA optimus

Anyone that's bought a budget gaming/creative laptop in the past ~5 years has signed the contract with the devil that is NVIDIA Optimus.

the TRADE OFFER meme. i receive: approximately one grand; you receive: a steady stream of GPU bugs. bottom text: nvidia optimus babey

For those not familiar, NVIDIA Optimus is where your laptop comes with two GPUs: an integrated (usually Intel) GPU that's great for battery life and not much else but is generally well-behaved, and a dedicated GPU (NVIDIA) for graphically intensive tasks that is poorly behaved, does not work, causes only headaches, and is also extremely necessary for eg games or screensharing.

Because NVIDIA does shit its own very proprietary way, Linux support for NVIDIA chips is always a headache. I'm told it's much better than it was ~5yrs ago, and to that I say, Jesus, it really must have been terrible.

I've been around this rodeo before, though, so off I went a-googling and a-journald-ing.

kernel params

First off, let's check what NVIDIA has to say on the subject. They've got an entire book on NVIDIA-on-Linux with a relevant section on suspend. They say to check that your GPU supports it (mine does) and then check the kernel module parameter NVreg_EnableS0ixPowerManagement and make sure it's set to 1. Mine was set to 0! Surely, that was the issue, and we can wrap it up here. Right?
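For future me, here's a rough sketch of that check in shell. It assumes the driver lists its current settings in /proc/driver/nvidia/params with the NVreg_ prefix dropped; the helper name nvreg_value is made up for illustration.

```shell
#!/bin/sh
# nvreg_value NAME [FILE]: print the current value of an NVIDIA module parameter.
# FILE defaults to /proc/driver/nvidia/params, where each line looks like "Name: value"
# (assumption: names in that file do not carry the NVreg_ prefix).
nvreg_value() {
    name=${1#NVreg_}
    file=${2:-/proc/driver/nvidia/params}
    [ -r "$file" ] || return 1
    # Print the value for the matching key; exit non-zero if the key is absent.
    awk -F': *' -v key="$name" '$1 == key { print $2; found = 1 } END { exit !found }' "$file"
}

# Example, on a machine with the driver loaded:
#   nvreg_value NVreg_EnableS0ixPowerManagement
#
# To change the value, an option line like this goes in a file under /etc/modprobe.d/
# (the file name is up to you):
#   options nvidia NVreg_EnableS0ixPowerManagement=1
```

The new value should only take effect once the module is reloaded, so in practice a reboot (and possibly an initramfs rebuild, if the module is loaded early on your setup).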

systemd hell

Of course not. That did absolutely nothing. Next up: make sure the relevant systemd services are running. I enabled and restarted nvidia-resume.service, nvidia-suspend.service, and nvidia-hibernate.service. I even discovered a bug with nvidia-powerd.service in the process, and disabled and masked it so I could play Hollow Knight without my logs being spammed (as much, lol). No juice.

At this point, I went through my journald logs more painfully. There was no "ERROR: Process xyz dumped core" or "ERROR: Your lock utility fucked up" screaming at me, so whatever it was was failing silently. A smart, experienced unix dev probably would have started looking at recently upgraded packages at this point. I am not that user.

So, I started messing with my lock service itself. Here's the entirety of the systemd service I was using to lock my screen, lock.service:

[Unit]
Description=Lock the screen upon resume from suspend

[Service]
User=[my_username]
Environment=WAYLAND_DISPLAY=wayland-1
Environment=XDG_RUNTIME_DIR=/run/user/1000
Environment=DISPLAY=:1
ExecStart=/usr/bin/swaylock
ExecStartPost=/usr/bin/sleep 1

[Install]
WantedBy=suspend.target

This does pretty much what it says it does. Systemd is, among other things, a service manager for Linux systems which controls and allows configuration of daemons/services/other systemy things I don't yet fully understand. This file (if configured correctly and enabled) literally just tells the system "when suspend happens, execute swaylock on display 1."

First, I tried switching the suspend.target to sleep.target. Would not recommend this. Instead of being able to switch over to a terminal session and restart my display manager, this crashed my entire system and I had to force-reboot to get back into my session.

So, clearly the target was not the issue, or at least changing it was creating larger issues. I decided to do a sanity check and change the execution from /usr/bin/swaylock to echo "i am locked!". Luckily this did output a lovely i am locked! happily to my journald logs every time I closed my laptop lid and caused a suspend event. And lo and behold -- when I cracked my laptop open again, the user session was NOT frozen!
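If future me needs to repeat that sanity check, something like this sketch would generate the stand-in unit. The function name write_debug_lock_unit and the trimmed-down unit contents are my own illustration, not the exact file I used at the time.

```shell
#!/bin/sh
# write_debug_lock_unit USER [DIR]: write a copy of lock.service whose ExecStart
# only echoes a marker, so a freeze can be pinned on swaylock rather than on suspend.
# DIR defaults to the current directory; copying the result into
# /etc/systemd/system/ and reloading systemd is left to you.
write_debug_lock_unit() {
    user=$1
    dir=${2:-.}
    cat > "$dir/lock.service" <<EOF
[Unit]
Description=Debug stand-in for the lock service

[Service]
User=$user
ExecStart=/usr/bin/echo "i am locked!"

[Install]
WantedBy=suspend.target
EOF
}

# After a lid close and reopen, the marker should show up in:
#   journalctl -b -u lock.service
```

If the marker appears in the journal and the session comes back unfrozen, suspend itself is probably fine and the lock command is the thing to look at.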

It was at this point when I started actually putting the pieces together. If systemd wasn't putting anything stranger than usual in my logs on suspend/sleep events, and if my system was fine (just unlocked and therefore insecure) when I executed any other command from lock.service. . . could it be? that my issue was not, in fact, with suspend at all?

except when it isn't NVIDIA optimus

So I went digging in swaylock to see if that was the issue. I re-installed swaylock; realized I'd been using swaylock-effects the whole time, so reinstalled that; realized there were a couple different git repos for swaylock-effects which did not seem entirely reliably maintained, so switched back to swaylock; checked the most recent commits to the swaylock repo since I'd been encountering the issue, saw absolutely no changes committed in the past week; and concluded probably, swaylock wasn't the issue.

Somewhere in here, I also had the brilliant idea of disabling my lock.service, executing swaylock from a terminal, and then closing my laptop lid. This worked perfectly, no freezing whatsoever. A less bonkers lesser consumer of buggy software than I would have called it quits here and simply executed their lock service manually from here on out. But no, I had to have it automated, so on I went.

At this point, it didn't seem like the issue was with swaylock, especially as nobody else was having the issue on the swaylock github. That left Hyprland as the probable culprit: the very-pretty but deeply-still-in-development Wayland compositor (aka window manager for those of us who aren't pedants) I use. And lo and behold--- when I checked the Hyprland github, someone 12 hours before had created an issue for what looked like the same bug.

A saner smarter person than me would have waited for Hyprland devs to respond with a potential patch, which they did within ~24 hours of that issue appearing on GH. But I still wasn't confident of this being the source of the issue, with all the different moving pieces, none of which were giving me very much to go on log-wise.

james' beautiful workaround

Note: I do not make any promises as to the security of the method below. My threat model is essentially "some guy who doesn't know computers breaks into my house and picks up my laptop, which is not open." I do not take this laptop to public places and then leave it around for people to try to break into. If that's your threat model, swaylock is probably not a good bet for you, as swaylock is known to have / have had quite nonideal practices when crashing (such as simply allowing access to your session lol)

ALRIGHTY, strap in.

My brilliant strategy is to hook into the "lid close" ACPI event, rather than the sleep event via systemd. The first go-to for "I can't figure out why my systemd service isn't working and I need to hook into an ACPI event" is acpid, a very old, very fiddly util that allows you to define handlers for a variety of ACPI events. This works fine except. . . we need access to the user-level Wayland session (Hyprland, in my case), and acpid as a root level process has a harder time with that. If you try to execute a user-level graphical process such as kitty or swaylock from a script executed by acpid, you get a "Failed to open X11 display" error, meaning acpid doesn't know how to find the display to render to, or some such.

Enter (dramatic drumroll) user-acpid-rs, a companion project to acpid with exactly two stars on Gitlab, one of which is from me, yesterday, at about 3am. The first small mercy is that user-acpid is small, just a single file. The second is it's written in extremely readable fashion in Rust, which I've been spending the last few weeks at Recurse picking up.

If you're a Rustacean, into systems programming, or even if you are deeply neither, I'd go give user-acpid a look -- it's simple, readable, and does exactly what it says on the tin. If you want to customize it to other ACPI events, it also looks extensible, though for my purposes the lid close handler is fine.

Install acpid, then install user-acpid, the latter preferably in a root dir (eg /usr/bin/ not ~/.local/ ) though cargo will make this difficult and you will have to work around it and no, sudo cargo install user-acpid-rs will not cut it, even if you were silly enough to try sudoing a package manager which I would certainly never do. (I think I ran cargo build --release and then just copied the relevant executable to somewhere in my $PATH.)

Next, copy the user-acpid.service file from the user-acpid repo into ~/.config/systemd/user/ and customize the referenced install location as needed.

FINALLY, create a bash or zsh or whatever-shell-you-use script in ~/.config/user-acpid/ named button-lid-close and add whatever you want to execute on lid close to that file. In my case that's:

#!/bin/zsh
DISPLAY=:1 WAYLAND_DISPLAY=wayland-1 swaylock

chmod +x that baby, run:

systemctl daemon-reload

systemctl enable --now acpid.service

systemctl enable --user --now user-acpid.service (< the flags are important on this one)

and you should be GOOD to GO. Slap that screen down, open her up, and you should not only see swaylock up and running, you should also be able to interact and unlock your session. Wild stuff.
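The hook-script step above can be sketched as a small shell helper. The function name install_lid_hook is hypothetical, and the generated hook uses /bin/sh rather than my zsh shebang so it runs on any system; the DISPLAY and WAYLAND_DISPLAY values are just the ones from my session.

```shell
#!/bin/sh
# install_lid_hook [CONFIG_DIR]: create the button-lid-close hook and make it executable.
# CONFIG_DIR defaults to ~/.config/user-acpid, the location described above.
install_lid_hook() {
    dir=${1:-$HOME/.config/user-acpid}
    mkdir -p "$dir" || return 1
    # Quoted 'EOF' so nothing inside the hook is expanded while writing it.
    cat > "$dir/button-lid-close" <<'EOF'
#!/bin/sh
# Values copied from my session; check yours with: echo $WAYLAND_DISPLAY
DISPLAY=:1 WAYLAND_DISPLAY=wayland-1 swaylock
EOF
    chmod +x "$dir/button-lid-close"
}
```

Calling install_lid_hook with no arguments writes the hook to the default location. Running the resulting script by hand from a terminal is a quick way to confirm swaylock starts before trusting it to a lid close.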

takeaways

I honestly spared you (future me) a lot of the drama here, which is remarkable considering the length of this post. At one point, in order to figure out why some of my keyboard keys weren't working (see below), I live-booted EndeavourOS from USB and went down a rabbit hole of figuring out how the files in /sys/ get generated. Between the "maybe it's not NVIDIA" and "definitely it's not NVIDIA" states, there was also a long series of "perhaps if I toggled this boot parameter" and "what about this tweak to the configuration for /etc/systemd/system/lock.service" rabbitholes that got me nowhere. Some useful info from those rabbitholes is listed below, in case I come back to this in a year with the same poor instincts.

Part of this whole kerfuffle was just my own imposter syndrome around unix system maintenance. I have a mistaken belief that everything that goes wrong on my system is my fault for not reading carefully, for being more comfortable with a GUI for some tasks, etc. It seems like this was simply a bug introduced into my Wayland compositor a couple days ago. The lesson here is "back up frequently and well, and install packages incrementally to a functioning system when things go wrong" not "you made a fatal mistake configuring your system."

I've been running exclusively different flavors of Linux for several years, but until now mostly stuck with the "it just works" distros with built-in display environments rather than customizing my system much. Baseline Pop!_OS and Manjaro fuck, and are so much better than Windows that I didn't even bother to figure out whether there was something else I'd like better. My current laptop's combo of hardware is very fidgety and has a large community of people dedicated to making it less so at asus-linux, for whom I am very grateful; they recommend Arch, which is why I switched. Despite really loving the setup I have now, I'm still in the growing pains moving from "KDE just handles that for me" to "chasing GH issues across 5 different repos and finagling systemctl for multiple hours."

Obviously, hopefully Hyprland devs will patch this regression in behavior and all will be well and I will no longer need my workaround. Barring someone writing me about how insecure my current process is, though, I'm quite pleased with this workaround. It sets me up for handling any number of other ACPI events via custom hooks and doesn't rely on the as-established extreme unreliability of NVIDIA suspend/sleep/hibernate services.

what not to do

In the process of sorting this all out, I did a bunch of whackamole with boot options via grub and modprobe (when I still believed the issue to be NVIDIA suspend related) that various characters on stackoverflow et al recommended. None of this helped and much of it hurt. If you have an NVIDIA and/or ASUS machine, I would highly recommend NOT adding the following lines to any /etc/modprobe.d configuration files:

# !! DO NOT DO THIS 
blacklist hid-asus 
blacklist asus-nb-wmi 
blacklist nvidiafb 
blacklist rivafb 
blacklist i2c_nvidia_gpu

After a reboot, my meta key (🪟) just....fully stopped working. Unfortunately this key was bound to almost every functionality of my WM/compositor, so I freaked out thinking I'd nerfed my entire system. Eventually I decided that my meta key must simply have broken, and remapped my WM bindings to alt. I happily configured my swaylock as described above and was mildly irritated by a brand new laptop with a broken key, but mostly just happy to have a functional, locking system again.

Then I realized the backlight brightness keys weren't working. Nor the keyboard brightness keys. And when I went digging around for /sys/class/backlight/ to see what stuff was set to, I found there was no directory by that name. And in fact, most of /sys/ was empty of the expected files. Long story short: asus-nb-wmi is a non-negotiable kernel module and it should not be blacklisted. Don't do it.
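For the next time keys mysteriously stop working, here's a quick sketch for checking whether a module has been blacklisted somewhere. The helper name find_blacklisted is made up, and it only looks at .conf files in one directory (by default /etc/modprobe.d), so blacklists set elsewhere, such as on the kernel command line, would be missed.

```shell
#!/bin/sh
# find_blacklisted MODULE [DIR]: list modprobe config files that blacklist MODULE.
# modprobe treats "-" and "_" in module names as the same, so match both spellings.
find_blacklisted() {
    pattern=$(printf '%s' "$1" | sed 's/[-_]/[-_]/g')
    dir=${2:-/etc/modprobe.d}
    # -l prints only the names of matching files; exit status is non-zero if none match.
    grep -lE "^[[:space:]]*blacklist[[:space:]]+${pattern}[[:space:]]*$" "$dir"/*.conf 2>/dev/null
}

# Example:
#   find_blacklisted asus-nb-wmi
# and to see whether the module is actually loaded right now:
#   lsmod | grep asus_nb_wmi
```

If this prints any file names, removing the offending line and rebooting is what got my keys and /sys/class/backlight/ back.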

Hope you learned something if you made it this far, and to future me: just wait for the patch next time. . . It's not worth it!!