AMD GPU blob crashing

My work computer is a ThinkPad Z13. It’s on most of the time, including overnight and during the weekend. I’m one of those horrible people who like to just wiggle their mouse, unlock, and get working. I often leave a ton of windows open, so I quite like to sit down and start working without having to wait for the machine to boot and all my apps to launch.

Uprecords

So when I arrive at my desk on a Monday and discover my GPU has crashed, that’s a poor start to the week. The GPU crashing doesn’t completely kill the machine, just my desktop session and all the applications that were open. 😭

I see this kind of thing in the output of dmesg -Tw | grep amdgpu.

[Mon Aug 14 08:06:06 2023] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5346515, emitted seq=5346517
[Mon Aug 14 08:06:06 2023] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 3456 thread Xorg:cs0 pid 3464
[Mon Aug 14 08:06:06 2023] amdgpu 0000:63:00.0: amdgpu: GPU reset begin!
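If you want a quick tally rather than eyeballing the scrollback, something like the following could work. It’s only a rough sketch: summarise_amdgpu is a name I’ve made up, and the patterns are guessed from the three log lines above, so other kernels may word these messages differently.

```shell
#!/bin/sh
# Summarise amdgpu trouble in kernel log text read from stdin.
# summarise_amdgpu is a hypothetical helper name, not part of any package.
# The patterns are based only on the sample messages shown above.
summarise_amdgpu() {
    awk '
        /amdgpu_job_timedout/ && /ring .* timeout/ { timeouts++ }
        /amdgpu: GPU reset begin!/                  { resets++ }
        END { printf "timeouts=%d resets=%d\n", timeouts, resets }
    '
}

# Usage (may need sudo to read the kernel log):
#   sudo dmesg -T | summarise_amdgpu
#   journalctl -k | summarise_amdgpu
```

Anything other than zeros on a Monday morning means the weekend did not go well.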

I use Xorg instead of Wayland on my laptop. I’ve tried Wayland, but it’s never been great for the software I use on a daily basis, and the hardware combination I’m using. I use two external monitors, attached via a USB-C docking thing. So my desk looks a bit like this.

ThinkPad Z13 with two external screens on

Although, more accurately, like this, when the GPU driver dies.

ThinkPad Z13 with two external screens off

This was the second Monday morning in a row that the crash had happened, so I figured it was time to file a bug. I ran ubuntu-bug linux and followed the prompts. That got me bug 2031289.

Within a couple of days, I got a reply from Juerg Haefliger on the Ubuntu Kernel Team offering this suggestion.

“There are some AMD FW updates in lunar-proposed linux-firmware 20230323.gitbcdcfbcf-0ubuntu1.6. Can you give that a try?”

It’s not a good idea to enable the whole proposed pocket just for one package, as everything in there is still being tested. So instead, I just grabbed the deb via packages.ubuntu.com, then did the old sudo apt install ./linux-firmware_20230323.gitbcdcfbcf-0ubuntu1.6_all.deb dance to install it.
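To double-check that the hand-installed package really is the one in use, you could compare the version in the file name against what dpkg reports. This is a sketch, not gospel: deb_version is a helper name I’ve invented, and it assumes the usual package_version_architecture.deb file naming.

```shell
#!/bin/sh
# deb_version is a hypothetical helper: it pulls the version out of a
# conventionally named .deb file (package_version_architecture.deb).
deb_version() {
    name=${1##*/}        # drop any leading directory
    name=${name%.deb}    # drop the extension
    name=${name#*_}      # drop "package_"
    printf '%s\n' "${name%_*}"   # drop "_architecture"
}

# Usage on an Ubuntu/Debian system, after installing the file:
#   want=$(deb_version ./linux-firmware_20230323.gitbcdcfbcf-0ubuntu1.6_all.deb)
#   have=$(dpkg-query -W -f='${Version}' linux-firmware)
#   [ "$want" = "$have" ] && echo "linux-firmware $have is installed"
```

The function itself is plain shell string trimming; only the commented usage lines need dpkg.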

Four days later, the following Monday, I arrived at the office with all my fingers and toes crossed.

Launchpad comment

Great success 🥳

Juerg followed up asking if we could close the bug. I left it until today, another Monday, to make sure, then confirmed. Bug closed!

I’m awarding one hundred Internet points to Juerg for the quick and friendly bug interaction. Plus more points for doing the upload of that package in the first place, according to the changelog.

As I understand it, what I have done here is update a binary blob of GPU firmware on my machine, in the hope that it would fix a crasher. I always understood that the bad, evil, horrible people at NVIDIA made nasty binary blobs, but the Godlike do-no-wrong people at AMD only made saintly open source stuff.

Seems we still need that horrid non-free stuff, even for the “good” kind of GPU. I went looking for more info and found a thread on Reddit (spit!) from the past, with a post from an AMD person, explaining this situation.

Reddit comment

Today, I learned.