Hi everyone,

I have been experiencing some weird problems lately, starting with the OS becoming unresponsive “randomly”. After reinstalling multiple times (different filesystems, XFS and BTRFS, and different NVMe drives in different NVMe slots, all with the same result), I have narrowed it down to heavy I/O on the NVMe drive. Most of the time I can’t even pull up dmesg and have to force a shutdown, because ZSH returns an Input/Output error for every command. A couple of times I was lucky and the system stayed responsive enough for me to pull up dmesg.
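
Because the shell often dies before dmesg can be read, one option is to filter a saved kernel log afterwards (for example the output of `journalctl -k -b -1`, assuming the journal survived the crash). Below is a minimal sketch of such a filter. It assumes the relevant lines mention "nvme" and contain the text "controller is down", as in the message described in the next paragraph; `find_nvme_resets` is an illustrative name, not an existing tool.

```python
def find_nvme_resets(lines, needle="controller is down"):
    """Return the log lines that mention nvme and contain `needle` (case-insensitive)."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if "nvme" in lowered and needle in lowered:
            # Strip the trailing newline so the result prints cleanly.
            hits.append(line.rstrip("\n"))
    return hits

# Usage with a saved log file (the path is hypothetical):
#   with open("kernel.log") as f:
#       for hit in find_nvme_resets(f):
#           print(hit)
```

This only narrows down what to read; it does not interpret the messages.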

It shows a “controller is down, resetting” message, which the ArchWiki documents for some older Kingston and Samsung NVMe drives, along with kernel parameters to try. Those didn’t help much; they essentially disable ASPM on PCIe.
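
For reference, here is a minimal sketch for checking whether such parameters actually ended up on the running kernel's command line. The post does not name the exact parameters, so `nvme_core.default_ps_max_latency_us=0` and `pcie_aspm=off` are assumed here. `WANTED`, `missing_params` and `read_cmdline` are illustrative names.

```python
# Assumed parameters; the post does not list the exact ones that were tried.
WANTED = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]

def missing_params(cmdline, wanted=WANTED):
    """Return the entries of `wanted` that are absent from a kernel command line string."""
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]

def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()

# Usage on the affected machine:
#   print(missing_params(read_cmdline()))
```

An empty list means every assumed parameter is active for the current boot. Anything listed was not picked up, for example because the bootloader config was not regenerated.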

What did help a bit was reverting a recent BIOS upgrade on my MSI Z490 Tomahawk: under heavy I/O the system no longer crashes immediately but remounts the filesystem read-only instead. The issue still persists, though. I have also run memtest86 for 8 passes, with no errors.

I have tried running the LTS kernel, but that didn’t help. The strange thing is that this error does not happen on Windows 11.

Has anyone experienced this before and can give some pointers on what to try next? I’m at my wits’ end here.

EDIT: When this issue first appeared, I assumed the Kioxia drive was defective, and the manufacturer replaced it. The issue still happens with the replacement drive, as well as with the Samsung drive. I therefore assume that neither drive is defective (smartctl seems to agree).

Here are hardware and software details:

  • Arch with the latest Zen kernel, 6.7.4; it also happened with older kernels (tried regular, LTS and Zen)
  • BTRFS on LUKS
  • i9-10850K
  • MSI Z490 Tomahawk
  • G.Skill 3200 MHz RAM, 32 GB, DDR4
  • Samsung 970 Evo 1TB & Kioxia Exceria G2 1TB (tested each drive in both slots, over multiple installs)
  • Vega 56 GPU
  • Be quiet Straight Power 11 750W PSU
  • xan1242
    11 months ago

    The only thing I can think of is to try the drives in a different system and see how they behave (same OS and configuration).

    If they behave the same then that rules out everything except the drives themselves and the OS.

    Considering how you mentioned the behavior is better in Windows, it sounds like a software issue, but you never know until you try.

    • rotopenguin@infosec.pub
      11 months ago

      The other way to look at it is to stick the drives into a usb enclosure. That gets you away from the PC’s 3v3 rail. If you then hang the drive enclosure off of a powered hub/dock, you are definitely way outside of the PC’s power supply problems.

      Here’s one that I have, hopefully it’s still made halfway good. https://www.amazon.com/gp/product/B08G14NBCS/

      • xan1242
        11 months ago

        Not a bad idea actually, totally didn’t think about that.

    • Krait@discuss.tchncs.deOP
      11 months ago

      Unfortunately, I have no other system at hand at the moment that can accept NVMe drives :( I could try using Windows for a couple of days to see whether the issue is really Linux-related, but I am trying to avoid that lol

      • xan1242
        11 months ago

        Maybe even a PCIe passthrough to a VM could do the trick if you’re desperate lol (with Linux living on a separate drive)

        Orrrr maybe even try FreeBSD… (or macOS, but eww gross don’t test that)