What should I monitor or log, and how, to find out why my headless NAS keeps becoming unavailable?

The problem:

  • Services on another machine that depends on the NAS routinely go down because its NFS mounts have dropped.
  • When that happens, sometimes a sudo mount -a recovers them.
  • Other times, the NAS is not pingable, so I go to the physical host, plug in a monitor and keyboard, and find that I can’t log in. The login screen is frozen and needs a hard reboot.
  • Often when I leave a monitor attached (VGA), I come back to a screen that says:
    critical medium error, dev sda, sector 163776752 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
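
One way to see how often this error fires, and whether it is always the same sector, is to pull the matching lines out of the kernel log. The sketch below assumes the message format shown above and a persistent systemd journal, so that kernel messages from before a hard reboot survive. medium_errors is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print
# "device sector operation" for every "critical medium error" line.
# Assumes the message layout quoted in this post.
medium_errors() {
    sed -n 's/.*critical medium error, dev \([^,]*\), sector \([0-9]*\) op [^:]*:(\([A-Z_]*\)).*/\1 \2 \3/p'
}

# Example use on the NAS (kernel messages from the previous boot,
# which only exist if the journal is stored persistently):
#   sudo journalctl -k -b -1 --no-pager | medium_errors | sort | uniq -c
```

If the same sector on sda shows up repeatedly across boots, that points at the disk itself rather than at the network or NFS.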

I started sudo smartctl -t long /dev/sda a few hours ago, and at some point since then the server that depends on the NAS lost its NFS mounts again. A simple sudo mount -a resolved it.
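
Once the long self-test finishes, sudo smartctl -l selftest /dev/sda shows its result. As a rough sketch for tracking the counters that tend to accompany medium errors, the helper below filters smartctl -A output. It assumes the ATA attribute table layout (raw value in the tenth column) and these attribute names, both of which vary by drive vendor. smart_counters is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read "smartctl -A" output on stdin and print the
# raw values of a few attributes related to bad sectors and cabling.
# Assumes the ATA attribute table format; SAS drives report differently.
smart_counters() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
        print $2, $10
    }'
}

# Example use, appending a dated snapshot to a log file:
#   { date; sudo smartctl -A /dev/sda | smart_counters; } >> "$HOME/smart-sda.log"
```

Non-zero and growing pending or reallocated sector counts would back up the medium error on screen, while a growing CRC error count would suggest a cable or controller problem instead.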

What the NAS was also doing when the network blip happened:

  • Running an rclone backup to Backblaze B2
  • Acting as NFS server for the Plex/*arr media server
  • Acting as NFS storage for a Proxmox machine (but no VMs or CTs running)
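
To line outages up with these workloads, a timestamped reachability log kept on the client could separate the two failure modes described above: the NAS not answering ping at all, versus the NAS being up while NFS is not accepting connections. This is only a sketch. It assumes iputils ping and a netcat that supports -z, and nas.example.lan is a placeholder hostname.

```shell
#!/bin/sh
# Turn the exit statuses of the ping check ($1) and the NFS port check ($2)
# into one label.
classify() {
    if [ "$1" -ne 0 ]; then
        echo host-down
    elif [ "$2" -ne 0 ]; then
        echo nfs-down
    else
        echo ok
    fi
}

# Probe a host once and print "timestamp host label".
probe() {
    host=$1
    p=0
    n=0
    ping -c 1 -W 2 "$host" >/dev/null 2>&1 || p=$?
    # nc -z only checks whether TCP port 2049 (NFS) accepts a connection.
    nc -z -w 2 "$host" 2049 >/dev/null 2>&1 || n=$?
    printf '%s %s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$host" "$(classify "$p" "$n")"
}

# Example use on the client (placeholder hostname):
#   while sleep 30; do probe nas.example.lan >> "$HOME/nas-probe.log"; done
```

The timestamps can then be compared against the NAS's own kernel log and the rclone schedule.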

I've pasted some zpool output below. Details about the machine:

  • Repurposed old hardware, just built this Debian 12 NAS a couple months ago

  • Operates as backup destination for other machines

  • Operates as the media location for my Plex machine, the other server that mounts the NAS via NFS.

  • P6X58D-E LGA 1366 motherboard, Intel X5670 CPU, 18 GB (3x4GB, 3x2GB triple channel)

  • 8 hard drives connected to LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

  • 10GbE to a managed TP-Link switch through one port on a Mellanox ConnectX-3 MCX312A-XCBT EN

    ➜ sudo zpool list
    NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
    nvr   5.45T  3.35T  2.10T        -         -     2%    61%  1.00x    ONLINE  -
    tank  70.9T  34.4T  36.5T        -         -     0%    48%  1.00x    ONLINE  -

    ➜ sudo zpool status -v
      pool: nvr
     state: ONLINE
      scan: scrub repaired 0B in 08:49:40 with 0 errors on Sun Nov 12 09:13:41 2023
    config:

          NAME            STATE     READ WRITE CKSUM
          nvr             ONLINE       0     0     0
            mirror-0      ONLINE       0     0     0
              6T-75LN0J4  ONLINE       0     0     0
              6T-95A2PNV  ONLINE       0     0     0
    

    errors: No known data errors

      pool: tank
     state: ONLINE
      scan: scrub repaired 1M in 16:44:16 with 0 errors on Sun Nov 12 17:08:27 2023
    config:

          NAME              STATE     READ WRITE CKSUM
          tank              ONLINE       0     0     0
            raidz1-0        ONLINE       0     0     0
              12T-5PGJ4A0D  ONLINE       0     0     0
              12T-Z2J26EBT  ONLINE       0     0     0
              12T-5PGHSZJC  ONLINE       0     0     0
            raidz1-1        ONLINE       0     0     0
              14T-9KG38U5L  ONLINE       0     0     0
              14T-9KG81HRL  ONLINE       0     0     0
              14T-9RGG5ZDC  ONLINE       0     0     0
    

    errors: No known data errors
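
The tank scrub repaired 1M even though every error counter reads 0, which seems worth tracking over time. A minimal sketch for that: the helper below reads zpool status output and prints each pool whose last scrub repaired a non-zero amount. It assumes the pool: and scan: line layout that zpool status prints, and scrub_repairs is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read "zpool status" output on stdin and print
# "pool amount" for every pool whose last scrub repaired something.
scrub_repairs() {
    awk '
        $1 == "pool:" { pool = $2 }
        $1 == "scan:" && $2 == "scrub" && $3 == "repaired" && $4 != "0B" && $4 != "0" {
            print pool, $4
        }
    '
}

# Example use, e.g. from a daily cron job:
#   sudo zpool status | scrub_repairs
# zpool status -x gives a quicker overall check and only reports pools
# that currently have problems.
```

Against the output above this would report tank with 1M and stay silent about nvr.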

  • merkuron@alien.topB
    1 year ago

    I’ve had drive failures bring down entire systems. Replace sda and see if the problems continue.

    • Asinafuthimanahahfoo@alien.topOPB
      1 year ago

      Fair enough! Going to start with memtest, per another comment, and narrow things down one at a time - probably by removing sda next.