What should I monitor/log and how should I monitor/log to determine why my headless NAS is often becoming unavailable?
The problem:
- Another machine that depends on the NAS routinely has its services unavailable because the NFS mounts are no longer mounted.
- When that happens, sometimes running sudo mount -a recovers them.
- Other times, the NAS is not pingable, so I go to the physical host, plug in a monitor and keyboard, and find that I can’t log in. The login screen is frozen, requiring a hard reboot.
- Often when I leave a monitor attached (VGA), I come back to a screen that says:
critical medium error, dev sda, sector 163776752 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
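Since the box hard-freezes, the kernel messages from just before the hang are the first thing worth capturing. A minimal sketch, assuming the journal is persistent (so the previous boot survives a hard reboot) and that the error lines look like the one quoted above. The function name extract_medium_errors is made up for illustration:

```shell
#!/usr/bin/env bash
# Read kernel log text on stdin and print one "device sector" pair for each
# line that reports a medium error (format assumed from the message above).
extract_medium_errors() {
    sed -n 's/.*[Mm]edium [Ee]rror, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p'
}

# Usage (assumes the journal for the previous boot is still on disk):
#   journalctl -k -b -1 | extract_medium_errors | sort | uniq -c
```

If the same device and a small cluster of sectors keep showing up across boots, that points at the disk itself rather than at the network or NFS.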
I started a sudo smartctl -t long /dev/sda a few hours ago, and sometime since then, the server that depends on the NAS lost its NFS mounts again. This time a simple sudo mount -a resolved it.
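While the long self-test runs, the SMART attribute table can be checked for the counters that usually accompany medium errors. This is a rough sketch, assuming an ATA/SATA drive whose smartctl -A output uses the usual ten-column attribute table (SAS drives report differently). The function name smart_warnings and the choice of attributes are illustrative, not an official list:

```shell
#!/usr/bin/env bash
# Read `smartctl -A` output on stdin and print NAME=RAW for a few attributes
# that commonly indicate failing media, but only when the raw value is non-zero.
# Assumes the raw value is the 10th whitespace-separated column.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/ && ($10 + 0) > 0 { print $2 "=" $10 }'
}

# Usage:
#   sudo smartctl -A /dev/sda | smart_warnings
# The result of the long test itself shows up in:
#   sudo smartctl -l selftest /dev/sda
```

Running this from cron and appending the output to a log would show whether the counters are growing over time.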
What the NAS was also doing when it had the network blip:
- rclone was backing up to Backblaze B2
- Acting as NFS server for Plex/*arr media server
- Acting as NFS storage for Proxmox machine (but no VMs or CTs running)
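To line the outages up with what the NAS was doing at the time, the client could keep a timestamped record of whether the NAS answers ping and whether the mount is still present. A minimal sketch for the client side. The host name, mount point, and log path in the usage line are placeholders, and both function names are made up for illustration:

```shell
#!/usr/bin/env bash
# Succeed if mount point $1 is listed as an NFS mount in the mounts table.
# $2 is the table to read; it defaults to /proc/mounts.
nfs_mounted() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Print one timestamped status line for host $1 and mount point $2.
log_nas_status() {
    local host="$1" mp="$2" ping_state mount_state
    if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then ping_state=up; else ping_state=down; fi
    if nfs_mounted "$mp"; then mount_state=mounted; else mount_state=missing; fi
    printf '%s ping=%s mount=%s\n' "$(date -Iseconds)" "$ping_state" "$mount_state"
}

# Usage (placeholders), e.g. once a minute from cron:
#   log_nas_status nas.example.lan /mnt/media >> /var/log/nas-status.log
```

Comparing those timestamps with the NAS's own kernel log and the rclone schedule should show whether the drops coincide with heavy disk reads.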
Pasted some zpool output below. Details about the machine:
- Repurposed old hardware, just built this Debian 12 NAS a couple months ago
- Operates as backup destination for other machines
- Operates as media location for my Plex machine (the other server that mounts the NAS via NFS)
- P6X58D-E LGA 1366 motherboard, Intel X5670 CPU, 18 GB RAM (3x4GB, 3x2GB triple channel)
- 8 hard drives connected to LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
- 10GbE to managed TP-Link switch through one port on Mellanox ConnectX-3 MCX312A-XCBT EN
➜ sudo zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
nvr   5.45T  3.35T  2.10T        -         -     2%    61%  1.00x    ONLINE  -
tank  70.9T  34.4T  36.5T        -         -     0%    48%  1.00x    ONLINE  -

➜ sudo zpool status -v
  pool: nvr
 state: ONLINE
  scan: scrub repaired 0B in 08:49:40 with 0 errors on Sun Nov 12 09:13:41 2023
config:

        NAME            STATE     READ WRITE CKSUM
        nvr             ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            6T-75LN0J4  ONLINE       0     0     0
            6T-95A2PNV  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
  scan: scrub repaired 1M in 16:44:16 with 0 errors on Sun Nov 12 17:08:27 2023
config:

        NAME              STATE     READ WRITE CKSUM
        tank              ONLINE       0     0     0
          raidz1-0        ONLINE       0     0     0
            12T-5PGJ4A0D  ONLINE       0     0     0
            12T-Z2J26EBT  ONLINE       0     0     0
            12T-5PGHSZJC  ONLINE       0     0     0
          raidz1-1        ONLINE       0     0     0
            14T-9KG38U5L  ONLINE       0     0     0
            14T-9KG81HRL  ONLINE       0     0     0
            14T-9RGG5ZDC  ONLINE       0     0     0

errors: No known data errors
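One detail in the status output is worth tracking over time: the last scrub on tank repaired 1M, even though the error counters read zero. A small sketch that flags pools whose most recent scrub repaired anything, assuming the scan line keeps the "scrub repaired <amount>" wording shown in the output. The function name scrub_repairs is made up for illustration:

```shell
#!/usr/bin/env bash
# Read `zpool status` output on stdin and print "pool amount" for every pool
# whose last scrub repaired a non-zero amount of data.
scrub_repairs() {
    awk '
        $1 == "pool:" { pool = $2 }
        $1 == "scan:" && $2 == "scrub" && $3 == "repaired" && $4 != "0B" && $4 != "0" { print pool, $4 }
    '
}

# Usage:
#   sudo zpool status | scrub_repairs
```

Logging this after each scrub would show whether repairs recur, which would suggest a drive (or cable, or HBA port) that keeps returning bad data.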
I’ve had drive failures bring down entire systems. Replace sda and see if the problems continue.

Fair enough! Going to start with memtest, per another comment, and narrow things down one at a time - probably by removing sda next.