Fixing NFS/SMB Stale File Handle Errors in Proxmox: Automatic Detection Script

I recently wrote an article on setting up NFS on an unRAID server which I use extensively for bulk file storage in my homelab due to unRAID’s ease-of-use and easy storage expansion.

Occasionally though, something happens with the NFS share- either I have to reboot the unRAID server or under heavy I/O- and my NFS clients start reporting “stale file errors” on the NFS mounts. This problem is typically easily fixed with a lazy umount (umount -l /mnt/path/here) on both NFS and SMB.

On my VMs which run Ubuntu, I wrote the following script to automatically umount NFS mounts with stale file handles as a corn job. You can check out my code on Github:

#! /bin/bash
list=$(df 2>&1 | grep 'Stale file handle' | awk '{print ""$2"" }' | tr -d \:)
for directory in $list
do
	echo "Stale file handle on "$(date +"%D %T")" in directory: ""$directory"
	umount -l "$directory"
done

If you’re wondering why I haven’t included an additional mount statement after the umount it’s because I use autofs to automatically mount the NFS share only as needed.

Since Ubuntu is a Debian-based OS, I assumed this code would also work directly on my Proxmox host where I use NFS shares for backup and ISO storage (Proxmox is notorious for these stale file handles). Unfortunately, as you can guess since I’m writing this post, the above script does not appear to work on Proxmox, which I find surprising since the hypervisor uses Ubuntu’s closely related cousin, Debian Buster.

It appears that the reason the above script isn’t working is because, even though the NFS mount has a stale file handle error ls: cannot open directory '.': Stale file handle, df isn’t reporting it (which the script relies on):

I need to investigate further why df is not reporting the NFS stale file handle error in Proxmox. Stay tuned!

Update (6/17/20):

Submitted this question to the Unix StackExchange:

No response so far, but I am wondering if df truly thinks the mount is fine, since it is displaying disk usage statistics for the NFS mount (that doesn’t answer why that’s the case here in Proxmox vs in Ubuntu though). Since I may never receive an answer, I think it’s best to move on and prepare an alternative solution for Proxmox.

Research/Potential Solutions:

Alex, from the Proxmox forums, wrote the following script to combat this issue:

#!/bin/bash
# detect_stale.sh
for mountpoint in $(grep nfs /etc/mtab | awk '{print $2}')
do
	read -t1 < <(stat -t "$mountpoint" 2>&-)
	if [ -z "$REPLY" ] ; then
		echo "NFS mount stale. Removing..."
		umount -f -l "$mountpoint"​
	fi​
done

Additional reference that may come in handy: