I ran into a weird issue today: two hypervisors at a remote site kept dropping offline even though their VMs remained reachable. I eventually found that whenever I pinged another (unresponsive) address on that subnet, both hosts would respond to pings (and other requests) for ~5 minutes before dropping again.
We went back and forth with the network team and the site's local IT trying to figure out what was going on, but we didn't make much headway.
While waiting for the network team to have time/energy to investigate it further, I decided to implement one of those temporary fixes that will almost certainly become a permanent solution:
I spun up a quick Kubernetes pod to ping the magic address every 60 seconds until the end of time.
Those hosts haven't gone back down since!
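For anyone curious, here is a minimal sketch of the kind of loop such a pod could run. It is not the actual script or manifest from this setup: the address `192.0.2.1` is a placeholder from the documentation range, the function names are made up for illustration, and the `-c`/`-W` flags assume the Linux iputils `ping`.

```python
import subprocess
import time
from typing import Callable, List, Optional, Sequence

# Placeholder address (TEST-NET-1); the real "magic" address is an assumption here.
MAGIC_ADDRESS = "192.0.2.1"
INTERVAL_SECONDS = 60


def build_ping_command(address: str, timeout_seconds: int = 2) -> List[str]:
    """Build a single-packet ping command (assumes Linux iputils flags)."""
    return ["ping", "-c", "1", "-W", str(timeout_seconds), address]


def _run(command: Sequence[str]) -> int:
    """Run the command quietly and return its exit code."""
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def ping_once(address: str, run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Send one ping; True only if the target replied."""
    return run(build_ping_command(address)) == 0


def keepalive(
    address: str,
    interval: float = INTERVAL_SECONDS,
    iterations: Optional[int] = None,
    run: Callable[[Sequence[str]], int] = _run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Ping `address` every `interval` seconds.

    iterations=None loops forever; a number is handy for testing.
    Returns how many pings were sent.
    """
    sent = 0
    while iterations is None or sent < iterations:
        # The reply does not matter: the target is unresponsive anyway.
        # Only the act of sending the ping is what we are after.
        ping_once(address, run)
        sent += 1
        sleep(interval)
    return sent


# To run until the end of time:
# keepalive(MAGIC_ADDRESS)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without touching the network or actually waiting; in a container the defaults are all you would need.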