Strange VMware Troubleshooting Session, or Why You Should Follow Best Practices

The Setup

A client called in with an interesting problem. They had recently performed a planned outage due to a power issue at their premises, but upon powering everything back up one of their hosts was unable to access the iSCSI SAN volumes. Fortunately, they were able to bring up all of the VMs on the remaining hosts, but capacity was down enough that they had to disable HA and DRS.

Upon interviewing the client, it became clear that they had already checked all the usual suspects. Nothing in the configuration prior to the power cycling had changed. Switch configurations, cables, ports, port groups, iSCSI settings, etc. were all exactly as before. In desperation, they had even wiped and re-installed the host using ESXi, carefully re-configuring everything to match the working hosts in the hopes that this would resolve the issue. It did not.

The Non-Standard Configuration

We know that VMware and many other experts recommend separating your vSphere network into separate Storage, Management, vMotion, and Production VM networks, typically using VLANs. This client, however, had opted to stay with a flat network configuration with all port groups and vmkernel interfaces configured on the same IP subnet. While this isn’t a Best Practice, there really isn’t anything wrong with doing things like this. As long as the vmknics can talk to the SAN, all should work fine, right?

At some point in the past, however, the decision was made to place an air gap between the storage interfaces and the rest of the network. All of the storage-related physical interfaces were connected to a switch which was not uplinked to the rest of the network in an effort to isolate that traffic so it wouldn’t overload the Production VM and Management traffic. Again, this isn’t a Best Practice, but it should work (and it did for quite some time) configured this way.

Troubleshooting

Where to start? I first plugged my laptop into the storage switch and attempted to ping the IP addresses assigned to the iSCSI vmkernel ports on the troubled host. No pings were returned, yet I was able to successfully ping all other storage interfaces present on the switch. Also, as expected, I was unable to ping all the interfaces connected to the Production/Management switches — a quick check to make sure there wasn’t an uplink between them. This pretty much established what we already knew, but more importantly, laid the ground work for my next test.

Next, I used PuTTY to ssh in to the troubled host and perform vmkpings. Here, I noticed a pattern that led me to my conclusion. I was not able to ping the iSCSI interfaces of the SAN, but when I tried to ping IPs that I knew were only on the Production/Management switches, the pings were returned. This made it clear that the Storage network traffic for the troubled host was exiting the host via the physical interfaces connected to the Management/Production switches and not via the interfaces on the Storage switch.

So what was happening? Upon booting up, that host was binding its iSCSI software initiator to the Management vmkernel port and not the vmkernel ports uplinked to the Storage switch. Since all vmkernel ports are automatically enabled for iSCSI traffic and there is no way to disable iSCSI traffic on a vmkernel port, there was no way to force the iSCSI software initiator to bind to a particular vmkernel port — except to do things right and set them up on separate IP subnets/VLANs.

The Real Solution and The Work-Around

So, of course, the real solution is to re-work the storage network so that it uses a different IP subnet than the production network. This, however, requires planned down time to re-IP all the storage interfaces on all hosts and the SAN. Until that can be planned, they were still down a host and running without HA/DRS. On a hunch, I came up with the following work-around, which got the host back up and running until such time as the reconfiguration could take place:

  1. Power down the troubled host.
  2. Disconnect the network cables serving as uplinks for the Management vmkernel port.
  3. Power up the host, leaving those cables disconnected.
  4. Wait long enough to assure the host had completely booted into ESXi.
  5. Plug the Management vmkernel uplink ports back in.

This worked because the only vmkernel interfaces available while the server was booting were the ones connected to active uplinks — the ones connected to the Storage switch. Once that binding took place, it would not change so it was then safe to plug the Management vmkernel uplinks back in. Obviously not an ideal situation, but it did get the host back in service until a outage window can be scheduled to properly configure the Storage network interfaces.