After months of effort, I finally resolved the issues I had configuring a Cisco 9800 VM pair with High Availability (HA) on VMware ESXi. Cisco’s documentation, while comprehensive, falls short when it comes to distributed switching, a crucial component for anyone using VMware vSphere.
The primary issue I encountered was persistent failures of the HA redundancy ports, accompanied by keepalive error messages on the console. Initially, the setup would function for a few minutes or hours, but eventually the keepalive retry counter would max out, causing the HA pair to fail over.
Interestingly, when both virtual machines (VMs) were hosted on the same physical server, they operated flawlessly with identical settings. However, running two virtual Wireless LAN Controllers (WLCs) in HA on the same host defeats the purpose of redundancy, since a single host failure takes down both.
One of the Cisco documents did mention the following: "When testing two 9800-CL controllers in the same Cisco UCS® server and using RP ports for HA, it is not necessary to connect the physical RP mapped physical adapters at all. However, if active and standby 9800-CL controllers are on separate hypervisors, the RP mapped physical ports need to be connected to the network and must be Layer 2 adjacent and reachable by each other."
To test this, I set up standard vSwitches on two hosts and used VM affinity rules to pin the WLC-A and WLC-B VMs to those specific hosts. The vSwitch uplinks for the RP port group were cabled directly to each other, essentially creating a back-to-back cable between the two WLCs. The other ports (management and data) still use the distributed switch.
This alone didn’t fix the problem. Cisco TAC then suggested setting both the HA keepalive timer and the keepalive retries to 10.
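The exact snippet from TAC is not reproduced here. As a rough sketch, assuming the IOS XE `chassis redundancy keep-alive` commands on the 9800-CL (run from privileged EXEC mode, with the timer expressed in multiples of 100 ms), the change would look something like this:

```
! Assumed IOS XE syntax; verify against your release before applying.
! Run from privileged EXEC mode on the 9800-CL.
chassis redundancy keep-alive timer 10
chassis redundancy keep-alive retries 10
```

If the timer unit is indeed 100 ms, the peer is only declared lost after roughly 10 x 10 x 100 ms = 10 seconds of missed keepalives, which leaves far more tolerance for brief delays on the RP link between hosts.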

That solved the issue, and redundancy has not failed since! For reference, we are running Dell servers with VMware vSphere 8.0 U3 (latest patch). The fabric switches are Cisco Nexus with multiple 10 Gbit/s links.
If you would like to read the Cisco support forums post regarding this issue, it is here. My post is the last one, where I originally tried to complete the configuration with distributed switching but then reverted to standard vSwitches as per the TAC recommendation.