This was supposed to be a routine upgrade. Two paired sites, both running Site Recovery Manager (SRM) and vSphere Replication Appliance (VRA) on 8.x. Upgrade in place to 9.0.22, then run the convergence wizard to consolidate SRM and VR into a single VMware Live Recovery (VLR) 9.0.5 appliance per site. Apply our internal CA-signed certificates. Done.
The 9.0.22 upgrade was uneventful. Convergence to 9.0.5 was where the trouble started — not because convergence itself is hard, but because a cascade of certificate issues took two days to fully unpick. Some are EasyRSA + modern OpenSSL bumping into each other, some are undocumented appliance behaviour after cert replacement, and at least one is a genuine product bug in 9.0.5 that I’ll be raising with Broadcom.
What follows is the chronological account, with the workarounds that finally got everything working. If you’re staring at “vSphere Replication Management Server could not establish connection to vSphere Replication Server at ‘127.0.0.1:8123’”, or “Error downloading plug-in. No issuer certificate for certificate in certification path found”, or any of the other delightful errors I’m about to describe — this post is for you.
The setup
Two sites, paired for replication and recovery:
- vlr-a.example.com (10.0.0.10) — primary site
- vlr-b.example.com (10.0.0.20) — DR site
Internal CA via EasyRSA, signing all our internal certificates. The CA root (Internal Root CA) is published into vCenter’s TRUSTED_ROOTS store. Single SSO domain, vsphere.local, with admin user administrator@vsphere.local.
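For reference, publishing the root into TRUSTED_ROOTS can be done from the vSphere Client Certificates UI or from the VCSA shell with dir-cli. A minimal sketch of the CLI route, assuming the root PEM has been copied to /tmp on the VCSA (the filename is a placeholder):
# Publish the internal root into the TRUSTED_ROOTS store (prompts for SSO admin credentials)
/usr/lib/vmware-vmafd/bin/dir-cli trustedcert publish --cert /tmp/internal-root-ca.crt
# Push the updated store out to the local VECS instance
/usr/lib/vmware-vmafd/bin/vecs-cli force-refresh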
The starting state had SRM and VRA as separate appliances on 8.x at each site. The plan: upgrade to 9.0.22 (which keeps them as separate appliances), then run the converge workflow to combine each pair into a single VLR 9.0.5 appliance per site.
Phase 1: the in-place upgrade to 9.0.22
I won’t dwell on this part because it went fine. ISO mount, run the upgrade in VAMI, wait. The 8.x→9.0.22 in-place upgrade preserves the existing self-signed certs, the site pair stays connected, and replications resume on the other side of the maintenance window. Standard stuff.
The only thing worth flagging is that you should leave the certs alone during this phase. The convergence step that comes next is the right time to think about certs.
Phase 2: the convergence to 9.0.5
The official converge runbook (Broadcom KB 408127) describes the workflow:
- Deploy the new converged appliance with self-signed certs.
- Run the convergence (which migrates SRM + VRA state into the single appliance).
- Reconfigure the appliance via VAMI.
- Verify replication is healthy.
- Configure the isolated network (if you have one).
- Update hbrsrv-nic.xml if you’re using a separate replication NIC.
- Apply your custom certificate.
- Reconfigure again.
I followed this and convergence ran cleanly. Replication came back healthy on the self-signed certs. Then I applied our EasyRSA-signed certs via VAMI. That’s where things got interesting — and as I’d discover over the next two days, the issues that followed weren’t actually about cert timing. They were a stack of independent product behaviours that surface whenever you swap from VMware-generated self-signed certs to certs from an internal CA, regardless of when you do it.
Phase 3: the local 8123 errors begin
After applying custom certs, the Site Pair view in the vSphere Client showed:
vSphere Replication Management Server could not establish connection to vSphere Replication Server at ‘127.0.0.1:8123’.
On both sites. Plus a few more cryptic ones about remote VRMS at unknown:-1.
For context: in a converged VLR appliance, the Replication Management Server (HMS) and the Replication Server (hbrsrv) live on the same VM and talk to each other over loopback on port 8123 using mutual TLS. When that local mTLS handshake fails, you see this error. The error doesn’t tell you whether it’s a service issue, a network issue, or a cert issue — it just tells you the connection failed.
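A quick way to see which side of that local handshake is unhappy is to poke the endpoint directly from the appliance. A sketch, assuming hbrsrv is running and listening on loopback 8123:
# What does hbrsrv present, and does it ask the client for a certificate?
echo | openssl s_client -connect 127.0.0.1:8123 -showcerts 2>/dev/null | \
  grep -E "subject=|issuer=|client certificate|Verify return"
If the server requests a client cert (s_client prints the acceptable client CA names, or “No client certificate CA names sent”) and the connection still drops, the problem is on the client side of the mTLS exchange, which is exactly where this story is headed.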
Searching the exact error string yields KB 312713, which is about ESXi hosts in vCenter inventory having link-local IPv6 addresses (fe80::/10). On a freshly-converged appliance with no protected VMs configured yet, that wasn’t the issue. Wrong KB, useful only as a red herring.
KB 408127 (the converge guide) has a “common issues during convergence” section that mentions cert order, but doesn’t give you specific symptoms to match against. Useful for next time, less useful for now.
Phase 4: actually reading the log
/var/log/vmware/hbr/hbrsrv.log is the actual source of truth. A representative snippet from when HMS tried to connect locally:
2026-XX-XXTXX:XX:XX.XXX+08:00 verbose hbrsrv[XXXXX] [Originator@6876 sub=SessionManager opID=...] Logging by SSL certificate
2026-XX-XXTXX:XX:XX.XXX+08:00 warning hbrsrv[XXXXX] [Originator@6876 sub=Main opID=...] HbrError stack:
2026-XX-XXTXX:XX:XX.XXX+08:00 warning hbrsrv[XXXXX] [Originator@6876 sub=Main opID=...] [0] No client certificate
2026-XX-XXTXX:XX:XX.XXX+08:00 warning hbrsrv[XXXXX] [Originator@6876 sub=Main opID=...] [1] No SSL binding info for the client
2026-XX-XXTXX:XX:XX.XXX+08:00 warning hbrsrv[XXXXX] [Originator@6876 sub=Main opID=...] [2] Error converted to Vmomi fault hbr.replica.fault.NoClientCertificate
hbr.replica.fault.NoClientCertificate is the smoking gun. It means the TCP connection from HMS reached hbrsrv, the TLS handshake got to the CertificateRequest stage, and HMS sent nothing back. Not a wrong cert, not an untrusted cert — no cert at all.
The TLS stack will refuse to present a cert as a client identity if that cert isn’t valid for clientAuth. Time to look at our cert.
openssl x509 -in /etc/vmware/ssl/rui.crt -noout -text | grep -A1 "Extended Key Usage"
Output:
X509v3 Extended Key Usage:
TLS Web Server Authentication
There it is. Server auth only. No client auth. That’s why HMS refused to present it: it’s not valid for clientAuth, so HMS sends nothing, hbrsrv sees a connection with no cert, fault.
But hold on — the CSR I generated explicitly requested serverAuth, clientAuth in the EKU. Why was clientAuth missing from the issued cert?
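For reference, the CSR had been generated along these lines (a sketch, not the exact command; the key path and subject are placeholders, and -addext needs OpenSSL 1.1.1 or newer):
# Key + CSR that explicitly asks for both EKUs and the SANs
openssl req -new -newkey rsa:2048 -nodes \
  -keyout vlr-a.key -out vlr-a.req \
  -subj "/CN=vlr-a.example.com" \
  -addext "extendedKeyUsage=serverAuth,clientAuth" \
  -addext "subjectAltName=DNS:vlr-a.example.com,IP:10.0.0.10"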
Phase 5: the EasyRSA --copy-ext bug on OpenSSL 3
EasyRSA 3.0.x has a flag called --copy-ext that is supposed to copy SANs, EKU, and KU from the CSR through to the issued cert. By default, EasyRSA applies its own server template, which has only extendedKeyUsage = serverAuth. With --copy-ext, it should override that with whatever the CSR asked for.
In practice, on EasyRSA 3.0.8 with OpenSSL 3.0+ (i.e. modern Ubuntu 22.04+ and similar), --copy-ext is broken. It injects a config option (copy_extensions = copy) into the wrong section of the temp openssl config it generates, and OpenSSL 3.0 enforces config section validation strictly where 1.1.x silently let it pass. The result:
Error checking certificate extensions from extfile section default
...:error in extension:section=default, name=copy_extensions, value=copy
Easy-RSA error:
signing failed (openssl output above may have more detail)
The signing fails. EasyRSA 3.1+ apparently fixes this, but you can’t easily upgrade EasyRSA on a long-running CA host without disturbing the existing PKI.
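The underlying problem is where copy_extensions is allowed to live. It is an option of the openssl ca command and belongs in the CA section of the config, not in an X509v3 extensions section; when it turns up in an extensions/extfile section, OpenSSL 3 rejects it as an unknown extension name. A minimal illustration with hypothetical section names:
# copy_extensions is a CA option, so this placement is fine:
[ my_ca ]
# dir, database, serial, policy, ...
copy_extensions = copy
# It is not an X509v3 extension, so this placement is what OpenSSL 3 rejects:
[ default ]
copy_extensions = copy        # -> "error in extension: name=copy_extensions"
extendedKeyUsage = serverAuth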
Two workarounds:
Workaround 1 — use the serverClient template + EASYRSA_EXTRA_EXTS for SANs.
EasyRSA ships with a serverClient template (in /usr/share/easy-rsa/x509-types/serverClient) that has both serverAuth and clientAuth EKUs. This handles the EKU. For SANs, since --copy-ext is broken, inject them via the env var:
EASYRSA_EXTRA_EXTS="subjectAltName=DNS:vlr-a.example.com,IP:10.0.0.10" \
./easyrsa --vars=./vars sign-req serverClient vlr-a
This works on 3.0.8 + OpenSSL 3.0 because it bypasses the --copy-ext codepath entirely. The EKU/KU come from the template, the SANs come from the env var, and the issued cert has everything you need.
Workaround 2 — sign with raw openssl, bypassing EasyRSA.
Build an extensions file by hand and call openssl x509 -req directly against your CA cert and key:
cat > /tmp/vlr-a.ext <<'EOF'
basicConstraints = CA:FALSE
subjectKeyIdentifier = hash
authorityKeyIdentifier = keyid,issuer:always
extendedKeyUsage = serverAuth,clientAuth
keyUsage = critical,digitalSignature,keyEncipherment,dataEncipherment
subjectAltName = DNS:vlr-a.example.com,IP:10.0.0.10
EOF
openssl x509 -req \
-in pki/reqs/vlr-a.req \
-CA pki/ca.crt \
-CAkey pki/private/ca.key \
-CAcreateserial \
-days 3285 \
-sha256 \
-extfile /tmp/vlr-a.ext \
-out pki/issued/vlr-a.crt
Trade-off: the signed cert isn’t tracked in EasyRSA’s index.txt, so revocation later via ./easyrsa revoke won’t work for these. For internal certs you’ll rotate every couple of years, that’s usually fine.
Either way, verify the issued cert before uploading anything:
openssl x509 -in pki/issued/vlr-a.crt -noout -text | grep -A1 -E "Key Usage|Subject Alternative"
You’re looking for:
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Subject Alternative Name:
DNS:vlr-a.example.com, IP Address:10.0.0.10
If clientAuth is missing, your cert won’t work for VLR’s mTLS. Re-sign before going anywhere near the appliance VAMI.
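If you want a pre-flight check you can drop into a script, something like this works (a sketch; adjust the path to wherever your issued cert lives, and note that -ext needs OpenSSL 1.1.1 or newer — fall back to -text plus grep otherwise):
# Fail fast if the issued cert is missing the clientAuth EKU
CERT=pki/issued/vlr-a.crt
if openssl x509 -in "$CERT" -noout -ext extendedKeyUsage | grep -q "TLS Web Client Authentication"; then
  echo "OK: $CERT has clientAuth"
else
  echo "FAIL: $CERT is missing clientAuth - re-sign before uploading" >&2
  exit 1
fi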
Phase 6: thumbprint drift after cert replacement
Re-signed cert applied via VAMI, services restarted, ran Reconfigure. The 8123 errors changed:
[0] Invalid login
[1] Unknown client SSL PEM Certificate
[2] Error converted to Vmomi fault hbr.replica.fault.InvalidLogin
Different fault, different KB. This is KB 404040 territory: hbrsrv and HMS have their own internal trust state stored as VMware guestinfo entries, and after a cert replacement those values can drift apart. HMS presents its current cert; hbrsrv expects a thumbprint that’s stale; rejection.
The two values to check:
/opt/vmware/hbr/bin/hbrsrv-guestinfo.sh get guestinfo.hbr.hbrsrv-thumbprint
/opt/vmware/hbr/bin/hbrsrv-guestinfo.sh get guestinfo.hbr.hms-thumbprint
Path note: KB 404040 says /usr/bin/hbrsrv-guestinfo.sh. On the converged 9.0.5 appliance, the helper script lives at /opt/vmware/hbr/bin/hbrsrv-guestinfo.sh. Older paths give “command not found.” If you’re following the KB literally and getting confused, this is why.
If the values differ, you can manually align them:
/opt/vmware/hbr/bin/hbrsrv-guestinfo.sh set guestinfo.hbr.hms-thumbprint <hbrsrv-thumbprint-value>
systemctl restart hbrsrv
In my case, doing this manually was a temptation but not the right move. Site A had matching thumbprints and was healthy. Site B had two different EasyRSA-signed certs in the picture (one from the early serverAuth-only signing, one from the later corrected signing), and the guestinfo values reflected that confusion. The right fix wasn’t to manually align the thumbprints to whatever was loaded — it was to apply the correct cert cleanly via VAMI and let Reconfigure re-establish the trust state.
Lesson: manual hbrsrv-guestinfo.sh set is a sharp tool. If you align to a stale thumbprint, you’ve just locked in the wrong cert as the source of truth. Always check what hbrsrv is actually serving on the wire (s_client -connect 127.0.0.1:8123 -showcerts) and reconcile the guestinfo values to that, not the other way around. Better still: run a clean Reconfigure and let the supported workflow do it for you.
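If you do need to eyeball it, a sketch of that wire-versus-guestinfo comparison, run on the appliance (my guestinfo values were colon-separated SHA-1 fingerprints; use -sha256 instead if yours are the longer form):
# Thumbprint of whatever hbrsrv is actually presenting on loopback 8123
echo | openssl s_client -connect 127.0.0.1:8123 2>/dev/null | \
  openssl x509 -noout -fingerprint -sha1
# Stored trust state - compare against the value above
/opt/vmware/hbr/bin/hbrsrv-guestinfo.sh get guestinfo.hbr.hbrsrv-thumbprint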
Phase 7: the cross-site reconnect
Local mTLS sorted, time to reconnect the site pair. Click Reconnect, get:
Operation Failed. A generic error occurred in the vSphere Replication Management Server. Exception details: ‘The client did not supply a certificate, or the server is not configured to support client certificates’.
Same root cause as Phase 4, just over the WAN this time. The cross-site mTLS handshake also requires both sides to present clientAuth-capable certs. If either side’s cert lacks clientAuth, the handshake fails identically.
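Before retrying Reconnect, it’s worth confirming that both appliances now present clientAuth-capable certs. A sketch, assuming each appliance serves its custom cert on 443 as well as on the VR endpoints:
for host in vlr-a.example.com vlr-b.example.com; do
  echo "== $host =="
  echo | openssl s_client -connect "$host:443" 2>/dev/null | \
    openssl x509 -noout -ext extendedKeyUsage
done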
This was mostly already fixed by re-signing properly in Phase 5, but I had to re-apply on both appliances. After both sides had certs with serverAuth + clientAuth EKU, Reconnect succeeded and the Remote VR connection flipped to “Connected” on both sides.
I thought I was done. I was not done.
Phase 8: the vCenter plug-in mystery
Replications were running. Site Pair was happy. But every time I logged into the vSphere Client, one of two things happened:
- The Live Site Recovery plug-in failed to download with: “Error downloading plug-in. Make sure that the URL is reachable and the registered thumbprint is correct. No issuer certificate for certificate in certification path found.”
- Or the plug-in loaded once on a fresh upload, then on logout it would print “Plugin VMware Live Site Recovery:9.0.5.100 has been successfully undeployed” and the next login would fail to re-download.
I had already added the EasyRSA root CA to vCenter’s TRUSTED_ROOTS via the Certificates UI. So the trust should have worked. The error said “No issuer certificate” though, which is Java’s PKIX validator language for “I have a leaf cert that says it was issued by X, but I can’t find X anywhere I’m looking.”
vsphere-ui (the HTML5 client) runs as a Java/Tomcat process. It uses the JVM’s SSL stack, which is strict about chain construction. Modern Java SSL validation expects the server to present the full chain in the TLS handshake. If the server only sends the leaf, even if a trusted root exists in the truststore, the validator can fail to construct the path. This is by-design behaviour in newer JDKs.
So: did the appliance actually serve the chain? Quick test from a third box (not vCenter, not the appliance — somewhere with no preloaded trust):
echo | openssl s_client -connect vlr-a.example.com:443 -showcerts 2>/dev/null | grep -c "BEGIN CERT"
Returned 1. The appliance was serving only the leaf cert. No chain.
When I’d uploaded the cert via VAMI, I’d put the leaf in the cert field and the CA in the chain field, exactly as the UI suggested. VAMI accepted both. But what was actually being served was just the leaf.
Phase 9: where Envoy gets its cert from
VLR 9.0.5 uses Envoy as the front-end proxy for HTTPS traffic on port 443. The config lives at /opt/vmware/etc/envoy/envoy-proxy.listeners.yaml. The relevant block:
- certificate_chain:
    filename: /run/credentials/envoy-proxy.service/cert.pem
  private_key:
    filename: /run/credentials/envoy-proxy.service/key.pem
/run/credentials/envoy-proxy.service/ is a tmpfs created by systemd’s LoadCredential= mechanism. The actual file is injected by systemd at service start from a source elsewhere, and recreated on every restart. Editing the file in /run/credentials/ is futile because it gets blown away.
What’s the source? systemctl cat envoy-proxy.service reveals:
LoadCredential=cert.pem:/opt/vmware/etc/keys/ssl/cert-chain.pem
LoadCredential=key.pem:/opt/vmware/etc/keys/ssl/key.pem
So Envoy reads from /run/credentials/envoy-proxy.service/cert.pem, which systemd populates from /opt/vmware/etc/keys/ssl/cert-chain.pem at service start. That’s the source file.
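You can confirm the relationship by comparing the LoadCredential source with what systemd has staged into the credentials directory. A quick sketch:
# The staged copy should be byte-identical to the source; editing the copy
# in /run/credentials/ is pointless because a restart regenerates it
sha256sum /opt/vmware/etc/keys/ssl/cert-chain.pem \
          /run/credentials/envoy-proxy.service/cert.pem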
Check what’s in it:
$ grep -c "BEGIN CERT" /opt/vmware/etc/keys/ssl/cert-chain.pem
1
The file is named cert-chain.pem but contains only the leaf. The VAMI Certificates → Change workflow accepts both a certificate and a CA chain in separate fields, but doesn’t actually concatenate the chain into the file Envoy serves from. Envoy serves the leaf only. Java’s chain validator fails. Plug-in download fails.
This is a real product bug in 9.0.5.
Phase 10: the workaround
Append the CA to the chain file and restart Envoy:
# Backup first
cp /opt/vmware/etc/keys/ssl/cert-chain.pem /opt/vmware/etc/keys/ssl/cert-chain.pem.bak
# Get your CA cert onto the appliance (e.g. via scp into /tmp/, or paste
# the PEM content into a heredoc):
tee /tmp/internal-root-ca.crt > /dev/null <<'EOF'
-----BEGIN CERTIFICATE-----
... your pki/ca.crt content ...
-----END CERTIFICATE-----
EOF
# Append it to the chain file
bash -c 'cat /tmp/internal-root-ca.crt >> /opt/vmware/etc/keys/ssl/cert-chain.pem'
# Verify - should now be 2, with leaf first then root
grep -c "BEGIN CERT" /opt/vmware/etc/keys/ssl/cert-chain.pem
openssl crl2pkcs7 -nocrl -certfile /opt/vmware/etc/keys/ssl/cert-chain.pem | \
openssl pkcs7 -print_certs -noout | grep "subject="
# Restart Envoy to reload the credential
systemctl restart envoy-proxy
Verify externally from a third box:
echo | openssl s_client -connect vlr-a.example.com:443 -showcerts 2>/dev/null | grep -c "BEGIN CERT"
# Expected: 2
echo | openssl s_client -connect vlr-a.example.com:443 -showcerts 2>/dev/null | grep -E "subject=|issuer="
# Expected: leaf subject, leaf issuer (=CA), CA subject, CA issuer (=CA, self-signed)
Then on vCenter, restart vsphere-ui to clear any cached failed-validation state:
service-control --stop vsphere-ui && service-control --start vsphere-ui
This takes 3–5 minutes to fully come back. During the restart, the HTML5 client is unavailable; vCenter itself stays up.
Once it was back, I logged into the vSphere Client. The Live Site Recovery plug-in loaded cleanly. Logged out, logged back in — still loaded. Logged in from a fresh incognito session — still loaded. After a day of “now you see it, now you don’t,” watching it stick across sessions was deeply satisfying.
Same fix on the other site, then re-tested cross-site Reconnect from the Site Pair view. Everything green.
The persistence problem
There’s a catch with the Envoy chain workaround: /opt/vmware/etc/keys/ssl/cert-chain.pem gets regenerated when the VAMI cert workflow runs. So if you (or anyone else) ever runs Certificates → Change again, or runs Reconfigure in a way that touches certs, the appended CA is gone and you’re back to leaf-only on the wire.
Two ways to handle this:
- Documentation. Add this manual append step to your runbook for any cert work on VLR 9.0.5. It’s a 3-command fix once you know it’s needed. The hard part is knowing.
- Patch the script that generates cert-chain.pem. Somewhere in /opt/vmware/sbin/ (or similar) there’s a script that VAMI calls to populate this file from the uploaded cert and chain. If you find it, you can patch it to actually concatenate the chain. More invasive, more durable, breaks if the next product update changes the script.
I went with option 1 for now and have a Broadcom support case open asking for a proper fix along the lines of option 2. If Broadcom fix it in 9.0.6 or 9.0.7, the workaround becomes unnecessary.
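For option 1, a one-liner in the runbook (or a scheduled check) makes the regression obvious the moment someone re-runs the cert workflow. A sketch, assuming a single-level internal CA so the healthy file contains two certificates:
# Warn if the chain file has reverted to leaf-only
[ "$(grep -c 'BEGIN CERT' /opt/vmware/etc/keys/ssl/cert-chain.pem)" -ge 2 ] || \
  echo "WARNING: cert-chain.pem is leaf-only - re-append the CA and restart envoy-proxy"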
The full lessons-learnt list
1. The issues described here aren’t sequence-dependent. The KB describes a workflow but doesn’t actually require that order. Whether you apply the cert before or after Reconfigure, the bugs below are independent of timing — they fire whenever you swap from VMware-generated self-signed to a cert from an internal CA.
2. Validate cert EKU before applying anywhere. VLR uses mTLS in places you wouldn’t expect (HMS↔hbrsrv on localhost, cross-site VR registration). The cert needs both serverAuth AND clientAuth. A cert with only serverAuth will produce hbr.replica.fault.NoClientCertificate errors that cascade everywhere. Always run openssl x509 -text on the issued cert and grep for Extended Key Usage.
3. EasyRSA 3.0.8 + OpenSSL 3.0+ has a --copy-ext bug. Don’t rely on --copy-ext to bring extensions through from your CSR. Use the serverClient template (which has both EKUs in its template) and inject SANs via EASYRSA_EXTRA_EXTS. Or sign directly with openssl x509 -req and a hand-built extensions file. EasyRSA 3.1+ fixes this, but upgrading a long-running CA host is its own adventure.
4. The hbrsrv-guestinfo.sh path changed in VLR 9.0.5. It’s at /opt/vmware/hbr/bin/hbrsrv-guestinfo.sh, not /usr/bin/hbrsrv-guestinfo.sh like older Broadcom KBs (notably KB 404040) say. The mechanism is the same.
5. Be careful manually setting hbrsrv guestinfo values. The temptation to align hms-thumbprint to hbrsrv-thumbprint (or vice versa) is strong when they don’t match. But if you align to a stale value, you’ve locked in the wrong cert as the source of truth. Always check what hbrsrv is actually serving (openssl s_client -connect 127.0.0.1:8123 -showcerts) and reconcile the stored values to that. Better: run a clean Reconfigure and let the supported workflow handle it.
6. VLR 9.0.5 has a chain-propagation bug, and it doesn’t matter when you apply the cert. VAMI accepts the CA chain in a separate field but doesn’t write it into /opt/vmware/etc/keys/ssl/cert-chain.pem, which is what Envoy actually serves from. After every cert workflow — regardless of whether you followed the runbook order to the letter — manually append your CA to that file and restart envoy-proxy. Until Broadcom fixes this, it’s a permanent step on every cert change.
7. Test the chain on the wire after every cert change. A simple echo | openssl s_client -connect host:443 -showcerts | grep -c "BEGIN CERT" would have flagged the chain issue immediately and saved hours. Run it from a third host (not vCenter, not the appliance), because both have local trust quirks that mask the actual handshake.
8. vsphere-ui caches failed cert validation results. After any cert work that affects VLR’s chain, restart vsphere-ui on vCenter (service-control --stop vsphere-ui && service-control --start vsphere-ui) to clear stale state. Otherwise you’ll see plug-in download failures even after the underlying issue is fixed.
9. The plug-in download path is special. It’s not just generic Java HTTPS — vsphere-ui specifically checks the cert presented by the plug-in URL against a registration thumbprint and validates the chain. If the registered thumbprint goes stale (e.g. because you’ve replaced the appliance cert), even a perfectly-served chain won’t help; you also need to re-register the plug-in via Reconfigure on the VLR side.
10. Have a clean test box. Both vCenter and the appliance have trust paths and caches that will lie to you. Pick a workstation or jumpbox that has no preloaded trust for any of this gear, and use it for s_client testing. Telling truth from cached truth is half the diagnosis.
Closing thoughts
The convergence to VLR 9.0.5 itself is genuinely a good move. One appliance per site instead of two, cleaner upgrade paths, less inter-process trust to worry about. The product works fine on self-signed certs. The bumpy ride above is entirely about applying custom CA-signed certs into the converged appliance, and the mismatch between what VAMI accepts and what Envoy serves.
If you’re on commercial certs from a public CA, you’ll hit none of this — your CA chain is in every browser and JDK on Earth, and the fact that the appliance only serves the leaf doesn’t matter because Java can find the issuer in its own bundled cacerts. The chain bug is invisible. It’s only visible when you’re using an internal CA whose root isn’t already trusted by the JDK that vsphere-ui ships with.
For the next person hitting “Error downloading plug-in. No issuer certificate for certificate in certification path found” with an internal CA on VLR 9.0.5: it’s almost certainly the chain not being on the wire. Run the s_client | grep -c "BEGIN CERT" test from a third host. If it’s 1, append your CA to /opt/vmware/etc/keys/ssl/cert-chain.pem and restart envoy-proxy. That should be all of it.