Handle NICs without VM associations gracefully by cgiradkar · Pull Request #10164 · kubernetes-sigs/cloud-provider-azure

cgiradkar · 2026-04-08T16:18:50Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Load balancer creation fails completely when any node in the cluster has a NIC that is not yet fully associated with a VM. This occurs during normal cluster operations when:

- New node pools are scaling up: VMs take time to fully provision and associate with their NICs
- Quota or capacity constraints: VM creation is delayed while Azure allocates resources
- Any VM provisioning delay: network, region capacity, or transient API issues

Impact:

- Load balancer services cannot be created until all VMs are fully provisioned
- Blocks application deployment that requires load balancer services
- No automatic recovery: LB creation remains blocked even after VMs finish provisioning (requires manual intervention or CCM restart)
- Affects both VMSS and VMSS Flex deployments

Expected Behavior:

The load balancer should be created with currently-available nodes, and automatically add delayed nodes once their VMs are fully provisioned (on subsequent sync cycles).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The change modifies error handling behavior in several key areas:

- GetVMNameByIPConfigurationName: returns ("", nil) instead of an error when a NIC has no VM association
- getVMManagementTypeByIPConfigurationID: returns new ManagedByNoVM type when vmName is empty
- GetNodeNameByIPConfigurationID (VMSS): checks for ManagedByNoVM and returns ("", "", nil) early
- getNodeInformationByIPConfigurationID (Flex): checks for empty vmName and returns early with logging
- Backend pool reconciliation: guards against empty nodeName in two locations:
  - ReconcileBackendPools: skips processing
  - GetBackendPrivateIPs: skips IP collection
- ensureBackendPoolDeleted: skips nodes with empty nodeName
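
The error-handling change listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `nic` struct, `vmNameForNIC`, and `backendNodeNames` are hypothetical stand-ins for the Azure SDK types and provider functions, the resource ID uses placeholder values, and VM name extraction is simplified to taking the last path segment of the ID.

```go
package main

import (
	"fmt"
	"strings"
)

// nic is a hypothetical, simplified stand-in for an Azure network interface.
// VMID is nil while the NIC is not (yet) associated with a VM.
type nic struct {
	Name string
	VMID *string
}

// vmNameForNIC returns the VM name for a NIC, or ("", nil) when the NIC has
// no VM association. A malformed ID is still reported as an error.
func vmNameForNIC(n nic) (string, error) {
	if n.VMID == nil {
		return "", nil
	}
	idx := strings.LastIndex(*n.VMID, "/")
	if idx < 0 || idx == len(*n.VMID)-1 {
		return "", fmt.Errorf("invalid vm ID %q on nic %s", *n.VMID, n.Name)
	}
	return (*n.VMID)[idx+1:], nil
}

// backendNodeNames collects node names for the given NICs and skips NICs
// without a VM, so one unattached NIC does not fail the whole operation.
func backendNodeNames(nics []nic) ([]string, error) {
	var names []string
	for _, n := range nics {
		name, err := vmNameForNIC(n)
		if err != nil {
			return nil, err
		}
		if name == "" {
			continue // NIC without VM: can be picked up on a later sync
		}
		names = append(names, name)
	}
	return names, nil
}

func main() {
	// Placeholder resource ID, for illustration only.
	vmID := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/node-1"
	names, err := backendNodeNames([]nic{
		{Name: "node-1-nic", VMID: &vmID},
		{Name: "node-2-nic"}, // VM still provisioning
	})
	fmt.Println(names, err)
}
```

In this sketch, only the "no VM yet" case is turned into an empty name; a malformed ID still aborts the call, which keeps genuine failures visible.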

Added comprehensive test coverage:

- New test case in TestGetNodeNameByIPConfigurationID for NIC without VM
- New test case in TestEnsureBackendPoolDeleted for orphaned NIC in backend pool
- New test case in TestGetNodeNameByIPConfigurationIDVmssFlex for orphaned NIC
- Updated TestGetVMManagementTypeByIPConfigurationID to expect ManagedByNoVM instead of error

Does this PR introduce a user-facing change?

Handle NICs without VM associations gracefully in VMSS and VMSS Flex code paths

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

linux-foundation-easycla · 2026-04-08T16:18:59Z

The committers listed above are authorized under a signed CLA.

✅ login: cgiradkar / name: Chetan Giradkar (c6a74a2)

k8s-ci-robot · 2026-04-08T16:19:01Z

Welcome @cgiradkar!

It looks like this is your first PR to kubernetes-sigs/cloud-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cloud-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-04-08T16:19:02Z

Hi @cgiradkar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-04-08T16:19:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cgiradkar
Once this PR has been reviewed and has the lgtm label, please assign elmiko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cgiradkar · 2026-04-08T16:36:07Z

/easycla

cgiradkar · 2026-04-09T09:54:22Z

@feiskyer @nilo19 @andyzhangx would you be able to have a look at this PR please?

nilo19 · 2026-04-10T00:51:28Z

will check.

cgiradkar · 2026-04-10T14:42:32Z

@nilo19 can I get an ok-to-test in the meantime? This bug is affecting my code, so I'd be thankful if you could review this small PR soon. Thanks

marek-veber

The fix is directionally correct and solves a real user-facing issue. I'd want to see:

Logging added to the VMSS paths for parity with VMSS Flex
At least one test per new branch (especially the ensureBackendPoolDeleted skip)
A quick audit of other GetNodeNameByIPConfigurationID callers

Concerns:

Insufficient test coverage. Only the existing TestGetVMManagementTypeByIPConfigurationID test was updated. The new behavior in these functions lacks test cases:
- GetNodeNameByIPConfigurationID — no test for the ManagedByUnknownVMSet early return
- ensureBackendPoolDeleted — no test for the nodeName == "" skip
- getNodeInformationByIPConfigurationID — no test for the empty vmName path
Callers of GetNodeNameByIPConfigurationID beyond ensureBackendPoolDeleted. The PR adds an empty-string return for ManagedByUnknownVMSet, and ensureBackendPoolDeleted handles it with a continue. But if there are other callers of GetNodeNameByIPConfigurationID elsewhere in the codebase, they may not expect ("", "", nil) and could misbehave (e.g., passing an empty node name to a downstream function). This should be audited.
ManagedByUnknownVMSet semantics are overloaded. Before this PR, ManagedByUnknownVMSet with a non-nil error meant "something went wrong." Now it's also used (with nil error) to mean "NIC exists but no VM yet." This dual meaning could confuse future readers. A dedicated constant like ManagedByNoVM or a comment clarifying the nil-error variant would improve clarity.

marek-veber · 2026-04-14T12:45:28Z

 			continue
 		}
+		if nodeName == "" {
+			continue


IMHO: a logger.Info(...) call would be useful here.

Added logs accordingly.

marek-veber · 2026-04-14T12:45:41Z

 	}

+	if vmManagementType == ManagedByUnknownVMSet {
+		return "", "", nil


IMHO: a logger.Info(...) call would be useful here.

Added logs accordingly.

cgiradkar · 2026-04-14T15:11:27Z

[✓] Add logging
[✓] Add more unit tests for coverage over new code
[✓] Audit callers of GetNodeNameByIPConfigurationID for empty-string safety
[✓] Improve ManagedByUnknownVMSet semantics

cblecker · 2026-04-15T15:35:30Z

/ok-to-test

This change addresses scenarios where NICs exist but have no attached VM, preventing errors during load balancer operations. The fix returns empty node names instead of errors, allowing calling code to skip these entries.

Key changes:
- Return empty string from GetVMNameByIPConfigurationName for unattached NICs
- Add ManagedByUnknownVMSet handling in GetNodeNameByIPConfigurationID paths
- Add logging (V2) when skipping nodes due to missing VM associations
- Guard all GetNodeNameByIPConfigurationID callers against empty node names
- Add comprehensive test coverage for all new code paths (VMSS, Flex, callers)
- Document ManagedByUnknownVMSet dual semantics (error vs orphaned NIC)

Fixes issue where backend pool reconciliation would fail when encountering NICs created but not yet attached to VMs during cluster scaling operations.

Signed-off-by: Chetan Giradkar <cgiradka@redhat.com>

anndono · 2026-04-16T01:45:04Z

/retest

cgiradkar · 2026-04-22T14:11:28Z

@nilo19 @cheftako can you PTAL

nilo19 · 2026-05-10T02:28:03Z

+	// Return empty when VM association is missing
 	if nic.Properties == nil || nic.Properties.VirtualMachine == nil || nic.Properties.VirtualMachine.ID == nil {
-		return "", fmt.Errorf("failed to get vm ID of nic %s", ptr.Deref(nic.Name, ""))
+		return "", nil


Do you manually create the NIC? When would the NIC have no VM association? If this were a problem, we would have received a lot of tickets, but we haven't. May I know if there is a special use case on your side?

A customer had reported that when creating a Kubernetes service of type LoadBalancer, the operation fails if any VMs in the node pool are still provisioning or stuck in creation. The expectation is that the load balancer should skip VMs that are not yet ready and add them later once they complete provisioning, rather than failing the entire operation.

SyncLoadBalancerFailed Error syncing load balancer: failed to ensure load balancer: failed to get vm name by ip config ID /subscriptions/.../xxxx-xxxx-xxxx-nic/ipConfigurations/pipConfig: failed to get vm ID of xxxx-xxxx-xxxx-nic

Steps to Reproduce

Create an ARO-HCP cluster

Create a node pool that requests VMs exceeding existing quota (to simulate delayed/stuck provisioning)

Wait until node provisioning gets stuck or delayed

Attempt to provision a service of type LoadBalancer

If this is the case, it should be fixed automatically in the next reconcile, once the VM is ready.

BTW, what is an ARO-HCP cluster? Is this issue only in ARO-HCP clusters?

ARO-HCP (https://github.com/Azure/ARO-HCP) uses the standard OpenShift cloud-controller-manager, which is based on cloud-provider-azure.
This issue will occur in any cluster using cloud-provider-azure.

If this is the case, it should be fixed automatically in the next reconcile, once the VM is ready.

In this specific case, the whole array of VMs is never going to be provisioned (due to the constraints described above), so waiting for the next cycle won't help: the state didn't change, so no further processing would follow from this file.
This PR makes the handling of VM state transitions (among an array of VMs to be provisioned) more granular, even if the re-sync duration is bearable.

But if there is a genuine issue from the NRP/NIC side, this change will silently hide the error.

Would it be an acceptable solution if we log it and proceed? This use case is an edge case, but it is still encountered in production.

I think it would be more reasonable to solve the root cause, so that the CCM reconcile is unblocked. This change introduces a behavior change to solve an edge case, and I'm not sure it's worth it, as it may introduce other issues.

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Apr 8, 2026

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 8, 2026

k8s-ci-robot requested a review from cheftako April 8, 2026 16:19

k8s-ci-robot requested a review from nilo19 April 8, 2026 16:19

github-actions Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 8, 2026

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 8, 2026

cgiradkar marked this pull request as ready for review April 8, 2026 16:36

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026

k8s-ci-robot requested review from elmiko and feiskyer April 8, 2026 16:36

cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from 7b65272 to f55ea6c Compare April 8, 2026 17:44

marek-veber suggested changes Apr 14, 2026

cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from f55ea6c to a9aac12 Compare April 14, 2026 15:06

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 14, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 15, 2026

cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from a9aac12 to 0865d71 Compare April 15, 2026 15:43

cgiradkar requested a review from marek-veber April 15, 2026 15:43

cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from 0865d71 to c6a74a2 Compare April 15, 2026 17:07

nilo19 reviewed May 10, 2026
