Handle NICs without VM associations gracefully#10164

Open
cgiradkar wants to merge 1 commit into
kubernetes-sigs:masterfrom
cgiradkar:fix/handle-missing-nic-vm-association

Conversation

@cgiradkar cgiradkar commented Apr 8, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Load balancer creation fails completely when any node in the cluster has a NIC that is not yet fully associated with a VM. This occurs during normal cluster operations when:

  1. New node pools are scaling up - VMs take time to fully provision and associate with their NICs
  2. Quota or capacity constraints - VM creation is delayed while Azure allocates resources
  3. Any VM provisioning delay - Network, region capacity, or transient API issues
Impact:
  • Load balancer services cannot be created until all VMs are fully provisioned
  • Blocks application deployment that requires load balancer services
  • No automatic recovery - LB creation remains blocked even after VMs finish provisioning (requires manual intervention or CCM restart)
  • Affects both VMSS and VMSS Flex deployments
Expected Behavior:

The load balancer should be created with currently-available nodes, and automatically add delayed nodes once their VMs are fully provisioned (on subsequent sync cycles).
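The expected behavior can be sketched as a per-sync partition of IP configurations into "join now" and "deferred". This is a minimal illustration only: `syncBackendPool` and the map-based state are hypothetical and do not exist in cloud-provider-azure.

```go
package main

import (
	"fmt"
	"sort"
)

// syncBackendPool is a hypothetical helper, not a cloud-provider-azure function.
// The map value is the VM name resolved for each IP configuration;
// an empty string means no VM is attached to the NIC yet.
func syncBackendPool(vmByIPConfig map[string]string) (members, deferred []string) {
	for ipConfigID, vmName := range vmByIPConfig {
		if vmName == "" {
			// Not ready: leave it out now, pick it up on a later sync.
			deferred = append(deferred, ipConfigID)
			continue
		}
		members = append(members, ipConfigID)
	}
	// Map iteration order is random, so sort for stable results.
	sort.Strings(members)
	sort.Strings(deferred)
	return members, deferred
}

func main() {
	state := map[string]string{"nic-a/ipconfig": "vm-a", "nic-b/ipconfig": ""}
	members, deferred := syncBackendPool(state)
	fmt.Println("sync 1:", members, deferred)

	// The delayed VM finishes provisioning before the next sync cycle.
	state["nic-b/ipconfig"] = "vm-b"
	members, deferred = syncBackendPool(state)
	fmt.Println("sync 2:", members, deferred)
}
```

The point of the sketch is that each sync is computed from the current state, so a deferred node joins the pool without any manual step once its VM appears.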

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The change modifies error handling behavior in several key areas:

  1. GetVMNameByIPConfigurationName - Returns ("", nil) instead of an error when a NIC has no VM association
  2. getVMManagementTypeByIPConfigurationID - Returns new ManagedByNoVM type when vmName is empty
  3. GetNodeNameByIPConfigurationID (VMSS) - Checks for ManagedByNoVM and returns ("", "", nil) early
  4. getNodeInformationByIPConfigurationID (Flex) - Checks for empty vmName and returns early with logging
  5. Backend pool reconciliation - Guards against empty nodeName in two locations:
    • ReconcileBackendPools - skips processing
    • GetBackendPrivateIPs - skips IP collection
  6. ensureBackendPoolDeleted - Skips nodes with empty nodeName
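
For reviewers who want the chain in one place, here is a simplified, self-contained sketch of how the six changes fit together. It is not the code in this PR: the `nic` struct, the lowercase helper names, the `managedByVMSet` constant, and the two-value return (the real function returns three values) are all assumptions made for illustration. Only the `ManagedByNoVM` name and the "empty name, nil error" convention come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

type managementType string

const (
	managedByVMSet managementType = "ManagedByVMSet" // hypothetical stand-in for the existing types
	managedByNoVM  managementType = "ManagedByNoVM"  // name taken from the PR description
)

// nic is a simplified stand-in for the Azure SDK network interface type.
type nic struct {
	name string
	vmID *string // nil until Azure associates the NIC with a VM
}

// vmNameForNIC mirrors item 1: a missing VM association yields ("", nil).
func vmNameForNIC(n nic) (string, error) {
	if n.vmID == nil {
		return "", nil
	}
	idx := strings.LastIndex(*n.vmID, "/")
	if idx < 0 || idx == len(*n.vmID)-1 {
		return "", fmt.Errorf("malformed vm ID %q on nic %s", *n.vmID, n.name)
	}
	return (*n.vmID)[idx+1:], nil
}

// managementTypeFor mirrors item 2: an empty VM name maps to managedByNoVM.
func managementTypeFor(vmName string) managementType {
	if vmName == "" {
		return managedByNoVM
	}
	return managedByVMSet
}

// nodeNameForNIC mirrors items 3 and 4: return early with an empty name.
func nodeNameForNIC(n nic) (string, error) {
	vmName, err := vmNameForNIC(n)
	if err != nil {
		return "", err
	}
	if managementTypeFor(vmName) == managedByNoVM {
		return "", nil
	}
	return vmName, nil
}

// backendNodeNames mirrors items 5 and 6: callers skip empty node names.
func backendNodeNames(nics []nic) ([]string, error) {
	var names []string
	for _, n := range nics {
		nodeName, err := nodeNameForNIC(n)
		if err != nil {
			return nil, err
		}
		if nodeName == "" {
			fmt.Printf("skipping nic %s: no VM associated yet\n", n.name)
			continue
		}
		names = append(names, nodeName)
	}
	return names, nil
}

func main() {
	vmID := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-0"
	names, err := backendNodeNames([]nic{{name: "nic-0", vmID: &vmID}, {name: "nic-1"}})
	fmt.Println(names, err)
}
```

Note that a malformed VM ID still returns an error in this sketch; only the "no association" case is downgraded to a skip.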

Added comprehensive test coverage:

  • New test case in TestGetNodeNameByIPConfigurationID for NIC without VM
  • New test case in TestEnsureBackendPoolDeleted for orphaned NIC in backend pool
  • New test case in TestGetNodeNameByIPConfigurationIDVmssFlex for orphaned NIC
  • Updated TestGetVMManagementTypeByIPConfigurationID to expect ManagedByNoVM instead of error

Does this PR introduce a user-facing change?

Handle NICs without VM associations gracefully in VMSS and VMSS Flex code paths

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Apr 8, 2026

linux-foundation-easycla Bot commented Apr 8, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: cgiradkar / name: Chetan Giradkar (c6a74a2)

@k8s-ci-robot
Contributor

Welcome @cgiradkar!

It looks like this is your first PR to kubernetes-sigs/cloud-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cloud-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 8, 2026
@k8s-ci-robot k8s-ci-robot requested a review from cheftako April 8, 2026 16:19
@k8s-ci-robot
Contributor

Hi @cgiradkar. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from nilo19 April 8, 2026 16:19
@github-actions github-actions Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 8, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cgiradkar
Once this PR has been reviewed and has the lgtm label, please assign elmiko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 8, 2026
@cgiradkar
Author

/easycla

@cgiradkar cgiradkar marked this pull request as ready for review April 8, 2026 16:36
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
@k8s-ci-robot k8s-ci-robot requested review from elmiko and feiskyer April 8, 2026 16:36
@cgiradkar cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from 7b65272 to f55ea6c Compare April 8, 2026 17:44
@cgiradkar
Author

@feiskyer @nilo19 @andyzhangx would you be able to have a look at this PR please?

@nilo19
Contributor

nilo19 commented Apr 10, 2026

will check.

@cgiradkar
Author

@nilo19 can I get an ok-to-test in the meantime? This bug is affecting my code, so I'd be thankful if you could review this small PR soon. Thanks.

@marek-veber marek-veber left a comment

The fix is directionally correct and solves a real user-facing issue. I'd want to see:

  • Logging added to the VMSS paths for parity with VMSS Flex
  • At least one test per new branch (especially the ensureBackendPoolDeleted skip)
  • A quick audit of other GetNodeNameByIPConfigurationID callers

Concerns:

  1. Insufficient test coverage. Only the existing TestGetVMManagementTypeByIPConfigurationID test was updated. The new behavior in these functions lacks test cases:
    • GetNodeNameByIPConfigurationID — no test for the ManagedByUnknownVMSet early return
    • ensureBackendPoolDeleted — no test for the nodeName == "" skip
    • getNodeInformationByIPConfigurationID — no test for the empty vmName path
  2. Callers of GetNodeNameByIPConfigurationID beyond ensureBackendPoolDeleted. The PR adds an empty-string return for ManagedByUnknownVMSet, and ensureBackendPoolDeleted handles it with a continue. But if there are other callers of GetNodeNameByIPConfigurationID elsewhere in the codebase, they may not expect ("", "", nil) and could misbehave (e.g., passing an empty node name to a downstream function). This should be audited.
  3. ManagedByUnknownVMSet semantics are overloaded. Before this PR, ManagedByUnknownVMSet with a non-nil error meant "something went wrong." Now it's also used (with nil error) to mean "NIC exists but no VM yet." This dual meaning could confuse future readers. A dedicated constant like ManagedByNoVM or a comment clarifying the nil-error variant would improve clarity.

continue
}
if nodeName == "" {
continue

IMHO, a logger.Info(...) call would be useful here.

Author

Added logging as suggested.

}

if vmManagementType == ManagedByUnknownVMSet {
return "", "", nil

IMHO, a logger.Info(...) call would be useful here.

Author

Added logging as suggested.

@cgiradkar cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from f55ea6c to a9aac12 Compare April 14, 2026 15:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 14, 2026
@cgiradkar
Author

cgiradkar commented Apr 14, 2026

The fix is directionally correct and solves a real user-facing issue. I'd want to see:

  • Logging added to the VMSS paths for parity with VMSS Flex
  • At least one test per new branch (especially the ensureBackendPoolDeleted skip)
  • A quick audit of other GetNodeNameByIPConfigurationID callers

Concerns:

  1. Insufficient test coverage. Only the existing TestGetVMManagementTypeByIPConfigurationID test was updated. The new behavior in these functions lacks test cases:

    • GetNodeNameByIPConfigurationID — no test for the ManagedByUnknownVMSet early return
    • ensureBackendPoolDeleted — no test for the nodeName == "" skip
    • getNodeInformationByIPConfigurationID — no test for the empty vmName path
  2. Callers of GetNodeNameByIPConfigurationID beyond ensureBackendPoolDeleted. The PR adds an empty-string return for ManagedByUnknownVMSet, and ensureBackendPoolDeleted handles it with a continue. But if there are other callers of GetNodeNameByIPConfigurationID elsewhere in the codebase, they may not expect ("", "", nil) and could misbehave (e.g., passing an empty node name to a downstream function). This should be audited.

  3. ManagedByUnknownVMSet semantics are overloaded. Before this PR, ManagedByUnknownVMSet with a non-nil error meant "something went wrong." Now it's also used (with nil error) to mean "NIC exists but no VM yet." This dual meaning could confuse future readers. A dedicated constant like ManagedByNoVM or a comment clarifying the nil-error variant would improve clarity.

  • [✓] Add logs aptly
  • [✓] Add more unit tests for coverage over new code
  • [✓] Audit callers of GetNodeNameByIPConfigurationID for empty-string safety
  • [✓] Improve ManagedByUnknownVMSet semantics

@cblecker
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 15, 2026
@cgiradkar cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from a9aac12 to 0865d71 Compare April 15, 2026 15:43
@cgiradkar cgiradkar requested a review from marek-veber April 15, 2026 15:43
This change addresses scenarios where NICs exist but have no attached VM,
preventing errors during load balancer operations. The fix returns empty
node names instead of errors, allowing calling code to skip these entries.

Key changes:
- Return empty string from GetVMNameByIPConfigurationName for unattached NICs
- Add ManagedByUnknownVMSet handling in GetNodeNameByIPConfigurationID paths
- Add logging (V2) when skipping nodes due to missing VM associations
- Guard all GetNodeNameByIPConfigurationID callers against empty node names
- Add comprehensive test coverage for all new code paths (VMSS, Flex, callers)
- Document ManagedByUnknownVMSet dual semantics (error vs orphaned NIC)

Fixes issue where backend pool reconciliation would fail when encountering
NICs created but not yet attached to VMs during cluster scaling operations.

Signed-off-by: Chetan Giradkar <cgiradka@redhat.com>
@cgiradkar cgiradkar force-pushed the fix/handle-missing-nic-vm-association branch from 0865d71 to c6a74a2 Compare April 15, 2026 17:07
@anndono
Contributor

anndono commented Apr 16, 2026

/retest

@cgiradkar
Author

@nilo19 @cheftako can you PTAL

// Return empty when VM association is missing
if nic.Properties == nil || nic.Properties.VirtualMachine == nil || nic.Properties.VirtualMachine.ID == nil {
- return "", fmt.Errorf("failed to get vm ID of nic %s", ptr.Deref(nic.Name, ""))
+ return "", nil
Contributor

Do you create the NIC manually? When would a NIC have no VM association? If this were a common problem we would have received a lot of tickets, but we haven't. Is there a special use case on your side?

Author

A customer had reported that when creating a Kubernetes service of type LoadBalancer, the operation fails if any VMs in the node pool are still provisioning or stuck in creation. The expectation is that the load balancer should skip VMs that are not yet ready and add them later once they complete provisioning, rather than failing the entire operation.

SyncLoadBalancerFailed

Error syncing load balancer: failed to ensure load balancer: failed to get vm name by ip config ID /subscriptions/.../xxxx-xxxx-xxxx-nic/ipConfigurations/pipConfig: failed to get vm ID of xxxx-xxxx-xxxx-nic
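
The failure mode behind this event can be sketched as a fail-fast loop: one NIC without a VM aborts the whole operation, even though the other nodes are usable. This is a minimal illustration of the behavior before this PR; `lookupVMName` and `ensureLoadBalancer` here are hypothetical stand-ins with simplified signatures, not the real functions.

```go
package main

import "fmt"

// lookupVMName is a hypothetical stand-in for the pre-change lookup:
// a NIC with no VM recorded is treated as an error.
func lookupVMName(nicName string, vmByNIC map[string]string) (string, error) {
	vm := vmByNIC[nicName]
	if vm == "" {
		return "", fmt.Errorf("failed to get vm ID of nic %s", nicName)
	}
	return vm, nil
}

// ensureLoadBalancer returns on the first lookup error, so a single
// unattached NIC blocks every other node from joining the backend pool.
func ensureLoadBalancer(nics []string, vmByNIC map[string]string) ([]string, error) {
	var backends []string
	for _, n := range nics {
		vm, err := lookupVMName(n, vmByNIC)
		if err != nil {
			return nil, fmt.Errorf("failed to ensure load balancer: %w", err)
		}
		backends = append(backends, vm)
	}
	return backends, nil
}

func main() {
	// nic-b has no VM yet, for example because provisioning is stuck on quota.
	backends, err := ensureLoadBalancer(
		[]string{"nic-a", "nic-b"},
		map[string]string{"nic-a": "vm-a"},
	)
	fmt.Println(backends, err)
}
```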

Author

Steps to Reproduce

  1. Create an ARO-HCP cluster
  2. Create a node pool that requests VMs exceeding existing quota (to simulate delayed/stuck provisioning)
  3. Wait until node provisioning gets stuck or delayed
  4. Attempt to provision a service of type LoadBalancer

Contributor

A customer had reported that when creating a Kubernetes service of type LoadBalancer, the operation fails if any VMs in the node pool are still provisioning or stuck in creation. The expectation is that the load balancer should skip VMs that are not yet ready and add them later once they complete provisioning, rather than failing the entire operation.

SyncLoadBalancerFailed

Error syncing load balancer: failed to ensure load balancer: failed to get vm name by ip config ID /subscriptions/.../xxxx-xxxx-xxxx-nic/ipConfigurations/pipConfig: failed to get vm ID of xxxx-xxxx-xxxx-nic

If this is the case, it should be fixed automatically in the next reconcile, once the VM is ready.

Contributor

BTW, what is an ARO-HCP cluster? Does this issue occur only in ARO-HCP clusters?

Author

@cgiradkar cgiradkar May 14, 2026

BTW, what is an ARO-HCP cluster? Does this issue occur only in ARO-HCP clusters?

ARO-HCP (https://github.com/Azure/ARO-HCP) uses the standard OpenShift cloud-controller-manager, which is based on cloud-provider-azure.
This issue will occur in any cluster using cloud-provider-azure.

Author

@cgiradkar cgiradkar May 14, 2026

If this is the case, it should be fixed automatically in the next reconcile, once the VM is ready.

In this specific case, the whole set of VMs is never going to be provisioned (due to the constraints described above), so waiting for the next cycle won't help: the state doesn't change, so no further processing follows from this file.
This PR makes the handling of VM state transitions (among a set of VMs to be provisioned) more granular, even when the re-sync duration is bearable.

Contributor

But if there is a genuine issue from the NRP/NIC side, this change will silently hide the error.

Author

Would it be an acceptable solution if we log it and proceed? This is an edge case, but it is still encountered in production.
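
One possible shape for "log it and proceed" that keeps genuine NIC errors visible is a sentinel error that callers match with `errors.Is`. This is a sketch of an alternative, not what this PR currently does (the PR returns an empty name with a nil error); `errNoVMAssociation`, `errTransient`, `nicState`, and both helpers are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errNoVMAssociation is a hypothetical sentinel marking the benign case.
var errNoVMAssociation = errors.New("nic has no vm association")

// errTransient is a hypothetical example of a genuine lookup failure.
var errTransient = errors.New("transient NIC API failure")

// nicState is a simplified stand-in for what a NIC lookup might return.
type nicState struct {
	name      string
	vmName    string
	lookupErr error
}

func resolveVM(n nicState) (string, error) {
	if n.lookupErr != nil {
		return "", fmt.Errorf("get nic %s: %w", n.name, n.lookupErr)
	}
	if n.vmName == "" {
		return "", fmt.Errorf("nic %s: %w", n.name, errNoVMAssociation)
	}
	return n.vmName, nil
}

// collectBackends skips only the sentinel case and logs it;
// every other error still fails the sync.
func collectBackends(nics []nicState) ([]string, error) {
	var backends []string
	for _, n := range nics {
		vm, err := resolveVM(n)
		switch {
		case errors.Is(err, errNoVMAssociation):
			log.Printf("skipping nic %s: no VM attached yet", n.name)
		case err != nil:
			return nil, err
		default:
			backends = append(backends, vm)
		}
	}
	return backends, nil
}

func main() {
	backends, err := collectBackends([]nicState{
		{name: "nic-a", vmName: "vm-a"},
		{name: "nic-b"}, // still provisioning
	})
	fmt.Println(backends, err)

	_, err = collectBackends([]nicState{{name: "nic-c", lookupErr: errTransient}})
	fmt.Println(err)
}
```

The trade-off is that every caller has to opt in to the skip explicitly, which addresses the concern about silently hiding errors at the cost of touching more call sites.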

Contributor

I think it would be more reasonable to solve the root cause, so that the CCM reconcile is unblocked. This PR introduces a behavior change to solve an edge case, and I'm not sure it's worth it, as it may introduce other issues.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

6 participants