-
Notifications
You must be signed in to change notification settings - Fork 319
Handle NICs without VM associations gracefully #10164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1697,6 +1697,11 @@ func (ss *ScaleSet) GetNodeNameByIPConfigurationID(ctx context.Context, ipConfig | |
| return "", "", err | ||
| } | ||
|
|
||
| if vmManagementType == ManagedByNoVM { | ||
| logger.V(2).Info("No VM attached, skipping node", "ipConfigurationID", ipConfigurationID) | ||
| return "", "", nil | ||
| } | ||
|
|
||
| if vmManagementType == ManagedByAvSet { | ||
| // vm is managed by availability set. | ||
| return ss.availabilitySet.GetNodeNameByIPConfigurationID(ctx, ipConfigurationID) | ||
|
|
@@ -1963,6 +1968,10 @@ func (ss *ScaleSet) ensureBackendPoolDeleted(ctx context.Context, service *v1.Se | |
| allErrs = append(allErrs, err) | ||
| continue | ||
| } | ||
| if nodeName == "" { | ||
| logger.V(2).Info("Empty nodeName, skipping node", "service", getServiceName(service), "ipConfigurationID", ipConfigurationID) | ||
| continue | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMHO: Some logger.Info(...) will be useful here.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added logs aptly |
||
| } | ||
|
|
||
| nodeResourceGroup, nodeVMSS, nodeInstanceID, nodeVMSSVM, err := ss.ensureBackendPoolDeletedFromNode(ctx, nodeName, backendPoolIDs) | ||
| if err != nil { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,6 +63,7 @@ const ( | |
| ManagedByVmssFlex VMManagementType = "ManagedByVmssFlex" | ||
| ManagedByAvSet VMManagementType = "ManagedByAvSet" | ||
| ManagedByUnknownVMSet VMManagementType = "ManagedByUnknownVMSet" | ||
| ManagedByNoVM VMManagementType = "ManagedByNoVM" | ||
| ) | ||
|
|
||
| func (ss *ScaleSet) newVMSSCache() (azcache.Resource, error) { | ||
|
|
@@ -530,6 +531,9 @@ func (ss *ScaleSet) getVMManagementTypeByIPConfigurationID(ctx context.Context, | |
| if err != nil { | ||
| return ManagedByUnknownVMSet, fmt.Errorf("failed to get vm name by ip config ID %s: %w", ipConfigurationID, err) | ||
| } | ||
| if vmName == "" { | ||
| return ManagedByNoVM, nil | ||
| } | ||
| if cachedAvSetVMs.Has(vmName) { | ||
| return ManagedByAvSet, nil | ||
| } | ||
|
|
@@ -542,10 +546,14 @@ func (az *Cloud) GetVMNameByIPConfigurationName(ctx context.Context, nicResource | |
| if rerr != nil { | ||
| return "", fmt.Errorf("failed to get interface of name %s: %w", nicName, rerr) | ||
| } | ||
| // Return empty when VM association is missing | ||
| if nic.Properties == nil || nic.Properties.VirtualMachine == nil || nic.Properties.VirtualMachine.ID == nil { | ||
| return "", fmt.Errorf("failed to get vm ID of nic %s", ptr.Deref(nic.Name, "")) | ||
| return "", nil | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you manually create the nic? When will the nic has no vm association? If this is a problem we must have been received a lot of tickets, but we haven't. May I know if there is a special use case from your side?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A customer had reported that when creating a Kubernetes service of type LoadBalancer, the operation fails if any VMs in the node pool are still provisioning or stuck in creation. The expectation is that the load balancer should skip VMs that are not yet ready and add them later once they complete provisioning, rather than failing the entire operation.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Steps to Reproduce
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
If this is the case, in the next reconcile when the vm has been ready, it should auto fixed.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW, what is ARO-HCP cluster? Is this issue only in ARO-HCP clusters?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
ARO-HCP: https://github.com/Azure/ARO-HCP : uses standard OpenShift cloud-controller-manager (which is based on cloud-provider-azure)
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In this specific case, the whole array of VMS are never gonna get provisioned (due to said constraints), so waiting for next cycle wont do it as the state didnt change so no processing would follow from this file.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But if there is a genuine issue from the NRP/NIC side, this change will silently hide the error.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would it be an acceptable solution of we log it and proceed? This use case is an edge case but still encountered in production.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it would be more reasonable to solve the root cause, so the ccm reconcile will be unblocked. This change introduces behavior change to solve an edge case, and I'm not sure if it's worth it, as it may introduce other issues. |
||
| } | ||
| vmID := ptr.Deref(nic.Properties.VirtualMachine.ID, "") | ||
| if vmID == "" { | ||
| return "", nil | ||
| } | ||
| matches := vmIDRE.FindStringSubmatch(vmID) | ||
| if len(matches) != 2 { | ||
| return "", fmt.Errorf("invalid virtual machine ID %s", vmID) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO: Some logger.Info(...) will be useful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added logs aptly