bug: standalone VM nodes are still misclassified as VMSS on nodeName-based paths #10217

@nilo19


What happened:

PR #10194 fixed standalone VM classification for providerID and ipConfigurationID, but getVMManagementTypeByNodeName still has the old unconditional short-circuit:

if ss.DisableAvailabilitySetNodes && !ss.EnableVmssFlexNodes {
    return ManagedByVmssUniform, nil
}

That means a standalone VM node can still be misclassified as VMSS uniform on nodeName-based paths when:

  • vmType=vmss
  • DisableAvailabilitySetNodes=true
  • EnableVmssFlexNodes=false
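To make the failure mode concrete, here is a minimal, self-contained sketch of the short-circuit under those three conditions. It is a reduced model, not the provider's code: `scaleSetConfig`, `classifyByNodeName`, and the `isStandalone` parameter are hypothetical stand-ins, and the constants only mirror the names used in this issue.

```go
package main

import "fmt"

// VMManagementType is a simplified stand-in for the provider's
// management-type enum; it is not the provider's actual definition.
type VMManagementType string

const (
	ManagedByVmssUniform VMManagementType = "ManagedByVmssUniform"
	ManagedByAvSet       VMManagementType = "ManagedByAvSet"
)

// scaleSetConfig is a hypothetical, reduced model holding only the two
// flags that the short-circuit inspects.
type scaleSetConfig struct {
	DisableAvailabilitySetNodes bool
	EnableVmssFlexNodes         bool
}

// classifyByNodeName models the ordering described in this issue:
// the flag check runs first, so isStandalone (a stand-in for any
// non-VMSS cache lookup) is never consulted when the flags match.
func classifyByNodeName(cfg scaleSetConfig, isStandalone bool) VMManagementType {
	if cfg.DisableAvailabilitySetNodes && !cfg.EnableVmssFlexNodes {
		return ManagedByVmssUniform
	}
	if isStandalone {
		return ManagedByAvSet
	}
	return ManagedByVmssUniform
}

func main() {
	cfg := scaleSetConfig{DisableAvailabilitySetNodes: true, EnableVmssFlexNodes: false}
	// The node is a standalone VM, yet the flags alone decide the result.
	fmt.Println(classifyByNodeName(cfg, true))
}
```

In this model the standalone node is reported as `ManagedByVmssUniform`; with `DisableAvailabilitySetNodes=false` the same call falls through to the lookup and returns `ManagedByAvSet`.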

The remaining affected callers include at least:

  • GetPowerStatusByNodeName
  • GetProvisioningStateByNodeName
  • GetInstanceTypeByNodeName
  • GetZoneByNodeName
  • EnsureHostsInPool
  • GetNodeVMSetName

So after #10194, some standalone-VM flows are fixed, but nodeName-based flows can still take the VMSS handler incorrectly.

What you expected to happen:

Standalone VM nodes should not be classified as ManagedByVmssUniform just because DisableAvailabilitySetNodes=true.

NodeName-based classification should be consistent with the providerID/ipConfigurationID fixes from #10194, so standalone VMs route to the availability-set/standalone-VM handler instead of the VMSS-uniform handler.
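One possible shape for the expected behavior is sketched below. This is an assumption about the ordering only, not the change made in #10194 and not a proposed patch: the `classifier` type and its `standaloneVMs` set are hypothetical, standing in for whatever cache or lookup the provider would use to recognize a standalone VM by node name.

```go
package main

import "fmt"

// VMManagementType is a simplified stand-in for the provider's enum.
type VMManagementType string

const (
	ManagedByVmssUniform VMManagementType = "ManagedByVmssUniform"
	ManagedByAvSet       VMManagementType = "ManagedByAvSet"
)

// classifier is a hypothetical, reduced model of the relevant state.
type classifier struct {
	DisableAvailabilitySetNodes bool
	EnableVmssFlexNodes         bool
	// standaloneVMs is an assumed set of node names known to be
	// standalone VMs; the real provider would consult its own caches.
	standaloneVMs map[string]struct{}
}

// byNodeName checks for a known standalone VM before applying the
// flag-based short-circuit, so the flags no longer override the lookup.
func (c *classifier) byNodeName(nodeName string) VMManagementType {
	if _, ok := c.standaloneVMs[nodeName]; ok {
		return ManagedByAvSet
	}
	if c.DisableAvailabilitySetNodes && !c.EnableVmssFlexNodes {
		return ManagedByVmssUniform
	}
	// Flex and other branches are omitted from this sketch.
	return ManagedByVmssUniform
}

func main() {
	c := &classifier{
		DisableAvailabilitySetNodes: true,
		EnableVmssFlexNodes:         false,
		standaloneVMs:               map[string]struct{}{"standalone-vm-0": {}},
	}
	fmt.Println(c.byNodeName("standalone-vm-0"))
	fmt.Println(c.byNodeName("vmss-node-000001"))
}
```

With the lookup placed first, the standalone node is classified as `ManagedByAvSet` while an unknown node name still takes the uniform path. Whether an extra lookup ahead of the short-circuit is acceptable cost-wise is a design question this sketch does not address.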

How to reproduce it (as minimally and precisely as possible):

  1. Configure Azure CCM with:
    • vmType=vmss
    • DisableAvailabilitySetNodes=true
    • EnableVmssFlexNodes=false
  2. Have at least one real standalone VM node in the cluster (providerID format: /providers/Microsoft.Compute/virtualMachines/..., not a VMSS instance).
  3. Trigger any nodeName-based CCM path, for example:
    • GetPowerStatusByNodeName / GetProvisioningStateByNodeName
    • GetInstanceTypeByNodeName
    • GetZoneByNodeName
    • EnsureHostsInPool
    • GetNodeVMSetName
  4. Observe that getVMManagementTypeByNodeName returns ManagedByVmssUniform before consulting any non-VMSS cache/lookup, so the request is routed down the VMSS-uniform path.

Code pointers on current master:

  • pkg/provider/azure_vmss_cache.go: getVMManagementTypeByNodeName
  • pkg/provider/azure_vmss_cache.go: getVMManagementTypeByProviderID
  • pkg/provider/azure_vmss_cache.go: getVMManagementTypeByIPConfigurationID

Anything else we need to know?:

  • This looks like the remaining half of the same standalone-VM problem family addressed by #10194 (fix: route standalone VM providerID to availability set handler).
  • I validated live that providerID-based node lifecycle paths are active in CCM (GetNodeNameByProviderID from node_lifecycle_controller), and then checked the source to find the remaining unconditional nodeName short-circuit.
  • On the live clusters I used for comparison, DisableAvailabilitySetNodes was not enabled, so this remaining bug was dormant there. That is why I did not see a before/after behavior change from the master image on those clusters.

Environment:

  • Kubernetes version (use kubectl version):
    • Affected per source inspection on current master; the live comparison was done on AKS/HCP clusters running custom CCM images.
  • Cloud provider or hardware configuration:
    • Azure
    • vmType=vmss
    • standalone VM nodes present
    • bug requires DisableAvailabilitySetNodes=true and EnableVmssFlexNodes=false
  • OS (e.g: cat /etc/os-release):
    • Linux nodes
  • Kernel (e.g. uname -a):
    • N/A
  • Install tools:
    • CCM in managed AKS/HCP-style deployment
  • Network plugin and version (if this is a network-related bug):
    • N/A
  • Others:

Labels: kind/bug, triage/accepted
