Reduce listPIP 429 #10226

@nilo19

What happened

cloud-controller-manager repeatedly logs serviceOwnsFrontendIP warnings while reconciling an internal LoadBalancer Service with a private pinned IP:

serviceOwnsFrontendIP: unexpected error when finding match public IP of the service ingress-nginx-controller with loadBalancerIP 10.104.176.35:
findMatchedPIPByLoadBalancerIP: failed to listPIP force refresh: throttled due to too many requests

The Service is internal and uses:

service.beta.kubernetes.io/azure-load-balancer-internal: "true"
service.beta.kubernetes.io/azure-load-balancer-ipv4: "10.104.176.35"

During reconcile/cleanup, the ownership check can enter the external/public-PIP branch with this private IP and try to find a Public IP resource whose address is 10.104.176.35. That lookup is a guaranteed miss, and under ARM throttling it can amplify repeated listPIP / force-refresh calls.

Symptoms

  • Repeated serviceOwnsFrontendIP warnings for the same Service/IP.
  • Errors alternate between:
    • cannot find public IP with IP address <private-ip>
    • failed to listPIP
    • failed to listPIP force refresh
  • Reconcile latency increases while ARM is throttling.
  • Logs become noisy around node churn and Service backend syncs.

Suspected code path

Relevant paths:

  • pkg/provider/azure_loadbalancer.go
    • reconcileService
    • reconcileLoadBalancer
    • getServiceLoadBalancerStatus
    • reconcileFrontendIPConfigs
    • isFrontendIPChanged
    • serviceOwnsFrontendIP
  • pkg/provider/azure_publicip_repo.go
    • findMatchedPIP
    • findMatchedPIPByLoadBalancerIP
    • listPIP
  • pkg/cache/azure_cache.go
    • TimedCache.get

Observed flow:

Service reconcile reaches frontend ownership checks
-> serviceOwnsFrontendIP sees loadBalancerIP="10.104.176.35"
-> external/public-PIP branch calls findMatchedPIP
-> listPIP(default)
-> on miss, listPIP(force refresh)
-> 429 or guaranteed not found

Why this is expensive

serviceOwnsFrontendIP is used like a cheap boolean predicate, but for external secondary Services it can trigger ARM-backed Public IP discovery.

One reconcile may call it multiple times:

  • while scanning frontend IP configs for status
  • while reconciling frontend IP configs
  • while checking whether frontend IP changed
  • while deriving owned frontend config IP version

When lookup fails, serviceOwnsFrontendIP logs and returns false. The error is not propagated as a reusable throttling/lookup state, so callers may continue scanning and re-enter the same PIP lookup.

findMatchedPIPByLoadBalancerIP force-refreshes the PIP cache on every miss. The cache does not memoize negative results or 429 failures, so repeated callers can repeatedly hit ARM.

Expected behavior

For a private loadBalancerIP, the public-PIP ownership path should avoid repeated Public IP list/force-refresh calls.
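
A guard for this case could look roughly like the following Go sketch. The helper name shouldSkipPublicIPLookup is hypothetical, and the sketch assumes that "private" means what the standard library's netip.Addr.IsPrivate reports (RFC 1918 for IPv4, RFC 4193 for IPv6); the provider may want a narrower or wider definition.

```go
package main

import (
	"fmt"
	"net/netip"
)

// shouldSkipPublicIPLookup is a hypothetical helper (not an existing provider
// function). It reports whether loadBalancerIP parses as a private address and
// therefore cannot match the address of a Public IP resource.
func shouldSkipPublicIPLookup(loadBalancerIP string) bool {
	addr, err := netip.ParseAddr(loadBalancerIP)
	if err != nil {
		// Empty or malformed input: do not short-circuit, leave the
		// decision to whatever validation already exists.
		return false
	}
	// Unmap normalizes IPv4-mapped IPv6 addresses such as ::ffff:10.0.0.1.
	return addr.Unmap().IsPrivate()
}

func main() {
	for _, ip := range []string{"10.104.176.35", "20.50.1.7", ""} {
		fmt.Printf("%q skip=%v\n", ip, shouldSkipPublicIPLookup(ip))
	}
}
```

For the IP in this report, 10.104.176.35, the helper returns true, so the ownership scan would never reach listPIP.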

During ARM throttling, repeated ownership checks should avoid re-triggering identical PIP lookups within the same reconcile.

Proposed improvements

  • Short-circuit impossible public-PIP lookups.

    If serviceOwnsFrontendIP is on the external/public-PIP branch and loadBalancerIP is private/RFC1918, skip findMatchedPIP during ownership scanning.

    For real external create/update paths where a user specifies a private IP, return a clear validation error instead of repeatedly listing PIPs.

  • Add per-reconcile memoization for PIP lookup.

    Memoize PIP lookup results by:

    (resourceGroup, loadBalancerIP or pipName)

    within one reconcile. Cache both success and failure/throttle state for the reconcile duration.

  • Stop flattening lookup errors into false.

    Change ownership resolution to distinguish:

    owned=false, err=nil        // definitely not owned
    owned=false, err=throttled  // lookup unknown due to throttling
    owned=false, err=notFound   // lookup completed, no match

    Then the top-level reconcile can log once and back off instead of continuing as if the frontend simply belongs to another Service.

  • Reduce NSG retain log amplification.

    listAvailableSecurityGroupDestinations and RetainDestinationFromRules can log very large duplicated destination lists. De-duping retained destinations before retain/logging would reduce noise and CPU overhead, though it does not directly fix the PIP 429 loop.
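
The de-duplication step in the last item could be as small as the Go sketch below. It assumes destinations are available as strings, which may not match the types the provider actually uses, and dedupeDestinations is a hypothetical name.

```go
package main

import "fmt"

// dedupeDestinations returns dests with duplicates removed, keeping the first
// occurrence of each value so the original order stays stable in logs.
// Hypothetical helper; the input type is an assumption.
func dedupeDestinations(dests []string) []string {
	seen := make(map[string]struct{}, len(dests))
	out := make([]string, 0, len(dests))
	for _, d := range dests {
		if _, ok := seen[d]; ok {
			continue
		}
		seen[d] = struct{}{}
		out = append(out, d)
	}
	return out
}

func main() {
	fmt.Println(dedupeDestinations([]string{"10.0.0.4", "10.0.0.5", "10.0.0.4"}))
}
```

Running the de-dupe once before both the retain step and the log statement keeps the cost linear in the number of destinations.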

Acceptance criteria

  • Private loadBalancerIP values do not trigger Public IP list/force-refresh calls from ownership scanning.
  • Repeated serviceOwnsFrontendIP checks in one reconcile do not repeat identical PIP list calls after a miss or 429.
  • A 429 from the PIP list call is propagated or memoized as an unknown/throttled lookup state, not silently converted to false.
  • Unit tests cover:
    • public-PIP ownership scan with private loadBalancerIP skips findMatchedPIP
    • external Service with private pinned IP either fails validation or skips the lookup early, whichever behavior is chosen
    • repeated ownership checks reuse memoized PIP lookup result
    • throttled PIP lookup is not retried multiple times in one reconcile

Labels: kind/bug, triage/accepted
