Cloud Controller Manager restarts cause repeated LoadBalancer service disruptions with externalTrafficPolicy: Local #10224

**Environment**

- Kubernetes version: 1.33.x (self-managed cluster)

- Cloud provider: Azure

During a routine maintenance operation, the Cloud Controller Manager (CCM) pods restarted multiple times over a period of more than 3 hours. Each restart triggered the following sequence:

1. Leader election lease expires (~15 seconds)
2. New CCM pod acquires leadership
3. New leader's Service controller performs a full sync of all LoadBalancer services
4. The Azure Load Balancer's backend pools and health probes are reconfigured
5. Active connections through the LoadBalancer are terminated
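
As a rough illustration of step 3, the sketch below models which services a newly elected leader would queue for reconciliation. It is a simplified model of the behavior described above, not the Service controller's actual code; the `Service` class and `services_queued_on_leadership` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical, minimal stand-in for a Kubernetes Service object."""
    name: str
    type: str  # e.g. "LoadBalancer", "ClusterIP", "NodePort"


def services_queued_on_leadership(services):
    """Return the names of services a full sync would reconcile.

    Assumption: a full sync queues every LoadBalancer service,
    whether or not its configuration changed since the last sync.
    """
    return [s.name for s in services if s.type == "LoadBalancer"]


cluster = [
    Service("web", "LoadBalancer"),
    Service("db", "ClusterIP"),
    Service("ws-gateway", "LoadBalancer"),
]
queued = services_queued_on_leadership(cluster)
print(queued)
```

In this model the number of cloud-side reconciliations per leadership change grows with the number of LoadBalancer services, independent of how many of them actually changed.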

With externalTrafficPolicy: Local, the impact is amplified because:

1. Traffic is routed only to nodes that run a ready pod for the service
2. Azure LB health probes against the service's `healthCheckNodePort` are recalculated during each reconciliation
3. During the transition there are brief windows in which backends are marked unhealthy
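
The points above can be sketched as a small model. With `externalTrafficPolicy: Local`, the health check endpoint on each node reports healthy only when that node has at least one local ready endpoint for the service, so the load balancer's view of eligible backends follows pod placement. The function names and node data below are hypothetical and only illustrate the rule.

```python
def health_check_status(local_ready_endpoints: int) -> int:
    """HTTP status a node's health check endpoint would return.

    200 when the node has at least one local ready endpoint,
    503 otherwise.
    """
    return 200 if local_ready_endpoints > 0 else 503


def eligible_backends(endpoints_per_node: dict) -> list:
    """Return the nodes the load balancer would treat as healthy."""
    return sorted(
        node
        for node, count in endpoints_per_node.items()
        if health_check_status(count) == 200
    )


# Hypothetical cluster: node-b runs no pod for this service.
nodes = {"node-a": 2, "node-b": 0, "node-c": 1}
backends = eligible_backends(nodes)
print(backends)
```

If a probe is recreated or reset during reconciliation, a node such as `node-a` may briefly be treated as unhealthy even though its local pods never changed, which would match the windows described in point 3.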

**Impact**

- 6 CCM pod transitions occurred during the incident

- Each transition triggered 3-5 service reconciliation events

- Long-lived connections (websockets) were repeatedly dropped

- Total disruption window extended to ~3.5 hours due to cascading reconciliations
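
Taking the first two figures at face value, the total number of reconciliation events during the incident can be bounded with simple arithmetic. This is only a worked calculation from the numbers reported above; the variable names are illustrative.

```python
# Figures reported in the impact list above.
ccm_transitions = 6
events_per_transition_min = 3
events_per_transition_max = 5

# Each transition triggered between 3 and 5 reconciliation events.
total_min = ccm_transitions * events_per_transition_min
total_max = ccm_transitions * events_per_transition_max

print(f"{total_min}-{total_max} reconciliation events in total")
```

That puts the incident at 18 to 30 reconciliation events, each a potential point at which long-lived connections could be dropped.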

**Expected Behavior**

CCM restarts should minimize disruption to existing LoadBalancer services, particularly when:

- The underlying service configuration has not changed

- Backend pods remain healthy and unchanged

- Only the CCM pod itself is restarting

**Questions for Maintainers**

1. Is there a mechanism to perform incremental reconciliation rather than full sync on leader election?
2. Can the Service controller detect that no actual changes occurred and skip Azure API calls?
3. Are there recommended configurations to reduce disruption during CCM pod transitions?
4. Would implementing connection draining or gradual backend pool updates help mitigate this?
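
To make question 2 concrete, here is a minimal sketch of one possible change-detection approach: fingerprint the desired load balancer configuration and skip the cloud API call when the fingerprint matches the last applied one. This is an illustration of the idea only, not how the CCM or the Azure cloud provider works today; `config_fingerprint`, `needs_cloud_update`, and the spec layout are all hypothetical.

```python
import hashlib
import json


def config_fingerprint(spec: dict) -> str:
    """Return a stable SHA-256 fingerprint of a desired LB configuration."""
    # Sorting keys makes the fingerprint independent of dict ordering.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_cloud_update(desired_spec: dict, last_applied_fingerprint) -> bool:
    """Return True when the desired state differs from what was last applied.

    A missing fingerprint (None) always forces an update.
    """
    return config_fingerprint(desired_spec) != last_applied_fingerprint


# Hypothetical desired state for one service.
spec = {
    "externalTrafficPolicy": "Local",
    "healthCheckNodePort": 32000,
    "ports": [{"port": 443, "protocol": "TCP"}],
}
applied = config_fingerprint(spec)

unchanged = needs_cloud_update(dict(spec), applied)
changed = needs_cloud_update({**spec, "healthCheckNodePort": 32001}, applied)
first_sync = needs_cloud_update(spec, None)
print(unchanged, changed, first_sync)
```

A check like this only compares Kubernetes-side desired state. It would not notice drift on the Azure side, which may be one reason a full reconciliation is performed on startup, so any such optimization would likely need a periodic full check as well.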

This behavior appears to be by design, since the Service controller performs a full reconciliation on startup. The impact on production workloads with long-lived connections is nonetheless significant. We are looking for guidance on best practices, or on potential enhancements, to reduce it.
