Environment:
Kubernetes Version: 1.33.x (self-managed cluster)
Cloud Provider: Azure
During a routine maintenance operation, CCM pods experienced multiple restarts over a 3+ hour period. Each CCM pod restart triggered the following sequence:
- Leader election lease expires (~15 seconds)
- New CCM pod acquires leadership
- New leader's Service controller performs a full sync of all LoadBalancer services
- Azure Load Balancer reconfigures backend pools and health probes
- Active connections through the LoadBalancer are terminated
With externalTrafficPolicy: Local, the impact is amplified because:
- Traffic is only routed to nodes with running service pods
- Azure LB health checks (HealthCheckNodePort) are recalculated during each reconciliation
- Brief windows exist where backends are marked unhealthy during the transition
Impact
- 6 CCM pod transitions occurred during the incident
- Each transition triggered 3-5 service reconciliation events
- Long-lived connections (websockets) were repeatedly dropped
- Total disruption window extended to ~3.5 hours due to cascading reconciliations
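For scale, the figures above can be combined into a rough bound on the total number of reconciliation events. This is a back-of-the-envelope sketch only: it assumes the 3-5 range applies to each of the 6 transitions, and the variable names are illustrative.

```python
# Figures taken from the Impact list above.
transitions = 6
reconciles_per_transition = (3, 5)  # observed low and high per transition

# Assumption: every transition falls somewhere inside the observed range.
low_total = transitions * reconciles_per_transition[0]
high_total = transitions * reconciles_per_transition[1]

print(f"Estimated reconciliation events: {low_total} to {high_total}")
# prints "Estimated reconciliation events: 18 to 30"
```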
Expected Behavior
CCM restarts should minimize disruption to existing LoadBalancer services, particularly when:
- The underlying service configuration has not changed
- Backend pods remain healthy and unchanged
- Only the CCM pod itself is restarting
Questions for Maintainers
- Is there a mechanism to perform incremental reconciliation rather than full sync on leader election?
- Can the Service controller detect that no actual changes occurred and skip Azure API calls?
- Are there recommended configurations to reduce disruption during CCM pod transitions?
- Would implementing connection draining or gradual backend pool updates help mitigate this?
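To make the second question concrete, here is a minimal sketch of the kind of no-op detection we have in mind. It is not how the Azure cloud provider works today, as far as we know. Every name in it (`fingerprint`, `reconcile`, `FakeCloud`, `last_applied`) is hypothetical, and the in-memory dict stands in for state that would have to survive a CCM restart, for example by being persisted on the Service object.

```python
import hashlib
import json


def fingerprint(desired: dict) -> str:
    """Return a stable hash of the desired load balancer configuration."""
    # sort_keys makes the hash independent of dict key ordering.
    canonical = json.dumps(desired, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FakeCloud:
    """Hypothetical stand-in for the cloud API client; only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def update_load_balancer(self, name: str, desired: dict) -> None:
        self.calls += 1


def reconcile(name: str, desired: dict, last_applied: dict, cloud: FakeCloud) -> bool:
    """Apply the desired config only if it differs from the last applied one.

    Returns True when a cloud API call was made, False when it was skipped.
    """
    fp = fingerprint(desired)
    if last_applied.get(name) == fp:
        return False  # nothing changed since the last successful apply
    cloud.update_load_balancer(name, desired)
    last_applied[name] = fp
    return True


cloud = FakeCloud()
last_applied = {}  # assumption: this state outlives the controller process
desired = {"ports": [443], "backends": ["node-a", "node-b"]}

first = reconcile("web", desired, last_applied, cloud)   # applies the config
second = reconcile("web", desired, last_applied, cloud)  # skipped as a no-op
```

One caveat with this approach: comparing only against the last applied fingerprint would not notice out-of-band changes made directly to the load balancer, so a periodic full sync would presumably still be needed.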
This behavior appears to be by design based on how the Service controller performs full reconciliation on startup, but the impact on production workloads with long-lived connections is significant. We are looking for guidance on best practices or potential enhancements to reduce this impact.