feat: rollout restart single-replica Deployments instead of evicting#1841
feat: rollout restart single-replica Deployments instead of evicting#1841dmitriy-myz wants to merge 1 commit into
Conversation
For Deployments with replicas=1 and RollingUpdate strategy, trigger a rollout restart (patch pod template annotation) instead of using the Pod Eviction API. This avoids downtime by letting the Deployment controller create a new pod before terminating the old one. Falls through to normal eviction on errors, non-Deployment pods, multi-replica Deployments, or Recreate strategy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @dmitriy-myz! |
|
Hi @dmitriy-myz. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This pull request implements a rollout restart mechanism for single-replica Deployments instead of using the Pod Eviction API. The goal is to avoid downtime during descheduling by allowing the Deployment controller to create a new pod before terminating the old one, addressing issues #786 and #1558.
Changes:
- Modified eviction logic to detect single-replica Deployments with RollingUpdate strategy and trigger a rollout restart instead of direct eviction
- Added RBAC permissions for reading ReplicaSets and Deployments, and patching Deployments
- Added comprehensive test coverage for various scenarios including multi-replica, non-Deployment pods, Recreate strategy, unhealthy deployments, and deduplication
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/descheduler/evictions/evictions.go | Implements rollout restart logic including owner chain resolution (Pod → ReplicaSet → Deployment), deployment patching, and integration with existing eviction flow |
| pkg/descheduler/evictions/evictions_test.go | Adds comprehensive test coverage for rollout restart feature covering edge cases like multi-replica, non-Deployment pods, Recreate strategy, deduplication, dry-run mode, and unhealthy deployments |
| pkg/descheduler/kubeclientsandbox.go | Registers ReplicaSets and Deployments informers in the sandbox for dry-run mode support |
| kubernetes/base/rbac.yaml | Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments |
| charts/descheduler/templates/clusterrole.yaml | Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments (Helm chart) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // rolloutRestartDeployment patches the Deployment's pod template annotation to trigger a rolling restart. | ||
| func (pe *PodEvictor) rolloutRestartDeployment(ctx context.Context, deploy *appsv1.Deployment) error { | ||
| patch := fmt.Sprintf(`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"%s"}}}}}`, time.Now().Format(time.RFC3339)) |
There was a problem hiding this comment.
The annotation key "kubectl.kubernetes.io/restartedAt" is used here to trigger a rollout restart, which matches the convention used by kubectl rollout restart. However, consider whether descheduler should use its own annotation key (e.g., "descheduler.sigs.k8s.io/restartedAt") to clearly indicate the source of the restart and avoid confusion with manual kubectl restarts. This would also make it easier to track which restarts were initiated by the descheduler versus manual operations.
| rs, err := pe.client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, rsName, metav1.GetOptions{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to get ReplicaSet %s/%s: %v", pod.Namespace, rsName, err) | ||
| } | ||
|
|
||
| var deployName string | ||
| for _, ref := range rs.OwnerReferences { | ||
| if ref.Kind == "Deployment" { | ||
| deployName = ref.Name | ||
| break | ||
| } | ||
| } | ||
| if deployName == "" { | ||
| return nil, nil | ||
| } | ||
|
|
||
| deploy, err := pe.client.AppsV1().Deployments(pod.Namespace).Get(ctx, deployName, metav1.GetOptions{}) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to get Deployment %s/%s: %v", pod.Namespace, deployName, err) | ||
| } |
There was a problem hiding this comment.
When the ReplicaSet or Deployment is not found (NotFound errors), the function returns an error which causes a fallback to normal eviction. However, NotFound errors for ReplicaSets or Deployments could be transient (e.g., during deletion) or permanent (orphaned pods). Consider distinguishing between NotFound errors and other errors: for NotFound errors, you might want to return (nil, nil) to indicate "not a Deployment pod" rather than an error, so the log message would be more accurate ("not owned by a Deployment" rather than "failed to resolve").
| } | ||
| isRecreate := deploy.Spec.Strategy.Type == appsv1.RecreateDeploymentStrategyType | ||
|
|
||
| if replicas != 1 || isRecreate || deploy.Status.UnavailableReplicas != 0 { |
There was a problem hiding this comment.
The check for deploy.Status.UnavailableReplicas != 0 may not be sufficient to detect unhealthy deployments. Consider also checking if the deployment is currently progressing (e.g., by examining deploy.Status.Conditions for the "Progressing" condition with status "False" or reason "ProgressDeadlineExceeded"). A deployment could have UnavailableReplicas == 0 but still be in a failed state if the previous rollout hasn't completed successfully.
| metrics.PodsEvictedTotal.With(map[string]string{"result": "success", "strategy": opts.StrategyName, "namespace": pod.Namespace, "node": pod.Spec.NodeName, "profile": opts.ProfileName}).Inc() | ||
| } | ||
|
|
||
| method := "eviction" | ||
| if pe.lastRolloutRestart { | ||
| method = "rollout-restart" | ||
| } | ||
|
|
||
| if pe.dryRun { | ||
| klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName) | ||
| klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method) | ||
| } else { | ||
| klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName) | ||
| klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method) | ||
| reason := opts.Reason | ||
| if len(reason) == 0 { | ||
| reason = opts.StrategyName |
There was a problem hiding this comment.
The rollout restart increments the eviction counters (nodePodCount, namespacePodCount, totalPodCount) even though no actual pod eviction occurs. This could be misleading for metrics and limits enforcement. Consider either: (1) not incrementing these counters for rollout restarts since no pod is immediately evicted, or (2) documenting this behavior clearly, as the rollout restart will eventually cause a pod to be terminated by the Deployment controller but at a different time than a direct eviction.
| deployKey := deploy.Namespace + "/" + deploy.Name | ||
| if pe.restartedDeployments[deployKey] { | ||
| klog.V(3).InfoS("Deployment already rollout-restarted this cycle, skipping", "deployment", deployKey, "pod", klog.KObj(pod)) | ||
| return true, true |
There was a problem hiding this comment.
When a deployment has already been rollout-restarted in this cycle and a second pod from the same deployment is processed, the function returns (true, true) meaning "handled=true, ignore=true". However, in the calling code (EvictPod), when ignore=true, the counters are NOT incremented (line 549-550 returns early). This creates an inconsistency: the first pod increments counters, but subsequent pods from the same deployment don't. This could lead to under-counting of affected pods in scenarios where multiple pods from a single-replica deployment are candidates for eviction.
| return true, true | |
| return true, false |
| } | ||
| } | ||
| pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName) | ||
| if pe.lastRolloutRestart { |
There was a problem hiding this comment.
Is it always guaranteed pe.lastRolloutRestart was set by pod.Namespace/pod.Name?
| } | ||
| pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName) | ||
| if pe.lastRolloutRestart { | ||
| pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod rollout-restarted (single-replica) from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName) |
There was a problem hiding this comment.
Annotating the corresponding deployment does not always guarantee the pod will get rolled out. Recording the event might be confusing.
| } | ||
| } | ||
| klog.V(1).InfoS("Triggered rollout restart for single-replica Deployment instead of eviction", "deployment", deployKey, "pod", klog.KObj(pod), "dryRun", pe.dryRun) | ||
| pe.restartedDeployments[deployKey] = true |
There was a problem hiding this comment.
The older deployment keys are not getting garbage collected.
EDIT: ResetCounters() resets it.
There was a problem hiding this comment.
This might not work well when the descheduling cycle is short.
|
In general the descheduler does not assume anything about a deployment's lifecycle. A pod is either evicted or it is not. What if annotating a deployment is not enough? What if there are other requirements to be met before a pod gets rolled out? The recommended solution here is to build a validating admission webhook that can handle all these concerns. I understand the current use case is a deployment with a single replica with goal to trigger the rolling update. Yet, not even this operation is atomic. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@ingvagabund some context for why this matters to us, then a response to the design question. We run on-demand environments where ~60 pods are scheduled at once. They frequently land on a single node and overload it. On the validating webhook recommendation - I want to make sure I'm not missing something, because as I understand it the webhook path doesn't fully replace this PR:
So the trade is "~600 lines additive in descheduler with fall-through-to-eviction on any failure" vs. "a webhook + controller deployment, with new failure modes for cluster eviction." I'd argue the former is the lighter footprint for users. That said - your concerns about lifecycle assumptions and non-atomicity are fair, and I'm willing to address them concretely:
Would the opt-in + pre-flight + the inline fixes be acceptable to you? If you'd still rather see this live entirely outside descheduler, I'll close in favor of that path - but I want to confirm that's the direction before reshaping the PR. |
Description
For Deployments with replicas=1 and RollingUpdate strategy, trigger a rollout restart (patch pod template annotation) instead of using the Pod Eviction API. This avoids downtime by letting the Deployment controller create a new pod before terminating the old one.
Falls through to normal eviction on errors, non-Deployment pods, multi-replica Deployments, or Recreate strategy.
Fixes #786 #1558
Checklist
Please ensure your pull request meets the following criteria before submitting
for review, these items will be used by reviewers to assess the quality and
completeness of your changes: