feat: rollout restart single-replica Deployments instead of evicting by dmitriy-myz · Pull Request #1841 · kubernetes-sigs/descheduler

dmitriy-myz · 2026-02-24T08:38:07Z

Description

For Deployments with replicas=1 and RollingUpdate strategy, trigger a rollout restart (patch pod template annotation) instead of using the Pod Eviction API. This avoids downtime by letting the Deployment controller create a new pod before terminating the old one.

Falls through to normal eviction on errors, non-Deployment pods, multi-replica Deployments, or Recreate strategy.
Fixes #786 #1558

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

linux-foundation-easycla · 2026-02-24T08:38:14Z

The committers listed above are authorized under a signed CLA.

✅ login: dmitriy-myz / name: Dmitry Muzyka (356a864)

k8s-ci-robot · 2026-02-24T08:38:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign knelasevero for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-02-24T08:38:16Z

Welcome @dmitriy-myz!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-02-24T08:38:16Z

Hi @dmitriy-myz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copilot

Pull request overview

This pull request implements a rollout restart mechanism for single-replica Deployments instead of using the Pod Eviction API. The goal is to avoid downtime during descheduling by allowing the Deployment controller to create a new pod before terminating the old one, addressing issues #786 and #1558.

Changes:

Modified eviction logic to detect single-replica Deployments with RollingUpdate strategy and trigger a rollout restart instead of direct eviction
Added RBAC permissions for reading ReplicaSets and Deployments, and patching Deployments
Added comprehensive test coverage for various scenarios including multi-replica, non-Deployment pods, Recreate strategy, unhealthy deployments, and deduplication

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File	Description
pkg/descheduler/evictions/evictions.go	Implements rollout restart logic including owner chain resolution (Pod → ReplicaSet → Deployment), deployment patching, and integration with existing eviction flow
pkg/descheduler/evictions/evictions_test.go	Adds comprehensive test coverage for rollout restart feature covering edge cases like multi-replica, non-Deployment pods, Recreate strategy, deduplication, dry-run mode, and unhealthy deployments
pkg/descheduler/kubeclientsandbox.go	Registers ReplicaSets and Deployments informers in the sandbox for dry-run mode support
kubernetes/base/rbac.yaml	Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments
charts/descheduler/templates/clusterrole.yaml	Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments (Helm chart)

Copilot · 2026-02-24T08:43:48Z

+
+// rolloutRestartDeployment patches the Deployment's pod template annotation to trigger a rolling restart.
+func (pe *PodEvictor) rolloutRestartDeployment(ctx context.Context, deploy *appsv1.Deployment) error {
+	patch := fmt.Sprintf(`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"%s"}}}}}`, time.Now().Format(time.RFC3339))


The annotation key "kubectl.kubernetes.io/restartedAt" is used here to trigger a rollout restart, which matches the convention used by kubectl rollout restart. However, consider whether descheduler should use its own annotation key (e.g., "descheduler.sigs.k8s.io/restartedAt") to clearly indicate the source of the restart and avoid confusion with manual kubectl restarts. This would also make it easier to track which restarts were initiated by the descheduler versus manual operations.

Copilot · 2026-02-24T08:43:49Z

+	rs, err := pe.client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, rsName, metav1.GetOptions{})
+	if err != nil {
+		return nil, fmt.Errorf("failed to get ReplicaSet %s/%s: %v", pod.Namespace, rsName, err)
+	}
+
+	var deployName string
+	for _, ref := range rs.OwnerReferences {
+		if ref.Kind == "Deployment" {
+			deployName = ref.Name
+			break
+		}
+	}
+	if deployName == "" {
+		return nil, nil
+	}
+
+	deploy, err := pe.client.AppsV1().Deployments(pod.Namespace).Get(ctx, deployName, metav1.GetOptions{})
+	if err != nil {
+		return nil, fmt.Errorf("failed to get Deployment %s/%s: %v", pod.Namespace, deployName, err)
+	}


When the ReplicaSet or Deployment is not found (NotFound errors), the function returns an error which causes a fallback to normal eviction. However, NotFound errors for ReplicaSets or Deployments could be transient (e.g., during deletion) or permanent (orphaned pods). Consider distinguishing between NotFound errors and other errors: for NotFound errors, you might want to return (nil, nil) to indicate "not a Deployment pod" rather than an error, so the log message would be more accurate ("not owned by a Deployment" rather than "failed to resolve").

Copilot · 2026-02-24T08:43:49Z

+	}
+	isRecreate := deploy.Spec.Strategy.Type == appsv1.RecreateDeploymentStrategyType
+
+	if replicas != 1 || isRecreate || deploy.Status.UnavailableReplicas != 0 {


The check for deploy.Status.UnavailableReplicas != 0 may not be sufficient to detect unhealthy deployments. Consider also checking if the deployment is currently progressing (e.g., by examining deploy.Status.Conditions for the "Progressing" condition with status "False" or reason "ProgressDeadlineExceeded"). A deployment could have UnavailableReplicas == 0 but still be in a failed state if the previous rollout hasn't completed successfully.
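A minimal sketch of the stricter health check described above, assuming simplified local types in place of appsv1.DeploymentStatus and appsv1.DeploymentCondition. rolloutHealthy is a hypothetical name; the "Progressing" condition type and the "ProgressDeadlineExceeded" reason are the values named in this comment.

```go
package main

import "fmt"

// condition and deploymentStatus are simplified stand-ins for the apps/v1
// types; only the fields needed for the check are included.
type condition struct {
	Type   string
	Status string
	Reason string
}

type deploymentStatus struct {
	UnavailableReplicas int32
	Conditions          []condition
}

// rolloutHealthy reports whether the Deployment looks safe to restart: no
// unavailable replicas, and no Progressing condition that is False or that
// carries the ProgressDeadlineExceeded reason.
func rolloutHealthy(s deploymentStatus) bool {
	if s.UnavailableReplicas != 0 {
		return false
	}
	for _, c := range s.Conditions {
		if c.Type != "Progressing" {
			continue
		}
		if c.Status == "False" || c.Reason == "ProgressDeadlineExceeded" {
			return false
		}
	}
	return true
}

func main() {
	stalled := deploymentStatus{
		Conditions: []condition{
			{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
		},
	}
	fmt.Println(rolloutHealthy(stalled)) // prints "false"
}
```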

Copilot · 2026-02-24T08:43:49Z

 		metrics.PodsEvictedTotal.With(map[string]string{"result": "success", "strategy": opts.StrategyName, "namespace": pod.Namespace, "node": pod.Spec.NodeName, "profile": opts.ProfileName}).Inc()
 	}

+	method := "eviction"
+	if pe.lastRolloutRestart {
+		method = "rollout-restart"
+	}
+
 	if pe.dryRun {
-		klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName)
+		klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method)
 	} else {
-		klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName)
+		klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method)
 		reason := opts.Reason
 		if len(reason) == 0 {
 			reason = opts.StrategyName


The rollout restart increments the eviction counters (nodePodCount, namespacePodCount, totalPodCount) even though no actual pod eviction occurs. This could be misleading for metrics and limits enforcement. Consider either: (1) not incrementing these counters for rollout restarts since no pod is immediately evicted, or (2) documenting this behavior clearly, as the rollout restart will eventually cause a pod to be terminated by the Deployment controller but at a different time than a direct eviction.

Copilot · 2026-02-24T08:43:49Z

+	deployKey := deploy.Namespace + "/" + deploy.Name
+	if pe.restartedDeployments[deployKey] {
+		klog.V(3).InfoS("Deployment already rollout-restarted this cycle, skipping", "deployment", deployKey, "pod", klog.KObj(pod))
+		return true, true


When a deployment has already been rollout-restarted in this cycle and a second pod from the same deployment is processed, the function returns (true, true) meaning "handled=true, ignore=true". However, in the calling code (EvictPod), when ignore=true, the counters are NOT incremented (line 549-550 returns early). This creates an inconsistency: the first pod increments counters, but subsequent pods from the same deployment don't. This could lead to under-counting of affected pods in scenarios where multiple pods from a single-replica deployment are candidates for eviction.

Suggested change

-		return true, true
+		return true, false

ingvagabund · 2026-05-11T11:20:34Z

 			}
 		}
-		pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)
+		if pe.lastRolloutRestart {


Is it always guaranteed pe.lastRolloutRestart was set by pod.Namespace/pod.Name?
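One way to make the answer to this question trivially "yes" is to return the method as part of a per-pod result instead of reading it back from a field on the receiver. The sketch below is hypothetical throughout: handlePod, outcome, and the singleReplica flag are illustrative stand-ins and do not appear in the PR.

```go
package main

import "fmt"

type evictionMethod string

const (
	methodEviction       evictionMethod = "eviction"
	methodRolloutRestart evictionMethod = "rollout-restart"
)

// outcome ties the method to the pod it was chosen for, so a later log or
// event call cannot pick up state left behind by a different pod.
type outcome struct {
	Pod    string
	Method evictionMethod
}

// handlePod is a hypothetical stand-in for the eviction entry point;
// singleReplica replaces the PR's real eligibility check.
func handlePod(pod string, singleReplica bool) outcome {
	if singleReplica {
		return outcome{Pod: pod, Method: methodRolloutRestart}
	}
	return outcome{Pod: pod, Method: methodEviction}
}

func main() {
	for _, o := range []outcome{
		handlePod("qa/api-7f9c", true),
		handlePod("qa/worker-1b2d", false),
	} {
		fmt.Printf("pod=%s method=%s\n", o.Pod, o.Method)
	}
}
```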

ingvagabund · 2026-05-11T11:30:45Z

 		}
-		pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)
+		if pe.lastRolloutRestart {
+			pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod rollout-restarted (single-replica) from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)


Annotating the corresponding deployment does not always guarantee the pod will get rolled out. Recording the event might be confusing.

ingvagabund · 2026-05-11T11:34:38Z

+		}
+	}
+	klog.V(1).InfoS("Triggered rollout restart for single-replica Deployment instead of eviction", "deployment", deployKey, "pod", klog.KObj(pod), "dryRun", pe.dryRun)
+	pe.restartedDeployments[deployKey] = true


The older deployment keys are not getting garbage collected.

EDIT: ResetCounters() resets it.

This might not work well when the descheduling cycle is short.

ingvagabund · 2026-05-11T11:44:06Z

In general the descheduler does not assume anything about a deployment's lifecycle. A pod is either evicted or it is not. What if annotating a deployment is not enough? What if there are other requirements to be met before a pod gets rolled out? The recommended solution here is to build a validating admission webhook that can handle all these concerns.

I understand the current use case is a single-replica Deployment, with the goal of triggering a rolling update. Yet not even this operation is atomic.

k8s-ci-robot · 2026-05-11T11:44:15Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

dmitriy-myz · 2026-05-14T13:24:58Z

@ingvagabund some context for why this matters to us, then a response to the design question.

We run on-demand environments where ~60 pods are scheduled at once. They frequently land on a single node and overload it.
We rely on the descheduler to rebalance, and our QA tests run against these envs immediately after provisioning - so eviction-driven downtime on single-replica deployments broke the test runs. This PR is the patch we've been running internally to make rebalance safe for that workflow.

On the validating webhook recommendation - I want to make sure I'm not missing something, because as I understand it the webhook path doesn't fully replace this PR:

A validating webhook can block an eviction, but it can't trigger a rollout restart. Patching the Deployment still has to happen somewhere. So the webhook approach is really "webhook + a separate controller that watches blocked evictions and patches Deployments" - strictly more components than this PR, not fewer.
Operationally, a webhook adds a stateful HA service to every cluster running descheduler, with cluster-wide blast radius: if the webhook pod is down with failurePolicy: Fail, all evictions break (including kubectl drain and node upgrades). With failurePolicy: Ignore it silently does nothing, which defeats the safety property. Certificate management adds further routine work.

So the trade is "~600 lines additive in descheduler with fall-through-to-eviction on any failure" vs. "a webhook + controller deployment, with new failure modes for cluster eviction." I'd argue the former is the lighter footprint for users.

That said - your concerns about lifecycle assumptions and non-atomicity are fair, and I'm willing to address them concretely:

Opt-in via a feature gate, default off. Users who don't enable it get exactly today's behavior.
Pre-flight guards before patching: skip the rollout-restart path when spec.paused, maxSurge: 0, or status.ObservedGeneration != generation, and fall through to normal eviction.
Fix the inline comments: move lastRolloutRestart off the receiver, switch the dedup map to TTL-keyed by observedGeneration so it self-expires on short cycles, and only emit the rollout event after verifying progress.
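The pre-flight guards listed above could be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: deploymentView is a simplified stand-in for appsv1.Deployment, MaxSurge is assumed to be already resolved to an absolute pod count (the real field is an int-or-percentage value), and preflight is a hypothetical name.

```go
package main

import "fmt"

// deploymentView holds only the fields the guards need.
type deploymentView struct {
	Paused             bool
	MaxSurge           int // assumed already resolved to an absolute number
	Generation         int64
	ObservedGeneration int64
}

// preflight reports whether the rollout-restart path may be used. When it
// returns false, the caller falls through to normal eviction and can log the
// reason.
func preflight(d deploymentView) (bool, string) {
	switch {
	case d.Paused:
		return false, "deployment is paused"
	case d.MaxSurge == 0:
		return false, "maxSurge is 0, so no replacement pod is created first"
	case d.ObservedGeneration != d.Generation:
		return false, "controller has not observed the latest spec"
	}
	return true, ""
}

func main() {
	ok, reason := preflight(deploymentView{MaxSurge: 0, Generation: 4, ObservedGeneration: 4})
	fmt.Printf("ok=%v reason=%q\n", ok, reason)
}
```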

Would the opt-in + pre-flight + the inline fixes be acceptable to you? If you'd still rather see this live entirely outside descheduler, I'll close in favor of that path - but I want to confirm that's the direction before reshaping the PR.

Copilot AI review requested due to automatic review settings February 24, 2026 08:38

k8s-ci-robot requested review from JaneLiuL and jklaw90 February 24, 2026 08:38

k8s-ci-robot added the labels needs-ok-to-test (a PR that requires an org member to verify it is safe to test), size/XL (a PR that changes 500-999 lines, ignoring generated files), and cncf-cla: no (the PR's author has not signed the CNCF CLA) Feb 24, 2026

Copilot started reviewing on behalf of dmitriy-myz February 24, 2026 08:38

Copilot AI reviewed Feb 24, 2026

k8s-ci-robot added the label cncf-cla: yes (the PR's author has signed the CNCF CLA) and removed the label cncf-cla: no (the PR's author has not signed the CNCF CLA) Feb 24, 2026

vazyzy approved these changes Apr 1, 2026

ingvagabund reviewed May 11, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026

5 participants
