Skip to content

feat: rollout restart single-replica Deployments instead of evicting#1841

Open
dmitriy-myz wants to merge 1 commit into
kubernetes-sigs:masterfrom
dmitriy-myz:feature/rollout-restart-single-replica
Open

feat: rollout restart single-replica Deployments instead of evicting#1841
dmitriy-myz wants to merge 1 commit into
kubernetes-sigs:masterfrom
dmitriy-myz:feature/rollout-restart-single-replica

Conversation

@dmitriy-myz
Copy link
Copy Markdown

Description

For Deployments with replicas=1 and RollingUpdate strategy, trigger a rollout restart (patch pod template annotation) instead of using the Pod Eviction API. This avoids downtime by letting the Deployment controller create a new pod before terminating the old one.

Falls through to normal eviction on errors, non-Deployment pods, multi-replica Deployments, or Recreate strategy.
Fixes #786 #1558

Checklist

Please ensure your pull request meets the following criteria before submitting
for review, these items will be used by reviewers to assess the quality and
completeness of your changes:

  • Code Readability: Is the code easy to understand, well-structured, and consistent with project conventions?
  • Naming Conventions: Are variable, function, and structs descriptive and consistent?
  • Code Duplication: Is there any repeated code that should be refactored?
  • Function/Method Size: Are functions/methods short and focused on a single task?
  • Comments & Documentation: Are comments clear, useful, and not excessive? Were comments updated where necessary?
  • Error Handling: Are errors handled appropriately ?
  • Testing: Are there sufficient unit/integration tests?
  • Performance: Are there any obvious performance issues or unnecessary computations?
  • Dependencies: Are new dependencies justified ?
  • Logging & Monitoring: Is logging used appropriately (not too verbose, not too silent)?
  • Backward Compatibility: Does this change break any existing functionality or APIs?
  • Resource Management: Are resources (files, connections, memory) managed and released properly?
  • PR Description: Is the PR description clear, providing enough context and explaining the motivation for the change?
  • Documentation & Changelog: Are README and docs updated if necessary?

For Deployments with replicas=1 and RollingUpdate strategy, trigger a
rollout restart (patch pod template annotation) instead of using the
Pod Eviction API. This avoids downtime by letting the Deployment
controller create a new pod before terminating the old one.

Falls through to normal eviction on errors, non-Deployment pods,
multi-replica Deployments, or Recreate strategy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 24, 2026 08:38
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Feb 24, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: dmitriy-myz / name: Dmitry Muzyka (356a864)

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign knelasevero for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @dmitriy-myz!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @dmitriy-myz. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 24, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements a rollout restart mechanism for single-replica Deployments instead of using the Pod Eviction API. The goal is to avoid downtime during descheduling by allowing the Deployment controller to create a new pod before terminating the old one, addressing issues #786 and #1558.

Changes:

  • Modified eviction logic to detect single-replica Deployments with RollingUpdate strategy and trigger a rollout restart instead of direct eviction
  • Added RBAC permissions for reading ReplicaSets and Deployments, and patching Deployments
  • Added comprehensive test coverage for various scenarios including multi-replica, non-Deployment pods, Recreate strategy, unhealthy deployments, and deduplication

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/descheduler/evictions/evictions.go Implements rollout restart logic including owner chain resolution (Pod → ReplicaSet → Deployment), deployment patching, and integration with existing eviction flow
pkg/descheduler/evictions/evictions_test.go Adds comprehensive test coverage for rollout restart feature covering edge cases like multi-replica, non-Deployment pods, Recreate strategy, deduplication, dry-run mode, and unhealthy deployments
pkg/descheduler/kubeclientsandbox.go Registers ReplicaSets and Deployments informers in the sandbox for dry-run mode support
kubernetes/base/rbac.yaml Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments
charts/descheduler/templates/clusterrole.yaml Adds ClusterRole permissions for reading ReplicaSets and Deployments, and patching Deployments (Helm chart)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// rolloutRestartDeployment patches the Deployment's pod template annotation to trigger a rolling restart.
func (pe *PodEvictor) rolloutRestartDeployment(ctx context.Context, deploy *appsv1.Deployment) error {
patch := fmt.Sprintf(`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"%s"}}}}}`, time.Now().Format(time.RFC3339))
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The annotation key "kubectl.kubernetes.io/restartedAt" is used here to trigger a rollout restart, which matches the convention used by kubectl rollout restart. However, consider whether descheduler should use its own annotation key (e.g., "descheduler.sigs.k8s.io/restartedAt") to clearly indicate the source of the restart and avoid confusion with manual kubectl restarts. This would also make it easier to track which restarts were initiated by the descheduler versus manual operations.

Copilot uses AI. Check for mistakes.
Comment on lines +603 to +622
rs, err := pe.client.AppsV1().ReplicaSets(pod.Namespace).Get(ctx, rsName, metav1.GetOptions{})
if err != nil {
return nil, fmt.Errorf("failed to get ReplicaSet %s/%s: %v", pod.Namespace, rsName, err)
}

var deployName string
for _, ref := range rs.OwnerReferences {
if ref.Kind == "Deployment" {
deployName = ref.Name
break
}
}
if deployName == "" {
return nil, nil
}

deploy, err := pe.client.AppsV1().Deployments(pod.Namespace).Get(ctx, deployName, metav1.GetOptions{})
if err != nil {
return nil, fmt.Errorf("failed to get Deployment %s/%s: %v", pod.Namespace, deployName, err)
}
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the ReplicaSet or Deployment is not found (NotFound errors), the function returns an error which causes a fallback to normal eviction. However, NotFound errors for ReplicaSets or Deployments could be transient (e.g., during deletion) or permanent (orphaned pods). Consider distinguishing between NotFound errors and other errors: for NotFound errors, you might want to return (nil, nil) to indicate "not a Deployment pod" rather than an error, so the log message would be more accurate ("not owned by a Deployment" rather than "failed to resolve").

Copilot uses AI. Check for mistakes.
}
isRecreate := deploy.Spec.Strategy.Type == appsv1.RecreateDeploymentStrategyType

if replicas != 1 || isRecreate || deploy.Status.UnavailableReplicas != 0 {
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for deploy.Status.UnavailableReplicas != 0 may not be sufficient to detect unhealthy deployments. Consider also checking if the deployment is currently progressing (e.g., by examining deploy.Status.Conditions for the "Progressing" condition with status "False" or reason "ProgressDeadlineExceeded"). A deployment could have UnavailableReplicas == 0 but still be in a failed state if the previous rollout hasn't completed successfully.

Copilot uses AI. Check for mistakes.
Comment on lines 561 to 575
metrics.PodsEvictedTotal.With(map[string]string{"result": "success", "strategy": opts.StrategyName, "namespace": pod.Namespace, "node": pod.Spec.NodeName, "profile": opts.ProfileName}).Inc()
}

method := "eviction"
if pe.lastRolloutRestart {
method = "rollout-restart"
}

if pe.dryRun {
klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName)
klog.V(1).InfoS("Evicted pod in dry run mode", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method)
} else {
klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName)
klog.V(1).InfoS("Evicted pod", "pod", klog.KObj(pod), "reason", opts.Reason, "strategy", opts.StrategyName, "node", pod.Spec.NodeName, "profile", opts.ProfileName, "method", method)
reason := opts.Reason
if len(reason) == 0 {
reason = opts.StrategyName
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollout restart increments the eviction counters (nodePodCount, namespacePodCount, totalPodCount) even though no actual pod eviction occurs. This could be misleading for metrics and limits enforcement. Consider either: (1) not incrementing these counters for rollout restarts since no pod is immediately evicted, or (2) documenting this behavior clearly, as the rollout restart will eventually cause a pod to be terminated by the Deployment controller but at a different time than a direct eviction.

Copilot uses AI. Check for mistakes.
deployKey := deploy.Namespace + "/" + deploy.Name
if pe.restartedDeployments[deployKey] {
klog.V(3).InfoS("Deployment already rollout-restarted this cycle, skipping", "deployment", deployKey, "pod", klog.KObj(pod))
return true, true
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a deployment has already been rollout-restarted in this cycle and a second pod from the same deployment is processed, the function returns (true, true) meaning "handled=true, ignore=true". However, in the calling code (EvictPod), when ignore=true, the counters are NOT incremented (line 549-550 returns early). This creates an inconsistency: the first pod increments counters, but subsequent pods from the same deployment don't. This could lead to under-counting of affected pods in scenarios where multiple pods from a single-replica deployment are candidates for eviction.

Suggested change
return true, true
return true, false

Copilot uses AI. Check for mistakes.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 24, 2026
}
}
pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)
if pe.lastRolloutRestart {
Copy link
Copy Markdown
Contributor

@ingvagabund ingvagabund May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it always guaranteed pe.lastRolloutRestart was set by pod.Namespace/pod.Name?

}
pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod eviction from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)
if pe.lastRolloutRestart {
pe.eventRecorder.Eventf(pod, nil, v1.EventTypeNormal, reason, "Descheduled", "pod rollout-restarted (single-replica) from %v node by sigs.k8s.io/descheduler", pod.Spec.NodeName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotating the corresponding deployment does not always guarantee the pod will get rolled out. Recording the event might be confusing.

}
}
klog.V(1).InfoS("Triggered rollout restart for single-replica Deployment instead of eviction", "deployment", deployKey, "pod", klog.KObj(pod), "dryRun", pe.dryRun)
pe.restartedDeployments[deployKey] = true
Copy link
Copy Markdown
Contributor

@ingvagabund ingvagabund May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The older deployment keys are not getting garbage collected.

EDIT: ResetCounters() resets it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not work well when the descheduling cycle is short.

@ingvagabund
Copy link
Copy Markdown
Contributor

In general the descheduler does not assume anything about a deployment's lifecycle. A pod is either evicted or it is not. What if annotating a deployment is not enough? What if there are other requirements to be met before a pod gets rolled out? The recommended solution here is to build a validating admission webhook that can handle all these concerns.

I understand the current use case is a deployment with a single replica with goal to trigger the rolling update. Yet, not even this operation is atomic.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 11, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dmitriy-myz
Copy link
Copy Markdown
Author

@ingvagabund some context for why this matters to us, then a response to the design question.

We run on-demand environments where ~60 pods are scheduled at once. They frequently land on a single node and overload it.
We rely on the descheduler to rebalance, and our QA tests run against these envs immediately after provisioning - so eviction-driven downtime on single-replica deployments broke the test runs. This PR is the patch we've been running internally to make rebalance safe for that workflow.

On the validating webhook recommendation - I want to make sure I'm not missing something, because as I understand it the webhook path doesn't fully replace this PR:

  • A validating webhook can block an eviction, but it can't trigger a rollout restart. Patching the Deployment still has to happen somewhere. So the webhook approach is really "webhook + a separate controller that watches blocked evictions and patches Deployments" - strictly more components than this PR, not fewer.
  • Operationally, a webhook adds a stateful HA service to every cluster running descheduler, with cluster-wide blast radius: if the webhook pod is down with failurePolicy:
    Fail, all evictions break (including kubectl drain and node upgrades). With failurePolicy: Ignore it silently does nothing, which defeats the safety property. Plus cert routine.

So the trade is "~600 lines additive in descheduler with fall-through-to-eviction on any failure" vs. "a webhook + controller deployment, with new failure modes for cluster eviction." I'd argue the former is the lighter footprint for users.

That said - your concerns about lifecycle assumptions and non-atomicity are fair, and I'm willing to address them concretely:

  1. Opt-in via a feature gate, default off. Users who don't enable it get exactly today's behavior.
  2. Pre-flight guards before patching: skip the rollout-restart path when spec.paused, maxSurge: 0, or status.ObservedGeneration != generation, and fall through to normal
    eviction.
  3. Fix the inline comments: move lastRolloutRestart off the receiver, switch the dedup map to TTL-keyed by observedGeneration so it self-expires on short cycles, and only emit the rollout event after verifying progress.

Would the opt-in + pre-flight + the inline fixes be acceptable to you? If you'd still rather see this live entirely outside descheduler, I'll close in favor of that path - but I want to confirm that's the direction before reshaping the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

descheduler when ReplicaSet=1

5 participants