fix(removefailedpods): allow opt-in eviction of failed system-critical pods by jawwad-ali · Pull Request #1865 · kubernetes-sigs/descheduler

jawwad-ali · 2026-05-02T06:25:09Z

Description

The RemoveFailedPods plugin chains handle.Evictor().Filter into its candidate-listing filter, which protects all system-critical-priority pods regardless of pod.Status.Phase. As reported in #1775, this prevents cleanup of Failed-phase system-critical pods (e.g. a stuck cert-manager pod). Disabling SystemCriticalPods protection at the DefaultEvictor level is not an acceptable workaround because it would also expose living system-critical pods to eviction.

This PR adds an opt-in IncludingSystemCriticalPods bool field to RemoveFailedPodsArgs. When the operator explicitly sets it, the plugin replaces handle.Evictor().Filter with an unconditional permit at candidate-listing time. The eviction is still gated by:

the existing PodFailed phase check
validateCanEvict (Reasons, ExitCodes, MinPodLifetimeSeconds, ExcludeOwnerKinds, namespaces, labels)
handle.Evictor().PreEvictionFilter

Default behaviour is unchanged — both living and failed system-critical pods stay protected unless the operator opts in.
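
As a rough illustration of the opt-in described above, the listing-time filter could be modelled as follows. This is a minimal sketch, not the PR's actual code: `Pod`, `defaultEvictorFilter`, and `candidateFilter` are hypothetical stand-ins for `v1.Pod`, `handle.Evictor().Filter`, and the plugin's candidate filter, and the later gates (validateCanEvict, PreEvictionFilter) are left out.

```go
package main

import "fmt"

// Pod is a hypothetical, stripped-down stand-in for v1.Pod.
type Pod struct {
	Name           string
	Phase          string // e.g. "Failed" or "Running"
	SystemCritical bool   // true when the pod has system-critical priority
}

// defaultEvictorFilter stands in for handle.Evictor().Filter. In this sketch
// it only models the system-critical protection.
func defaultEvictorFilter(p Pod) bool {
	return !p.SystemCritical
}

// candidateFilter builds the listing-time filter. With the opt-in flag set,
// the evictor filter is swapped for an unconditional permit; the Failed phase
// check applies either way.
func candidateFilter(includingSystemCriticalPods bool) func(Pod) bool {
	base := defaultEvictorFilter
	if includingSystemCriticalPods {
		base = func(Pod) bool { return true }
	}
	return func(p Pod) bool {
		return p.Phase == "Failed" && base(p)
	}
}

func main() {
	pod := Pod{Name: "cert-manager", Phase: "Failed", SystemCritical: true}
	fmt.Println("default:", candidateFilter(false)(pod))
	fmt.Println("opt-in: ", candidateFilter(true)(pod))
}
```

In this model a Running system-critical pod is never a candidate, because the phase check runs regardless of the flag.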

This addresses @googs1025's "by design" comment on the issue: the framework-level protection stays the framework default; the new flag is a per-plugin, per-operator escape valve scoped specifically to the cleanup-stuck-failed-pods use case, without the broader blast radius of disabling SystemCriticalPods in DefaultEvictor.

Posted the design proposal on the issue thread first (#1775 (comment)) — happy to make the toggle finer-grained (e.g. a per-protection struct) if reviewers prefer.

Trade-off

The flag also bypasses the framework's other default protections (DaemonSet, StaticPod, LocalStorage, FailedBarePods) for failed pods when set. The argument: failed pods are by definition not running, so DaemonSet/StaticPod semantics don't apply the same way, and the operator has explicitly opted in. If reviewers prefer to keep those protections enforced even when this flag is set, the implementation can be tightened to bypass only the system-critical check.

Test plan

go test ./pkg/framework/plugins/removefailedpods/... -v — all existing 25 cases pass plus 2 new cases:

system-critical priority failed pod is protected by default (asserts default behaviour is unchanged)
includingSystemCriticalPods=true, system-critical priority failed pod is evicted (asserts the opt-in works)

go vet ./pkg/framework/plugins/removefailedpods/... clean. go build ./... clean.

Checklist

k8s-ci-robot · 2026-05-02T06:25:19Z

Welcome @jawwad-ali!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-05-02T06:25:20Z

Hi @jawwad-ali. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Adds IncludingSystemCriticalPods to RemoveFailedPodsArgs. When the operator explicitly sets the flag, the RemoveFailedPods plugin bypasses the framework default-evictor filter at candidate-listing time, allowing failed pods that carry system-critical priority to be considered for eviction. The PodFailed phase check, validateCanEvict (Reasons/ExitCodes/MinPodLifetimeSeconds/ExcludeOwnerKinds), and PreEvictionFilter still gate eviction. Default behaviour is unchanged: living and failed system-critical pods stay protected unless the operator opts in.

googs1025 · 2026-05-02T07:02:40Z

/ok-to-test

googs1025 · 2026-05-05T01:03:53Z

 type RemoveFailedPodsArgs struct {
 	metav1.TypeMeta `json:",inline"`

-	Namespaces              *api.Namespaces       `json:"namespaces,omitempty"`
-	LabelSelector           *metav1.LabelSelector `json:"labelSelector,omitempty"`
-	ExcludeOwnerKinds       []string              `json:"excludeOwnerKinds,omitempty"`
-	MinPodLifetimeSeconds   *uint                 `json:"minPodLifetimeSeconds,omitempty"`
-	Reasons                 []string              `json:"reasons,omitempty"`
-	ExitCodes               []int32               `json:"exitCodes,omitempty"`
-	IncludingInitContainers bool                  `json:"includingInitContainers,omitempty"`
+	Namespaces                  *api.Namespaces       `json:"namespaces,omitempty"`
+	LabelSelector               *metav1.LabelSelector `json:"labelSelector,omitempty"`
+	ExcludeOwnerKinds           []string              `json:"excludeOwnerKinds,omitempty"`
+	MinPodLifetimeSeconds       *uint                 `json:"minPodLifetimeSeconds,omitempty"`
+	Reasons                     []string              `json:"reasons,omitempty"`
+	ExitCodes                   []int32               `json:"exitCodes,omitempty"`
+	IncludingInitContainers     bool                  `json:"includingInitContainers,omitempty"`
+	IncludingSystemCriticalPods bool                  `json:"includingSystemCriticalPods,omitempty"`
 }


Small consistency thing: we already have EvictSystemCriticalPods on DefaultEvictor, and the docs/issues all use that wording. Could we name this one along the same lines (e.g. EvictFailedSystemCriticalPods) so users don't have to keep two mental models for what's essentially the same concept at different scopes?

Done in f05b631 — renamed to EvictFailedSystemCriticalPods (and evictFailedSystemCriticalPods JSON tag) so it parallels EvictSystemCriticalPods on DefaultEvictor.

googs1025 · 2026-05-05T01:08:16Z

@@ -25,11 +25,12 @@ import (
 type RemoveFailedPodsArgs struct {


Have you double-checked that the new field is wired through any versioned API types as well? If RemoveFailedPodsArgs has a counterpart under pkg/api/v1alphaX/, the YAML decode path will silently drop includingSystemCriticalPods and the flag will look like it does nothing in real configs even though tests pass. Probably fine, but worth a grep -r RemoveFailedPodsArgs pkg/api/ to confirm.

Checked — grep -rn RemoveFailedPodsArgs pkg/api/ returns nothing. Descheduler doesn't use the apimachinery per-plugin versioned-types pattern; plugin args are registered via pluginregistry.Register and decoded through pluginArgConversionScheme with the internal RemoveFailedPodsArgs as the target type, so the new field flows through the YAML decode path without a v1alpha2 shadow. The existing zz_generated.deepcopy.go uses *out = *in which already covers the new bool.
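
The concern raised above, that a decode hop through a type lacking the field would silently drop it, can be illustrated with Go's standard `encoding/json`, which ignores unknown keys by default. This is only an analogy: descheduler decodes plugin args through its own scheme, and `staleArgs` is a hypothetical type that does not exist in the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// currentArgs carries the two bool fields shown in the diff, with the same JSON tags.
type currentArgs struct {
	IncludingInitContainers     bool `json:"includingInitContainers,omitempty"`
	IncludingSystemCriticalPods bool `json:"includingSystemCriticalPods,omitempty"`
}

// staleArgs is a HYPOTHETICAL counterpart that was never given the new field.
type staleArgs struct {
	IncludingInitContainers bool `json:"includingInitContainers,omitempty"`
}

// decodeDirect decodes straight into the type that has the field.
func decodeDirect(raw []byte) (bool, error) {
	var args currentArgs
	if err := json.Unmarshal(raw, &args); err != nil {
		return false, err
	}
	return args.IncludingSystemCriticalPods, nil
}

// decodeViaStale hops through staleArgs first. The unknown key is ignored
// without an error, so the flag is lost before it reaches currentArgs.
func decodeViaStale(raw []byte) (bool, error) {
	var stale staleArgs
	if err := json.Unmarshal(raw, &stale); err != nil {
		return false, err
	}
	hop, err := json.Marshal(stale)
	if err != nil {
		return false, err
	}
	var args currentArgs
	if err := json.Unmarshal(hop, &args); err != nil {
		return false, err
	}
	return args.IncludingSystemCriticalPods, nil
}

func main() {
	raw := []byte(`{"includingInitContainers":true,"includingSystemCriticalPods":true}`)
	direct, _ := decodeDirect(raw)
	viaStale, _ := decodeViaStale(raw)
	fmt.Println("direct:", direct, "via stale type:", viaStale)
}
```

A test that builds the args struct in Go would not catch the second case, which is why checking the real decode path matters.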

googs1025 · 2026-05-05T01:13:52Z

Reading the code, I think this flag as written is named IncludingSystemCriticalPods but actually replaces the whole DefaultEvictor.Filter (DaemonSet/StaticPod/LocalStorage all fall away too). More importantly, even if we tighten the implementation, putting this knob at the plugin level creates a precedence problem: a plugin-level flag would be able to override the framework-level EvictSystemCriticalPods decision, which inverts the usual "framework is the ceiling, plugins can only be more restrictive" convention.
If we ever later add a framework-level EvictFailedSystemCriticalPods to address the same need at the right layer, the two flags would conflict in confusing ways (plugin says yes, framework says no: who wins?).

Address review feedback on kubernetes-sigs#1865:

- Rename IncludingSystemCriticalPods to EvictFailedSystemCriticalPods to align with the existing EvictSystemCriticalPods wording on DefaultEvictor and avoid two mental models for the same concept at different scopes.
- Tighten the filter override so only pods that are both PodFailed and system-critical bypass the framework default-evictor filter. All other DefaultEvictor protections (DaemonSet, StaticPod, LocalStorage, FailedBarePods) continue to apply for failed pods, closing the over-broad bypass that was previously documented as a trade-off in the PR body.

k8s-ci-robot · 2026-05-05T15:52:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ingvagabund. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jawwad-ali · 2026-05-05T15:53:47Z

Thanks for the careful read @googs1025 — both points landed.

Over-broad bypass: fixed in f05b631. The filter override is now scoped to pods that are both PodFailed and utils.IsCriticalPriorityPod; everything else (DaemonSet, StaticPod, LocalStorage, FailedBarePods) still goes through handle.Evictor().Filter unchanged. The PR body's earlier "this also bypasses other protections" trade-off no longer applies.
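
A minimal sketch of the tightened override as described above, under stated assumptions: `Pod` and `frameworkFilter` are hypothetical stand-ins for `v1.Pod` and `handle.Evictor().Filter`, and only two of the framework protections are modelled. It follows the description literally rather than reproducing commit f05b631.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for v1.Pod with only the attributes needed here.
type Pod struct {
	Failed           bool // models pod.Status.Phase == v1.PodFailed
	CriticalPriority bool // models utils.IsCriticalPriorityPod(pod)
	DaemonSet        bool // models ownership by a DaemonSet
}

// frameworkFilter stands in for handle.Evictor().Filter and models two of its
// default protections: system-critical priority and DaemonSet ownership.
func frameworkFilter(p Pod) bool {
	return !p.CriticalPriority && !p.DaemonSet
}

// scopedFilter lets a pod skip frameworkFilter only when the flag is set and
// the pod is both failed and critical-priority. Every other pod is decided by
// frameworkFilter unchanged.
func scopedFilter(evictFailedSystemCriticalPods bool) func(Pod) bool {
	return func(p Pod) bool {
		if evictFailedSystemCriticalPods && p.Failed && p.CriticalPriority {
			return true
		}
		return frameworkFilter(p)
	}
}

func main() {
	filter := scopedFilter(true)
	fmt.Println("failed critical pod:  ", filter(Pod{Failed: true, CriticalPriority: true}))
	fmt.Println("failed DaemonSet pod: ", filter(Pod{Failed: true, DaemonSet: true}))
	fmt.Println("running critical pod: ", filter(Pod{CriticalPriority: true}))
}
```

Compared with the earlier unconditional permit, a failed DaemonSet pod without critical priority stays protected here even when the flag is set.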

Precedence inversion: this one is worth a real decision before I push more code, so I want to ask rather than guess.

The remaining concern as I read it: even with the implementation tightened, a plugin-level EvictFailedSystemCriticalPods knob lets a plugin override a framework-level EvictSystemCriticalPods=false decision, which inverts the usual "framework is the ceiling, plugins can only narrow" convention. If you later add a framework-level EvictFailedSystemCriticalPods to DefaultEvictorArgs, the two flags collide.

Two paths from here, happy to take whichever you prefer:

(1) Keep this PR plugin-scoped, document the precedence interaction (plugin opt-in implies overriding the framework's SystemCriticalPods protection for failed pods only), and treat a future framework-level toggle as an additive change rather than a conflict (when present, the framework-level flag is the source of truth and the plugin-level one becomes a no-op or is deprecated).
(2) Move the toggle to the framework: close this PR (or rebuild it), add EvictFailedSystemCriticalPods to DefaultEvictorArgs, and have applySystemCriticalPodsProtection skip its check when pod.Status.Phase == v1.PodFailed && args.EvictFailedSystemCriticalPods. RemoveFailedPods then needs no plugin-level flag at all; users opt in once at the evictor level. This is the architecturally cleaner direction and aligns with the existing EvictSystemCriticalPods knob's location.
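
The second path (moving the toggle to the framework) could look roughly like the sketch below. Everything here is an assumption for illustration: `evictorArgs` is a reduced, hypothetical version of DefaultEvictorArgs, the EvictFailedSystemCriticalPods field is only proposed in this thread, and `passesSystemCriticalProtection` merely mimics the condition described for applySystemCriticalPodsProtection.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for v1.Pod.
type Pod struct {
	Failed           bool // models pod.Status.Phase == v1.PodFailed
	CriticalPriority bool // models system-critical priority
}

// evictorArgs is a reduced, HYPOTHETICAL version of DefaultEvictorArgs.
type evictorArgs struct {
	EvictSystemCriticalPods       bool // existing knob named in the thread
	EvictFailedSystemCriticalPods bool // proposed knob, not an existing field
}

// passesSystemCriticalProtection reports whether the system-critical
// protection lets the pod through to the remaining checks.
func passesSystemCriticalProtection(args evictorArgs, p Pod) bool {
	if !p.CriticalPriority {
		return true // protection does not apply to ordinary pods
	}
	if args.EvictSystemCriticalPods {
		return true // broad opt-in covers living and failed critical pods
	}
	// Narrow opt-in: only failed critical pods skip the protection.
	return p.Failed && args.EvictFailedSystemCriticalPods
}

func main() {
	args := evictorArgs{EvictFailedSystemCriticalPods: true}
	fmt.Println("failed critical: ", passesSystemCriticalProtection(args, Pod{Failed: true, CriticalPriority: true}))
	fmt.Println("running critical:", passesSystemCriticalProtection(args, Pod{CriticalPriority: true}))
}
```

With the decision made in one place, no plugin can grant more than the framework allows, which is the "framework is the ceiling" convention raised in review.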

I'm leaning toward (2) because it sidesteps the precedence-inversion question entirely and the call-site reads more naturally (evictFailedSystemCriticalPods lives next to evictSystemCriticalPods in the same args block). Path (1) keeps this PR small but inherits the issue you raised.

Which way do you want me to take it? Happy to redo this PR end-to-end at the framework level if (2) is the answer.

ingvagabund · 2026-05-05T19:30:06Z

Dropped comment in #1775 (comment)

jawwad-ali · 2026-05-05T19:33:03Z

Thanks @ingvagabund — that profile-scoped workaround is a great pointer; I hadn't fully appreciated that disabling SystemCriticalPods in a dedicated RemoveFailedPods-only profile sidesteps the issue today without any code change.

Your last line is the part I want to make sure I'm reading right:

Every pod identified by RemoveFailedPods plugin as evictable is a failed pod by the definition. So the SystemCriticalPods gating can be lifted.

Taken at face value, that argument applies just as cleanly to the plugin's own filter as it does to a user-supplied profile: if every candidate the plugin will ever evict is, by definition, in PodFailed, the framework-level SystemCriticalPods gate is logically redundant for this plugin. Which opens a third option I hadn't considered:

Option C: unconditional lift, no flag. Drop EvictFailedSystemCriticalPods from RemoveFailedPodsArgs entirely; in New(), the filter for RemoveFailedPods always bypasses the SystemCriticalPods check, and only that check (DaemonSet, StaticPod, LocalStorage, FailedBarePods stay enforced). Smallest possible diff, no precedence-inversion question, and the behaviour matches your "by definition" argument.

That sidesteps both your point and @googs1025's earlier precedence-inversion concern, at the cost of a silent behaviour change for any operator currently relying on SystemCriticalPods to also block failed system-critical pods from RemoveFailedPods (release-note worthy, but I'd argue it is a bug fix rather than a breaking change, given the plugin's contract).

So I can see three paths, and I'm happy to take whichever you and @googs1025 converge on:

(1) Close this PR, document the dedicated-profile workaround in the issue / README, leave plugin code untouched.
(2) Option C above: unconditional lift, no flag. Smallest code, behaviour change called out in release notes.
(3) Keep this PR's shape (opt-in EvictFailedSystemCriticalPods flag, tightened to system-critical-only as in f05b631a), which preserves backwards compatibility but accepts the precedence inversion @googs1025 flagged.

I lean toward (2) given your "by definition" framing — it's the cleanest read of the plugin's contract — but happy to go (1) or (3) if either of you sees a reason I'm missing.

ingvagabund · 2026-05-05T19:40:26Z

The current policy API is designed to allow customization of what is considered a failed pod and whether even failed critical pods should be evicted or not. It's perfectly fine to keep the defaults and not evict failed critical pods, as in some cases a different eviction mechanism might need to be deployed due to different requirements, e.g. lifecycle requirements. For example, some application installations may have different company policies for evicting critical pods.

ingvagabund · 2026-05-05T19:42:02Z

Maybe I misunderstood the description of the issue in #1775. Yet the goal is to allow eviction of failed critical pods by the RemoveFailedPods plugin, which is already possible by configuring two profiles, each with different DefaultEvictor arguments.

jawwad-ali · 2026-05-05T19:44:59Z

Thanks for the clarification @ingvagabund — that's a useful correction. I over-read your earlier "the SystemCriticalPods gating can be lifted" as a code-change recommendation; in context (and reading your two follow-up messages together) it's clearly justification for why the dedicated-profile workaround is safe, not a steer to change the plugin's filter chain.

Reflecting your position back to make sure I have it right:

The current policy API is intentional — defaulting to protect failed critical pods is the right default because operators have varying compliance/lifecycle policies.
The use case in #1775 ("Failed Pods with system critical priority are not removed by RemoveFailedPods plugin") is already addressable today by configuring a RemoveFailedPods-only profile with SystemCriticalPods in PodProtections.DefaultDisabled.
The plugin's filter chain therefore doesn't need a per-plugin escape valve.

If that read is correct, the right outcome is for me to close this PR and (optionally) leave a short comment on #1775 pointing the reporter at your YAML example so the issue can also be closed. Happy to do both.

Want me to wait for @googs1025 to weigh in before closing, or should I close it now? Either is fine — just want to follow whichever cadence the team prefers.

ingvagabund · 2026-05-06T10:01:56Z

As long as you can validate the suggested existing solution works there's nothing else to do. Maybe adding an example for the plugin under https://github.com/kubernetes-sigs/descheduler#removefailedpods? At this point it is up to @alex-berger to tell us whether the suggested solution works.

alex-berger · 2026-05-06T10:32:22Z

The suggested solution looks good to me, and once it is released, I can test it on our clusters.

ingvagabund · 2026-05-06T10:40:24Z

The currently suggested solution is already part of the releases: #1775 (comment). Can you please take a look and see if it works for your use case?

jawwad-ali · 2026-05-06T15:37:20Z

Got it @ingvagabund — clear direction. I'll leave this PR open so it stays the natural place to revisit if @alex-berger hits anything unexpected with the workaround, and close it once he confirms it works on his clusters (also happy to close earlier if you'd rather not have it sit).

On the README example — happy to send a small, separate docs-only PR adding the dedicated-RemoveFailedPods-profile YAML you posted to the RemoveFailedPods section of the README, so future operators hitting #1775's symptom find the pattern documented next to the plugin itself. Want me to:

Open that docs PR now (so it can be reviewed in parallel with @alex-berger's validation), or
Wait until he confirms the workaround works, then open the docs PR?

Either is fine — flag whichever fits your review cadence.

k8s-ci-robot · 2026-05-18T11:49:26Z

@jawwad-ali: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-descheduler-test-e2e-k8s-master-1-36	`f05b631`	link	true	`/test pull-descheduler-test-e2e-k8s-master-1-36`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


k8s-ci-robot requested review from damemi and googs1025 May 2, 2026 06:25

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 2, 2026

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2026

jawwad-ali force-pushed the fix/1775-remove-failed-system-critical-pods branch from d78b789 to f7f585c Compare May 2, 2026 06:57

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 2, 2026

ingvagabund self-assigned this May 4, 2026

googs1025 reviewed May 5, 2026
