fix(removefailedpods): allow opt-in eviction of failed system-critical pods #1865

Open
jawwad-ali wants to merge 2 commits into kubernetes-sigs:master from jawwad-ali:fix/1775-remove-failed-system-critical-pods

Conversation

@jawwad-ali

Description

Fixes #1775.

The RemoveFailedPods plugin chains handle.Evictor().Filter into its candidate-listing filter, which protects all system-critical-priority pods regardless of pod.Status.Phase. As reported in #1775, this prevents cleanup of Failed-phase system-critical pods (e.g. a stuck cert-manager pod). Disabling SystemCriticalPods protection at the DefaultEvictor level is not an acceptable workaround because it would also expose living system-critical pods to eviction.

This PR adds an opt-in IncludingSystemCriticalPods bool field to RemoveFailedPodsArgs. When the operator explicitly sets it, the plugin replaces handle.Evictor().Filter with an unconditional permit at candidate-listing time. The eviction is still gated by:

  • the existing PodFailed phase check
  • validateCanEvict (Reasons, ExitCodes, MinPodLifetimeSeconds, ExcludeOwnerKinds, namespaces, labels)
  • handle.Evictor().PreEvictionFilter

Default behaviour is unchanged — both living and failed system-critical pods stay protected unless the operator opts in.
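
The gating described above can be sketched in plain Go. This is a minimal model under stated assumptions, not the plugin's code: `pod`, `filterFunc`, `defaultEvictorFilter`, and `candidateFilter` are hypothetical stand-ins for the descheduler types, the default filter models only the system-critical protection, and the priority threshold of 2000000000 is assumed.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type pod struct {
	Name     string
	Phase    string // "Running", "Failed", ...
	Priority int32
}

// filterFunc mirrors the shape of a pod filter: true means "may be a candidate".
type filterFunc func(p pod) bool

// systemCriticalPriority is an assumed threshold for system-critical pods.
const systemCriticalPriority int32 = 2000000000

// defaultEvictorFilter models only the system-critical protection of the
// framework filter; the real filter applies more checks.
func defaultEvictorFilter(p pod) bool {
	return p.Priority < systemCriticalPriority
}

// candidateFilter models the PR as first proposed: when the operator opts in,
// the framework filter is replaced by an unconditional permit, while the
// Failed-phase check always applies.
func candidateFilter(includingSystemCriticalPods bool) filterFunc {
	evictorFilter := filterFunc(defaultEvictorFilter)
	if includingSystemCriticalPods {
		evictorFilter = func(pod) bool { return true }
	}
	return func(p pod) bool {
		return p.Phase == "Failed" && evictorFilter(p)
	}
}

func main() {
	failedCritical := pod{Name: "failed-critical", Phase: "Failed", Priority: systemCriticalPriority}
	runningCritical := pod{Name: "running-critical", Phase: "Running", Priority: systemCriticalPriority}

	fmt.Println("default, failed critical:", candidateFilter(false)(failedCritical))
	fmt.Println("opt-in, failed critical:", candidateFilter(true)(failedCritical))
	fmt.Println("opt-in, running critical:", candidateFilter(true)(runningCritical))
}
```

In this model a running system-critical pod is never a candidate, with or without the flag, because the phase check sits outside the part that the flag swaps out.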

This addresses @googs1025's "by design" comment on the issue: the framework-level protection stays the framework default; the new flag is a per-plugin, per-operator escape valve scoped specifically to the cleanup-stuck-failed-pods use case, without the broader blast radius of disabling SystemCriticalPods in DefaultEvictor.

Posted the design proposal on the issue thread first (#1775 (comment)) — happy to make the toggle finer-grained (e.g. a per-protection struct) if reviewers prefer.

Trade-off

The flag also bypasses the framework's other default protections (DaemonSet, StaticPod, LocalStorage, FailedBarePods) for failed pods when set. The argument: failed pods are by definition not running, so DaemonSet/StaticPod semantics don't apply the same way, and the operator has explicitly opted in. If reviewers prefer to keep those protections enforced even when this flag is set, the implementation can be tightened to bypass only the system-critical check.

Test plan

go test ./pkg/framework/plugins/removefailedpods/... -v — all existing 25 cases pass plus 2 new cases:

  • system-critical priority failed pod is protected by default (asserts default behaviour is unchanged)
  • includingSystemCriticalPods=true, system-critical priority failed pod is evicted (asserts the opt-in works)

go vet ./pkg/framework/plugins/removefailedpods/... clean. go build ./... clean.

Checklist

  • Code Readability: Is the code easy to understand, well-structured, and consistent with project conventions?
  • Naming Conventions: Are variable, function, and structs descriptive and consistent?
  • Code Duplication: Is there any repeated code that should be refactored?
  • Function/Method Size: Are functions/methods short and focused on a single task?
  • Comments & Documentation: Are comments clear, useful, and not excessive?
  • Error Handling: Are errors handled appropriately?
  • Testing: Are there sufficient unit/integration tests?
  • Performance: Are there any obvious performance issues or unnecessary computations?
  • Dependencies: Are new dependencies justified?
  • Logging & Monitoring: Is logging used appropriately?
  • Backward Compatibility: Does this change break any existing functionality or APIs?
  • Resource Management: Are resources managed and released properly?
  • PR Description: Is the PR description clear, providing enough context?
  • Documentation & Changelog: Are README and docs updated if necessary? (No CHANGELOG.md in repo; docs unchanged.)

@k8s-ci-robot k8s-ci-robot requested review from damemi and googs1025 May 2, 2026 06:25
@k8s-ci-robot
Contributor

Welcome @jawwad-ali!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 2, 2026
@k8s-ci-robot
Contributor

Hi @jawwad-ali. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2026
Adds IncludingSystemCriticalPods to RemoveFailedPodsArgs. When the
operator explicitly sets the flag, the RemoveFailedPods plugin
bypasses the framework default-evictor filter at candidate-listing
time, allowing failed pods that carry system-critical priority to be
considered for eviction. The PodFailed phase check, validateCanEvict
(Reasons/ExitCodes/MinPodLifetimeSeconds/ExcludeOwnerKinds), and
PreEvictionFilter still gate eviction.

Default behaviour is unchanged: living and failed system-critical
pods stay protected unless the operator opts in.
@jawwad-ali jawwad-ali force-pushed the fix/1775-remove-failed-system-critical-pods branch from d78b789 to f7f585c Compare May 2, 2026 06:57
@googs1025
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 2, 2026
@ingvagabund ingvagabund self-assigned this May 4, 2026
Comment on lines 25 to 36
 type RemoveFailedPodsArgs struct {
 	metav1.TypeMeta `json:",inline"`

-	Namespaces              *api.Namespaces       `json:"namespaces,omitempty"`
-	LabelSelector           *metav1.LabelSelector `json:"labelSelector,omitempty"`
-	ExcludeOwnerKinds       []string              `json:"excludeOwnerKinds,omitempty"`
-	MinPodLifetimeSeconds   *uint                 `json:"minPodLifetimeSeconds,omitempty"`
-	Reasons                 []string              `json:"reasons,omitempty"`
-	ExitCodes               []int32               `json:"exitCodes,omitempty"`
-	IncludingInitContainers bool                  `json:"includingInitContainers,omitempty"`
+	Namespaces                  *api.Namespaces       `json:"namespaces,omitempty"`
+	LabelSelector               *metav1.LabelSelector `json:"labelSelector,omitempty"`
+	ExcludeOwnerKinds           []string              `json:"excludeOwnerKinds,omitempty"`
+	MinPodLifetimeSeconds       *uint                 `json:"minPodLifetimeSeconds,omitempty"`
+	Reasons                     []string              `json:"reasons,omitempty"`
+	ExitCodes                   []int32               `json:"exitCodes,omitempty"`
+	IncludingInitContainers     bool                  `json:"includingInitContainers,omitempty"`
+	IncludingSystemCriticalPods bool                  `json:"includingSystemCriticalPods,omitempty"`
 }
Member

Small consistency thing: we already have EvictSystemCriticalPods on DefaultEvictor, and the docs/issues all use that wording. Could we name this one along the same lines (e.g. EvictFailedSystemCriticalPods) so users don't have to keep two mental models for what's essentially the same concept at different scopes?

Author

Done in f05b631 — renamed to EvictFailedSystemCriticalPods (and evictFailedSystemCriticalPods JSON tag) so it parallels EvictSystemCriticalPods on DefaultEvictor.

@@ -25,11 +25,12 @@ import (
type RemoveFailedPodsArgs struct {
Member

Have you double-checked that the new field is wired through any versioned API types as well? If RemoveFailedPodsArgs has a counterpart under pkg/api/v1alphaX/, the YAML decode path will silently drop includingSystemCriticalPods and the flag will look like it does nothing in real configs even though tests pass. Probably fine, but worth a grep -r RemoveFailedPodsArgs pkg/api/ to confirm.

Author

Checked — grep -rn RemoveFailedPodsArgs pkg/api/ returns nothing. Descheduler doesn't use the apimachinery per-plugin versioned-types pattern; plugin args are registered via pluginregistry.Register and decoded through pluginArgConversionScheme with the internal RemoveFailedPodsArgs as the target type, so the new field flows through the YAML decode path without a v1alpha2 shadow. The existing zz_generated.deepcopy.go uses *out = *in which already covers the new bool.
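
The reviewer's "silently drop" concern and the shallow-copy point can be illustrated with a small, self-contained Go sketch. It uses `encoding/json` as a stand-in for the decode path; the descheduler's scheme-based decoding may behave differently (for example, a strict decoder would reject unknown fields). `argsWithField` and `argsWithoutField` are hypothetical cut-down types, and the JSON tag is the one shown in the diff under review.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// argsWithField is a hypothetical cut-down args type that carries the new field.
type argsWithField struct {
	IncludingInitContainers     bool `json:"includingInitContainers,omitempty"`
	IncludingSystemCriticalPods bool `json:"includingSystemCriticalPods,omitempty"`
}

// argsWithoutField models a stale counterpart type that was never updated.
type argsWithoutField struct {
	IncludingInitContainers bool `json:"includingInitContainers,omitempty"`
}

// decodeBoth decodes the same document into both types.
func decodeBoth(raw []byte) (argsWithField, argsWithoutField, error) {
	var with argsWithField
	var without argsWithoutField
	if err := json.Unmarshal(raw, &with); err != nil {
		return with, without, err
	}
	// encoding/json ignores unknown keys by default, so this call succeeds
	// and the flag is lost without any error.
	err := json.Unmarshal(raw, &without)
	return with, without, err
}

func main() {
	raw := []byte(`{"includingSystemCriticalPods": true}`)
	with, without, err := decodeBoth(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("type with field:    %+v\n", with)
	fmt.Printf("type without field: %+v\n", without)

	// A plain struct assignment copies every value field, which is why a
	// generated `*out = *in` also covers a newly added bool.
	copied := with
	fmt.Println("copy carries flag:", copied.IncludingSystemCriticalPods)
}
```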

@googs1025
Member

googs1025 commented May 5, 2026

Reading the code, I think this flag, although named IncludingSystemCriticalPods, actually replaces the whole DefaultEvictor.Filter (DaemonSet/StaticPod/LocalStorage all fall away too). More importantly, even if we tighten the implementation, putting this knob at the plugin level creates a precedence problem: a plugin-level flag would be able to override the framework-level EvictSystemCriticalPods decision, which inverts the usual "framework is the ceiling, plugins can only be more restrictive" convention.
If we ever later add a framework-level EvictFailedSystemCriticalPods to address the same need at the right layer, the two flags would conflict in confusing ways (plugin says yes, framework says no: who wins?).

Address review feedback on kubernetes-sigs#1865:

- Rename IncludingSystemCriticalPods to EvictFailedSystemCriticalPods
  to align with the existing EvictSystemCriticalPods wording on
  DefaultEvictor and avoid two mental models for the same concept at
  different scopes.

- Tighten the filter override so only pods that are both PodFailed
  and system-critical bypass the framework default-evictor filter.
  All other DefaultEvictor protections (DaemonSet, StaticPod,
  LocalStorage, FailedBarePods) continue to apply for failed pods,
  closing the over-broad bypass that was previously documented as a
  trade-off in the PR body.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ingvagabund. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jawwad-ali
Author

Thanks for the careful read @googs1025 — both points landed.

Over-broad bypass: fixed in f05b631. The filter override is now scoped to pods that are both PodFailed and utils.IsCriticalPriorityPod; everything else (DaemonSet, StaticPod, LocalStorage, FailedBarePods) still goes through handle.Evictor().Filter unchanged. The PR body's earlier "this also bypasses other protections" trade-off no longer applies.
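
A minimal Go sketch of the scoping described above, assuming a simplified model: `pod`, `defaultEvictorFilter`, and `scopedFilter` are hypothetical names, the `DaemonSet` field stands in for an owner-reference check, `isCriticalPriorityPod` only imitates what utils.IsCriticalPriorityPod is described as doing, and the priority threshold is assumed.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type pod struct {
	Name      string
	Phase     string
	Priority  int32
	DaemonSet bool // hypothetical flag standing in for an owner-reference check
}

// systemCriticalPriority is an assumed threshold for system-critical pods.
const systemCriticalPriority int32 = 2000000000

// isCriticalPriorityPod imitates the priority check in this model.
func isCriticalPriorityPod(p pod) bool {
	return p.Priority >= systemCriticalPriority
}

// defaultEvictorFilter models two of the framework protections.
func defaultEvictorFilter(p pod) bool {
	return !isCriticalPriorityPod(p) && !p.DaemonSet
}

// scopedFilter lets only pods that are both Failed and system-critical skip
// the default filter when the operator opts in; every other pod is still
// judged by the default filter.
func scopedFilter(evictFailedSystemCriticalPods bool) func(pod) bool {
	return func(p pod) bool {
		if evictFailedSystemCriticalPods && p.Phase == "Failed" && isCriticalPriorityPod(p) {
			return true
		}
		return defaultEvictorFilter(p)
	}
}

func main() {
	pods := []pod{
		{Name: "failed-critical", Phase: "Failed", Priority: systemCriticalPriority},
		{Name: "running-critical", Phase: "Running", Priority: systemCriticalPriority},
		{Name: "failed-daemonset", Phase: "Failed", DaemonSet: true},
	}
	filter := scopedFilter(true)
	for _, p := range pods {
		fmt.Printf("%s: candidate=%t\n", p.Name, filter(p))
	}
}
```

In this model a pod that is failed, system-critical, and DaemonSet-owned at the same time would also skip the DaemonSet check. Whether the commit treats that combination differently is not stated in the thread.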

Precedence inversion: this one is worth a real decision before I push more code, so I want to ask rather than guess.

The remaining concern as I read it: even with the implementation tightened, a plugin-level EvictFailedSystemCriticalPods knob lets a plugin override a framework-level EvictSystemCriticalPods=false decision, which inverts the usual "framework is the ceiling, plugins can only narrow" convention. If you later add a framework-level EvictFailedSystemCriticalPods to DefaultEvictorArgs, the two flags collide.

Two paths from here, happy to take whichever you prefer:

  1. Keep this PR plugin-scoped, document the precedence interaction (plugin opt-in implies overriding the framework's SystemCriticalPods protection for failed pods only), and treat a future framework-level toggle as an additive change rather than a conflict (when present, the framework-level flag is the source of truth and the plugin-level one becomes a no-op or is deprecated).

  2. Move the toggle to the framework — close this PR (or rebuild it), add EvictFailedSystemCriticalPods to DefaultEvictorArgs, and have applySystemCriticalPodsProtection skip its check when pod.Status.Phase == v1.PodFailed && args.EvictFailedSystemCriticalPods. RemoveFailedPods then needs no plugin-level flag at all; users opt in once at the evictor level. This is the architecturally cleaner direction and aligns with the existing EvictSystemCriticalPods knob's location.

I'm leaning toward (2) because it sidesteps the precedence-inversion question entirely and the call-site reads more naturally (evictFailedSystemCriticalPods lives next to evictSystemCriticalPods in the same args block). Path (1) keeps this PR small but inherits the issue you raised.
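
Path (2) could look roughly like the following Go sketch. Everything here is an assumption for illustration: `defaultEvictorArgs` is a cut-down model of DefaultEvictorArgs, `EvictFailedSystemCriticalPods` is the proposed field rather than an existing one, and `systemCriticalProtected` only imitates the role that applySystemCriticalPodsProtection is described as playing.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type pod struct {
	Phase    string
	Priority int32
}

// defaultEvictorArgs is a hypothetical cut-down model of the evictor args.
type defaultEvictorArgs struct {
	EvictSystemCriticalPods       bool
	EvictFailedSystemCriticalPods bool // proposed in this thread, not an existing field
}

// systemCriticalPriority is an assumed threshold for system-critical pods.
const systemCriticalPriority int32 = 2000000000

// systemCriticalProtected reports whether the system-critical protection
// blocks eviction of p under the given args.
func systemCriticalProtected(args defaultEvictorArgs, p pod) bool {
	if p.Priority < systemCriticalPriority {
		return false // not a system-critical pod, protection does not apply
	}
	if args.EvictSystemCriticalPods {
		return false // existing knob: protection switched off entirely
	}
	if p.Phase == "Failed" && args.EvictFailedSystemCriticalPods {
		return false // proposed knob: protection lifted for failed pods only
	}
	return true
}

func main() {
	args := defaultEvictorArgs{EvictFailedSystemCriticalPods: true}
	failed := pod{Phase: "Failed", Priority: systemCriticalPriority}
	running := pod{Phase: "Running", Priority: systemCriticalPriority}

	fmt.Println("failed critical protected:", systemCriticalProtected(args, failed))
	fmt.Println("running critical protected:", systemCriticalProtected(args, running))
}
```

Because both knobs live in the same args struct, there is a single place where precedence is decided, which is the property the comment above argues for.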

Which way do you want me to take it? Happy to redo this PR end-to-end at the framework level if (2) is the answer.

@ingvagabund
Contributor

Dropped comment in #1775 (comment)

@jawwad-ali
Author

Thanks @ingvagabund — that profile-scoped workaround is a great pointer; I hadn't fully appreciated that disabling SystemCriticalPods in a dedicated RemoveFailedPods-only profile sidesteps the issue today without any code change.

Your last line is the part I want to make sure I'm reading right:

Every pod identified by RemoveFailedPods plugin as evictable is a failed pod by the definition. So the SystemCriticalPods gating can be lifted.

Taken at face value, that argument applies just as cleanly to the plugin's own filter as it does to a user-supplied profile: if every candidate the plugin will ever evict is, by definition, in PodFailed, the framework-level SystemCriticalPods gate is logically redundant for this plugin. Which opens a third option I hadn't considered:

Option C — unconditional lift, no flag. Drop EvictFailedSystemCriticalPods from RemoveFailedPodsArgs entirely; in New(), the filter for RemoveFailedPods always bypasses the SystemCriticalPods check (only — DaemonSet, StaticPod, LocalStorage, FailedBarePods stay enforced). Smallest possible diff, no precedence-inversion question, and the behaviour matches your "by definition" argument.

That sidesteps both your point and @googs1025's earlier precedence-inversion concern, at the cost of a silent behaviour change for any operator currently relying on SystemCriticalPods to also block failed system-critical pods from RemoveFailedPods (release-note worthy, but I'd argue a bug fix rather than a breaking change, given the plugin's contract).

So three paths I can see, want me to take whichever you and @googs1025 converge on:

  1. Close this PR, document the dedicated-profile workaround in the issue / README, leave plugin code untouched.
  2. Option C above — unconditional lift, no flag. Smallest code, behaviour change called out in release notes.
  3. Keep this PR's shape (opt-in EvictFailedSystemCriticalPods flag, tightened to system-critical-only as in f05b631a) — preserves backwards compatibility but accepts the precedence-inversion @googs1025 flagged.

I lean toward (2) given your "by definition" framing — it's the cleanest read of the plugin's contract — but happy to go (1) or (3) if either of you sees a reason I'm missing.

@ingvagabund
Contributor

The current policy API is designed to allow customization of what is considered a failed pod and whether even failed critical pods should be evicted or not. It's perfectly fine to keep the defaults and not evict failed critical pods, since in some cases a different eviction mechanism might need to be deployed, e.g. due to different lifecycle requirements. For example, some application installations may have different company policies for evicting critical pods.

@ingvagabund
Contributor

Maybe I misunderstood the description of the issue in #1775. Yet the goal is to allow eviction of failed critical pods by the RemoveFailedPods plugin, which is already possible by configuring two profiles, each with different DefaultEvictor arguments.
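
Why the two-profile setup is safe can be shown with a toy Go model. It is not the descheduler's policy API: `profile`, `evictable`, `evictedByAny`, and `someOtherPlugin` are hypothetical, each plugin is reduced to a predicate over pods, and the priority threshold is assumed.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down stand-in for v1.Pod.
type pod struct {
	Name     string
	Phase    string
	Priority int32
}

// systemCriticalPriority is an assumed threshold for system-critical pods.
const systemCriticalPriority int32 = 2000000000

// profile is a hypothetical model of a profile: its own evictor setting plus
// the plugins enabled in it, each reduced to "does it select this pod?".
type profile struct {
	Name                       string
	SystemCriticalPodsDisabled bool
	Plugins                    []func(pod) bool
}

// removeFailedPods selects only pods in the Failed phase.
func removeFailedPods(p pod) bool { return p.Phase == "Failed" }

// someOtherPlugin is a made-up plugin that targets running pods.
func someOtherPlugin(p pod) bool { return p.Phase == "Running" }

// evictable applies the profile's own protection, then asks its plugins.
func evictable(pr profile, p pod) bool {
	if !pr.SystemCriticalPodsDisabled && p.Priority >= systemCriticalPriority {
		return false
	}
	for _, selects := range pr.Plugins {
		if selects(p) {
			return true
		}
	}
	return false
}

// evictedByAny reports whether any profile would evict the pod.
func evictedByAny(profiles []profile, p pod) bool {
	for _, pr := range profiles {
		if evictable(pr, p) {
			return true
		}
	}
	return false
}

func main() {
	profiles := []profile{
		{Name: "general", Plugins: []func(pod) bool{someOtherPlugin}},
		{Name: "failed-only", SystemCriticalPodsDisabled: true, Plugins: []func(pod) bool{removeFailedPods}},
	}
	failed := pod{Name: "failed-critical", Phase: "Failed", Priority: systemCriticalPriority}
	running := pod{Name: "running-critical", Phase: "Running", Priority: systemCriticalPriority}

	fmt.Println(failed.Name, "evicted:", evictedByAny(profiles, failed))
	fmt.Println(running.Name, "evicted:", evictedByAny(profiles, running))
}
```

The protection is lifted only in the profile whose sole plugin selects failed pods, so a running system-critical pod is still covered by the protection in every profile that could otherwise select it.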

@jawwad-ali
Author

Thanks for the clarification @ingvagabund — that's a useful correction. I over-read your earlier "the SystemCriticalPods gating can be lifted" as a code-change recommendation; in context (and reading your two follow-up messages together) it's clearly justification for why the dedicated-profile workaround is safe, not a steer to change the plugin's filter chain.

Reflecting your position back to make sure I have it right:

  • The current policy API is intentional — defaulting to protect failed critical pods is the right default because operators have varying compliance/lifecycle policies.
  • The use case in #1775 is already addressable today by configuring a RemoveFailedPods-only profile with SystemCriticalPods in PodProtections.DefaultDisabled.
  • The plugin's filter chain therefore doesn't need a per-plugin escape valve.

If that read is correct, the right outcome is for me to close this PR and (optionally) leave a short comment on #1775 pointing the reporter at your YAML example so the issue can also be closed. Happy to do both.

Want me to wait for @googs1025 to weigh in before closing, or should I close it now? Either is fine — just want to follow whichever cadence the team prefers.

@ingvagabund
Contributor

ingvagabund commented May 6, 2026

As long as you can validate the suggested existing solution works there's nothing else to do. Maybe adding an example for the plugin under https://github.com/kubernetes-sigs/descheduler#removefailedpods? At this point it is up to @alex-berger to tell us whether the suggested solution works.

@alex-berger

The suggested solution looks good to me, and once it is released I can test it on our clusters.

@ingvagabund
Contributor

The currently suggested solution is already part of the releases: #1775 (comment). Can you please take a look and see if it works for your use case?

@jawwad-ali
Author

Got it @ingvagabund — clear direction. I'll leave this PR open so it stays the natural place to revisit if @alex-berger hits anything unexpected with the workaround, and close it once he confirms it works on his clusters (also happy to close earlier if you'd rather not have it sit).

On the README example — happy to send a small, separate docs-only PR adding the dedicated-RemoveFailedPods-profile YAML you posted to the RemoveFailedPods section of the README, so future operators hitting #1775's symptom find the pattern documented next to the plugin itself. Want me to:

  1. Open that docs PR now (so it can be reviewed in parallel with @alex-berger's validation), or
  2. Wait until he confirms the workaround works, then open the docs PR?

Either is fine — flag whichever fits your review cadence.

@k8s-ci-robot
Contributor

@jawwad-ali: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-descheduler-test-e2e-k8s-master-1-36 | f05b631 | link | true | /test pull-descheduler-test-e2e-k8s-master-1-36

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failed Pods with system critical priority are note removed by RemoveFailedPods plugin

5 participants