fix(removefailedpods): allow opt-in eviction of failed system-critical pods#1865
Conversation
|
Welcome @jawwad-ali! |
|
Hi @jawwad-ali. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Adds IncludingSystemCriticalPods to RemoveFailedPodsArgs. When the operator explicitly sets the flag, the RemoveFailedPods plugin bypasses the framework default-evictor filter at candidate-listing time, allowing failed pods that carry system-critical priority to be considered for eviction. The PodFailed phase check, validateCanEvict (Reasons/ExitCodes/MinPodLifetimeSeconds/ExcludeOwnerKinds), and PreEvictionFilter still gate eviction. Default behaviour is unchanged: living and failed system-critical pods stay protected unless the operator opts in.
d78b789 to
f7f585c
Compare
|
/ok-to-test |
| type RemoveFailedPodsArgs struct { | ||
| metav1.TypeMeta `json:",inline"` | ||
|
|
||
| Namespaces *api.Namespaces `json:"namespaces,omitempty"` | ||
| LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"` | ||
| ExcludeOwnerKinds []string `json:"excludeOwnerKinds,omitempty"` | ||
| MinPodLifetimeSeconds *uint `json:"minPodLifetimeSeconds,omitempty"` | ||
| Reasons []string `json:"reasons,omitempty"` | ||
| ExitCodes []int32 `json:"exitCodes,omitempty"` | ||
| IncludingInitContainers bool `json:"includingInitContainers,omitempty"` | ||
| Namespaces *api.Namespaces `json:"namespaces,omitempty"` | ||
| LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"` | ||
| ExcludeOwnerKinds []string `json:"excludeOwnerKinds,omitempty"` | ||
| MinPodLifetimeSeconds *uint `json:"minPodLifetimeSeconds,omitempty"` | ||
| Reasons []string `json:"reasons,omitempty"` | ||
| ExitCodes []int32 `json:"exitCodes,omitempty"` | ||
| IncludingInitContainers bool `json:"includingInitContainers,omitempty"` | ||
| IncludingSystemCriticalPods bool `json:"includingSystemCriticalPods,omitempty"` | ||
| } |
There was a problem hiding this comment.
Small consistency thing: we already have EvictSystemCriticalPods on DefaultEvictor, and the docs/issues all use that wording. Could we name this one along the same lines — e.g.
EvictFailedSystemCriticalPods — so users don't have to keep two mental models for what's essentially the same concept at different scopes?
There was a problem hiding this comment.
Done in f05b631 — renamed to EvictFailedSystemCriticalPods (and evictFailedSystemCriticalPods JSON tag) so it parallels EvictSystemCriticalPods on DefaultEvictor.
| @@ -25,11 +25,12 @@ import ( | |||
| type RemoveFailedPodsArgs struct { | |||
There was a problem hiding this comment.
Have you double-checked that the new field is wired through any versioned API types as well? If RemoveFailedPodsArgs has a counterpart under pkg/api/v1alphaX/, the YAML decode path will silently drop includingSystemCriticalPods and the flag will look like it does nothing in real configs even though tests pass. Probably fine, but worth a grep -r RemoveFailedPodsArgs pkg/api/ to confirm.
There was a problem hiding this comment.
Checked — grep -rn RemoveFailedPodsArgs pkg/api/ returns nothing. Descheduler doesn't use the apimachinery per-plugin versioned-types pattern; plugin args are registered via pluginregistry.Register and decoded through pluginArgConversionScheme with the internal RemoveFailedPodsArgs as the target type, so the new field flows through the YAML decode path without a v1alpha2 shadow. The existing zz_generated.deepcopy.go uses *out = *in which already covers the new bool.
|
Reading the code I think this flag as written is named IncludingSystemCriticalPods but actually replaces the whole |
Address review feedback on kubernetes-sigs#1865: - Rename IncludingSystemCriticalPods to EvictFailedSystemCriticalPods to align with the existing EvictSystemCriticalPods wording on DefaultEvictor and avoid two mental models for the same concept at different scopes. - Tighten the filter override so only pods that are both PodFailed and system-critical bypass the framework default-evictor filter. All other DefaultEvictor protections (DaemonSet, StaticPod, LocalStorage, FailedBarePods) continue to apply for failed pods, closing the over-broad bypass that was previously documented as a trade-off in the PR body.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Thanks for the careful read @googs1025 — both points landed. Over-broad bypass: fixed in f05b631. The filter override is now scoped to pods that are both Precedence inversion: this one is worth a real decision before I push more code, so I want to ask rather than guess. The remaining concern as I read it: even with the implementation tightened, a plugin-level Two paths from here, happy to take whichever you prefer:
I'm leaning toward (2) because it sidesteps the precedence-inversion question entirely and the call-site reads more naturally ( Which way do you want me to take it? Happy to redo this PR end-to-end at the framework level if (2) is the answer. |
|
Dropped comment in #1775 (comment) |
|
Thanks @ingvagabund — that profile-scoped workaround is a great pointer; I hadn't fully appreciated that disabling Your last line is the part I want to make sure I'm reading right:
Taken at face value, that argument applies just as cleanly to the plugin's own filter as it does to a user-supplied profile: if every candidate the plugin will ever evict is, by definition, in Option C — unconditional lift, no flag. Drop That sidesteps both your point and @googs1025's earlier precedence-inversion concern, at the cost of a silent behaviour change for any operator currently relying on So three paths I can see, want me to take whichever you and @googs1025 converge on:
I lean toward (2) given your "by definition" framing — it's the cleanest read of the plugin's contract — but happy to go (1) or (3) if either of you sees a reason I'm missing. |
|
The current policy API is designed to allow customization of what is considered a failed pod and whether even failed critical pods should be evicted or not. It's perfectly fine to keep the defaults and not evicting failed critical pods as in some cases a different eviction mechanism might be needed to deployed due to different e.g. lifecycle requirements. E.g. some application installation may have different company policies for evicting critical pods. |
|
Maybe I misunderstood the description of the issue in #1775. Yet, the goal is to allow eviction of failed critical pods by RemoveFailedPods plugin. Which is already possible by configuring two profiles each with different DefaultEvictor arguments. |
|
Thanks for the clarification @ingvagabund — that's a useful correction. I over-read your earlier "the SystemCriticalPods gating can be lifted" as a code-change recommendation; in context (and reading your two follow-up messages together) it's clearly justification for why the dedicated-profile workaround is safe, not a steer to change the plugin's filter chain. Reflecting your position back to make sure I have it right:
If that read is correct, the right outcome is for me to close this PR and (optionally) leave a short comment on #1775 pointing the reporter at your YAML example so the issue can also be closed. Happy to do both. Want me to wait for @googs1025 to weigh in before closing, or should I close it now? Either is fine — just want to follow whichever cadence the team prefers. |
|
As long as you can validate the suggested existing solution works there's nothing else to do. Maybe adding an example for the plugin under https://github.com/kubernetes-sigs/descheduler#removefailedpods? At this point it is up to @alex-berger to tell us whether the suggested solution works. |
|
The suggested solution looks good to me and once it is released I, I can test it on our clusters. |
|
The currently suggested solution is already part of the releases: #1775 (comment). Can you please take a look and see if it works for your use case? |
|
Got it @ingvagabund — clear direction. I'll leave this PR open so it stays the natural place to revisit if @alex-berger hits anything unexpected with the workaround, and close it once he confirms it works on his clusters (also happy to close earlier if you'd rather not have it sit). On the README example — happy to send a small, separate docs-only PR adding the dedicated-
Either is fine — flag whichever fits your review cadence. |
|
@jawwad-ali: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Description
Fixes #1775.
The
RemoveFailedPodsplugin chainshandle.Evictor().Filterinto its candidate-listing filter, which protects all system-critical-priority pods regardless ofpod.Status.Phase. As reported in #1775, this prevents cleanup ofFailed-phase system-critical pods (e.g. a stuckcert-managerpod). DisablingSystemCriticalPodsprotection at theDefaultEvictorlevel is not an acceptable workaround because it would also expose living system-critical pods to eviction.This PR adds an opt-in
IncludingSystemCriticalPods boolfield toRemoveFailedPodsArgs. When the operator explicitly sets it, the plugin replaceshandle.Evictor().Filterwith an unconditional permit at candidate-listing time. The eviction is still gated by:PodFailedphase checkvalidateCanEvict(Reasons,ExitCodes,MinPodLifetimeSeconds,ExcludeOwnerKinds, namespaces, labels)handle.Evictor().PreEvictionFilterDefault behaviour is unchanged — both living and failed system-critical pods stay protected unless the operator opts in.
This addresses @googs1025's "by design" comment on the issue: the framework-level protection stays the framework default; the new flag is a per-plugin, per-operator escape valve scoped specifically to the cleanup-stuck-failed-pods use case, without the broader blast radius of disabling
SystemCriticalPodsinDefaultEvictor.Posted the design proposal on the issue thread first (#1775 (comment)) — happy to make the toggle finer-grained (e.g. a per-protection struct) if reviewers prefer.
Trade-off
The flag also bypasses the framework's other default protections (DaemonSet, StaticPod, LocalStorage, FailedBarePods) for failed pods when set. The argument: failed pods are by definition not running, so DaemonSet/StaticPod semantics don't apply the same way, and the operator has explicitly opted in. If reviewers prefer to keep those protections enforced even when this flag is set, the implementation can be tightened to bypass only the system-critical check.
Test plan
go test ./pkg/framework/plugins/removefailedpods/... -v— all existing 25 cases pass plus 2 new cases:system-critical priority failed pod is protected by default(asserts default behaviour is unchanged)includingSystemCriticalPods=true, system-critical priority failed pod is evicted(asserts the opt-in works)go vet ./pkg/framework/plugins/removefailedpods/...clean.go build ./...clean.Checklist