Skip to content

feat(controller): add deletions_skipped_by_policy metric#6386

Open
jumasa wants to merge 1 commit into
kubernetes-sigs:masterfrom
jumasa:feat/policy-suppressed-deletions-metric
Open

feat(controller): add deletions_skipped_by_policy metric#6386
jumasa wants to merge 1 commit into
kubernetes-sigs:masterfrom
jumasa:feat/policy-suppressed-deletions-metric

Conversation

@jumasa
Copy link
Copy Markdown

@jumasa jumasa commented Apr 22, 2026

Upsert-only and create-only act as safety nets but silently swallow deletions and (under create-only) updates: a reconcile that held back changes is indistinguishable from a true no-op today.

  • Add external_dns_controller_deletions_skipped_by_policy{owned}, reset-on-idle, with owned={true,false,unknown}.
  • Partition by ownership before deduplication — RemoveDuplicates keys on (DNS name, type, set identifier) only and would otherwise let a foreign twin evict the owned record from the count.
  • Emit per-record debug logs for skipped deletions and (under create-only) skipped updates; skipped updates have no metric by design.
  • Distinguish a true no-op log from one where policy held changes back, so drift under create-only (no metric fires there when only updates are suppressed) is not silently masked.

What does it do ?

Adds observability for DNS changes that --policy=upsert-only / --policy=create-only hold back:

  • New gauge external_dns_controller_deletions_skipped_by_policy{owned}, reset-on-idle, with owned ∈ {true, false, unknown}.
  • Structured debug logs for skipped deletions and (under create-only) skipped updates, sharing the fields record, type, targets, owned, policy.
  • The reconcile-end log now distinguishes a true no-op from a reconcile where policy suppressed changes: No DNS changes applied; N deletion[s] held back by policy replaces All records are already up to date in that case.
  • Docs: PromQL for a reset-on-idle gauge, interplay with no_op_runs_total, debug-log volume on large clusters, and an rfc2136 AXFR-disabled caveat.

Motivation

Operators running either policy today have no observable signal when it suppresses a deletion or update: reconcile logs look healthy, no_op_runs_total ticks, and the drift (empty source, broken AXFR, misconfigured CRD) only surfaces once the
stale record causes trouble. This PR exposes that signal so it can be dashboarded and alerted on.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

Upsert-only and create-only act as safety nets but silently swallow
deletions and (under create-only) updates: a reconcile that held back
changes is indistinguishable from a true no-op today.

* Add `external_dns_controller_deletions_skipped_by_policy{owned}`,
  reset-on-idle, with owned={true,false,unknown}.
* Partition by ownership before deduplication — RemoveDuplicates keys
  on (DNS name, type, set identifier) only and would otherwise let a
  foreign twin evict the owned record from the count.
* Emit per-record debug logs for skipped deletions and (under
  create-only) skipped updates; skipped updates have no metric by
  design.
* Distinguish a true no-op log from one where policy held changes
  back, so drift under create-only (no metric fires there when only
  updates are suppressed) is not silently masked.
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Apr 22, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: jumasa / name: Juan Manuel Sa (23b03c6)

@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Apr 22, 2026
@k8s-ci-robot k8s-ci-robot requested a review from szuecs April 22, 2026 16:34
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @jumasa!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added docs internal Issues or PRs related to internal code needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. plan Issues or PRs related to external-dns plan labels Apr 22, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @jumasa. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 22, 2026
@ivankatliarchuk
Copy link
Copy Markdown
Member

This pr is AI slop-heavy and doing what is already supported, most likely should not be part of the codebase. The PR doesn't acknowledge verified_records exists or explain why a derived PromQL expression is insufficient, aka there is a gap in justification.

The existing metrics already provide a way to detect the same drift:

  • verified_records{record_type} counts records that exist in both source and registry
  • registry_endpoints_total (or registry_records{record_type}) counts everything in DNS under this instance

Aka

  • The registry - verified delta detects the same condition without any code changes
  • It would also catch drift from other causes (orphaned records, broken AXFR) — arguably more general
  • The PR's own docs mention AXFR-disabled as a case where the new metric fires zero despite active drift, but registry verified would catch that correctly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. controller Issues or PRs related to the controller docs internal Issues or PRs related to internal code needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. plan Issues or PRs related to external-dns plan size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants