chore: fix two bugs associated with istio ingress annotation processing by mwhittington21 · Pull Request #6439 · kubernetes-sigs/external-dns

mwhittington21 · 2026-05-14T14:07:44Z

What does it do ?

Discovered during development of #6438 which is unlikely to see a merge as-is, so I have separated the fixes into this PR to avoid issues with v0.22.0 when it launches.

Fix 1: fix(istio): resolve annotation prefix not applied to gateway/ingress annotation keys

IstioGatewayIngressSource and K8sGatewaySource were package-level vars evaluated at import time, before --annotation-prefix is applied. This caused Istio gateways using the /ingress or /gateway annotations to silently produce no endpoints, resulting in record deletion under sync policy.
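
The failure mode described above can be reproduced in a few lines of plain Go. This is a minimal sketch, not the external-dns code: `annotationPrefix`, `ingressKeyVar`, `ingressKeyFunc`, `setAnnotationPrefix`, and both prefix values are hypothetical names chosen for illustration. It only shows that a package-level variable derived from the prefix keeps its initial value, while a function re-reads the prefix on each call.

```go
package main

import "fmt"

// annotationPrefix stands in for the configurable prefix (hypothetical name and value).
var annotationPrefix = "old.example.com/"

// ingressKeyVar is computed once, during package initialization,
// so it captures whatever the prefix was at that moment.
var ingressKeyVar = annotationPrefix + "ingress"

// ingressKeyFunc re-reads the prefix on every call.
func ingressKeyFunc() string { return annotationPrefix + "ingress" }

// setAnnotationPrefix mimics a flag being applied after package initialization.
func setAnnotationPrefix(p string) { annotationPrefix = p }

func main() {
	setAnnotationPrefix("new.example.com/")
	fmt.Println(ingressKeyVar)    // prints old.example.com/ingress
	fmt.Println(ingressKeyFunc()) // prints new.example.com/ingress
}
```

A lookup keyed on the stale variable would miss annotations written with the new prefix, which matches the "silently produce no endpoints" symptom.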

Fix 2: fix(istio): skip gateways with bad annotations instead of crashing the controller

A nonexistent gateway or ingress reference in an annotation caused a fatal error that killed the pod. Now logs a warning and continues processing remaining resources.

Bug information

Bug 1

Supply the --annotation-prefix=external-dns.alpha.kubernetes.io/ argument to the pod. Start the pod in a cluster where external-dns.alpha.kubernetes.io/ingress annotations are in use. Observe the immediate deletion of the records that rely on this annotation:

time="2026-05-14T13:07:59Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:08:00Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:08:00Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE argocd.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE ckkj-argocd-proxy.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE cname-argocd.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE cname-ckkj-argocd-proxy.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
...etc

After fix, these are recreated (or never deleted):

time="2026-05-14T13:20:24Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:20:25Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:20:25Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE argocd.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE ckkj-argocd-proxy.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE cname-argocd.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE cname-ckkj-argocd-proxy.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
...etc

Bug 2

Before fix:

time="2026-05-14T13:07:09Z" level=info msg="GitCommitShort=unknown, GoVersion=go1.26.1, Platform=linux/amd64, UserAgent=ExternalDNS/v0.21.0-dirty"
time="2026-05-14T13:07:09Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:07:09Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:07:09Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:07:10Z" level=error msg="ingress.networking.k8s.io \"default\" not found"
time="2026-05-14T13:07:10Z" level=fatal msg="Failed to do run once: ingress.networking.k8s.io \"default\" not found"
<pod crashes here>

After fix:

time="2026-05-14T13:20:36Z" level=error msg="ingress.networking.k8s.io \"default\" not found"
time="2026-05-14T13:20:36Z" level=warning msg="Could not generate endpoints for VirtualService 'mwhittington-test/mwhittington-test-vsvc': ingress.networking.k8s.io \"default\" not found"

Pod does not crash.

Motivation

These bugs stop us using external-dns for the purposes we need it for internally and would do so in quite a destructive way by removing records that are receiving live traffic. The crash is less severe, but provides an avenue for a denial of service attack in multi-tenant clusters.

More

Yes, this PR title follows Conventional Commits
Yes, I added unit tests
Yes, I updated end user documentation accordingly


k8s-ci-robot · 2026-05-14T14:07:55Z

Hi @mwhittington21. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-05-14T14:08:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ivankatliarchuk · 2026-05-15T09:17:07Z

/ok-to-test

coveralls · 2026-05-15T09:24:28Z

Coverage Report for CI Build 25864688313

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.003%) to 80.595%

Details

Coverage decreased (-0.003%) from the base build.
Patch coverage: No coverable lines changed in this PR.
33 coverage regressions across 3 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

33 previously-covered lines in 3 files lost coverage.

File	Lines Losing Coverage	Coverage
istio_virtualservice.go	20	87.83%
istio_gateway.go	12	90.36%
openshift_route.go	1	82.93%

Coverage Stats


Relevant Lines:	21391
Covered Lines:	17240
Line Coverage:	80.59%
Coverage Strength:	1451.97 hits per line

💛 - Coveralls

ivankatliarchuk

The full call chain is clear. Here's what currently happens:

  targetsFromIngress          → return nil, err
    ↓
  endpointsFromGateway        → return nil, err
    ↓
  Endpoints()                 → return nil, err
    ↓ (increments sourceErrorsTotal metric)
  RunOnce()                   → return err
    ↓ (checks errors.Is(err, provider.SoftError) — this is NOT a SoftError)
  Run()                       → return fmt.Errorf("failed to do run once: %w", err)
    ↓
  execute.go              → log.Fatal(err)   ← pod crashes

The Run() loop has two paths for errors:

SoftError → logs at ERROR, increments a counter, continues the loop
anything else → breaks the loop → log.Fatal → pod exits

Source errors (from Endpoints()) are plain Go errors - not SoftError - so they always take the fatal path. This confirms exactly the crash described in the PR.

At the moment the crash is intentional. With the fix:

Pod stays alive (no crash)
Reconciliation cycle completes for all other resources
The skipped gateway/VirtualService produces no desired endpoints
ApplyChanges runs and deletes the DNS records for that gateway's hostnames
Error is logged at WARN level only
No retry signal - external-dns considers the cycle successful

Net effect: trades a pod crash for silent record deletion, which is arguably worse in production.
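
To make the deletion path concrete, here is a minimal sketch of a sync-style diff. It is not the external-dns planner: `deletions` and the record names are hypothetical, and real planning also considers ownership and record types. It only illustrates that a source skipped with a warning looks identical to a source whose records are no longer wanted.

```go
package main

import (
	"fmt"
	"sort"
)

// deletions returns every currently managed record name that is absent
// from the desired set, which is what a sync-style policy would remove.
func deletions(current, desired []string) []string {
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
	}
	var out []string
	for _, name := range current {
		if !want[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	current := []string{"a.example.com", "b.example.com"}
	// The source behind b.example.com hit an error and was skipped with a
	// warning, so only a.example.com is reported as desired this cycle.
	desired := []string{"a.example.com"}
	fmt.Println(deletions(current, desired)) // prints [b.example.com]
}
```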

I leave it to the other maintainers to review and decide.

One option is to use a mechanism that already exists in the codebase: SoftError. The provider package is already imported in source/service.go, and Run() already handles it:

// controller/controller.go Run()
  if errors.Is(err, provider.SoftError) {
      log.Errorf("Failed to do run once: %v (consecutive soft errors: %d)", err, softErrorCount)
      // continues the loop — but crucially, RunOnce() already returned before ApplyChanges
  } else {
      return err // → log.Fatal
  }

When Endpoints() returns a SoftError:

RunOnce() returns early, before plan.Calculate() and ApplyChanges() are ever reached
Records are preserved
Pod stays alive
Next cycle retries
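
The behaviour listed above can be sketched without any external-dns imports. Everything here is an assumption made for illustration: `errSoft` plays the role of provider.SoftError, `errFatal` is any other error, and `runOnce`, `run`, and `failingSource` are simplified stand-ins for the controller loop, not its actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errSoft is a hypothetical sentinel playing the role of provider.SoftError.
var errSoft = errors.New("soft error")

// errFatal is a hypothetical example of any error that is not soft.
var errFatal = errors.New("hard failure")

// failingSource returns an endpoints function that always fails with err.
func failingSource(err error) func() ([]string, error) {
	return func() ([]string, error) { return nil, err }
}

// runOnce collects endpoints and only then applies changes. A failing
// source returns before apply is called, so no records are touched.
func runOnce(endpoints func() ([]string, error), apply func([]string)) error {
	eps, err := endpoints()
	if err != nil {
		return err
	}
	apply(eps)
	return nil
}

// run executes up to cycles reconciliations. Soft errors are logged and
// retried on the next cycle; any other error stops the loop.
func run(cycles int, endpoints func() ([]string, error), apply func([]string)) error {
	for i := 0; i < cycles; i++ {
		if err := runOnce(endpoints, apply); err != nil {
			if errors.Is(err, errSoft) {
				fmt.Println("soft error, will retry:", err)
				continue
			}
			return fmt.Errorf("failed to do run once: %w", err)
		}
	}
	return nil
}

func main() {
	applied := 0
	apply := func([]string) { applied++ }

	// Wrapping with %w keeps errors.Is(err, errSoft) true.
	soft := failingSource(fmt.Errorf("ingress not found: %w", errSoft))
	fmt.Println(run(3, soft, apply), applied)

	fmt.Println(run(3, failingSource(errFatal), apply), applied)
}
```

In both cases `apply` is never reached, so records are preserved; the difference is only whether the loop keeps going or hands the error to the caller.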

ivankatliarchuk · 2026-05-15T09:36:36Z

@@ -167,7 +169,8 @@ func (sc *gatewaySource) Endpoints(_ context.Context) ([]*endpoint.Endpoint, err
 			},
 		)
 		if err != nil {


This should be reverted.

A template error is a configuration error that will fail for every resource on every reconciliation cycle, producing a warning flood without ever halting or surfacing the root cause. Every other source propagates template errors up.

ivankatliarchuk · 2026-05-15T09:38:15Z

+// is implemented by an Ingress object instead of a standard LoadBalancer service type.
+// This must be a function (not a package-level var) because the annotation prefix can
+// be customized at runtime via --annotation-prefix / SetAnnotationPrefix.
+func IstioGatewayIngressSource() string { return annotations.Ingress }


It is not clear why this wrapper is required at all. We could remove it and use annotations.Ingress directly to satisfy the split-brain annotation.

The bug is related to that feature: #5923

We may have similar issues elsewhere.

ivankatliarchuk · 2026-05-15T09:46:37Z

 		gwEndpoints, err := sc.endpointsFromGateway(gwHostnames, gateway)
 		if err != nil {
-			return nil, err
+			log.Warnf("Could not generate endpoints for gateway '%s/%s': %v", gateway.Namespace, gateway.Name, err)


I need the other maintainers' view on this one. From a high-level perspective it makes sense, but it is very much on the edge, as there is a behaviour change.

The PR's blanket warn+continue on endpointsFromGateway is an API change for a specific reason. Look at where targetsFromGateway can return errors from:

  endpointsFromGateway
    └─ targetsFromGateway
         ├─ targetsFromIngress            ← per-resource: bad annotation / ingress not found
         └─ EndpointTargetsFromServices   ← SYSTEMIC: fails to list services in a namespace

EndpointTargetsFromServices failing means the informer/API is broken — that affects every gateway, not one. Swallowing that with warn+continue would silently skip all gateways each cycle with no escalation.

There's also a double-logging bug: targetsFromIngress already calls log.Error(err) before returning, and then the PR adds log.Warnf(...) at the caller - two log lines at two different levels for the same event.

The error handling belongs inside targetsFromIngress, not at the Endpoints() loop level. When the annotation references an ingress that doesn't exist, that's not an error — it means zero targets for that gateway (the same outcome as a service with no load balancer IPs, which already returns empty with no error).

This is not a call to action, but a probable change.

Both istio_gateway.go and istio_virtualservice.go have identical targetsFromIngress implementations. The fix is the same in both:

  ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
  if err != nil {
      log.Error(err)   // remove this
      return nil, err  // replace with ↓
  }

  ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
  if err != nil {
      log.Warnf("Could not get ingress %s/%s referenced by gateway %s/%s: %v", namespace, name, gateway.Namespace, gateway.Name, err)
      return endpoint.Targets{}, nil
  }
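
A self-contained sketch of that idea follows, with one deliberate variation that is an assumption and not part of the proposal above: only a "not found" result is converted to zero targets, while every other lookup error is still propagated so that systemic failures keep escalating. `errNotFound`, `errListFailed`, `lookup`, and `fakeLister` are hypothetical stand-ins; with a real Kubernetes lister the check would more likely use apierrors.IsNotFound from k8s.io/apimachinery.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel for "the referenced object does
// not exist"; real Kubernetes clients report this condition differently.
var errNotFound = errors.New("not found")

// errListFailed is a hypothetical stand-in for a systemic lookup failure.
var errListFailed = errors.New("list failed")

// lookup is a hypothetical stand-in for an informer lister.
type lookup func(namespace, name string) ([]string, error)

// fakeLister returns a lookup backed by a map, or one that always fails
// with failWith when failWith is non-nil.
func fakeLister(data map[string][]string, failWith error) lookup {
	return func(namespace, name string) ([]string, error) {
		if failWith != nil {
			return nil, failWith
		}
		targets, ok := data[namespace+"/"+name]
		if !ok {
			return nil, fmt.Errorf("ingress %q: %w", name, errNotFound)
		}
		return targets, nil
	}
}

// targetsFromIngress treats a missing ingress as zero targets and logs
// once, but still propagates every other error to the caller.
func targetsFromIngress(get lookup, namespace, name string) ([]string, error) {
	targets, err := get(namespace, name)
	if errors.Is(err, errNotFound) {
		fmt.Printf("warning: ingress %s/%s not found, no targets\n", namespace, name)
		return []string{}, nil
	}
	if err != nil {
		return nil, err
	}
	return targets, nil
}

func main() {
	lister := fakeLister(map[string][]string{"default/web": {"10.0.0.1"}}, nil)
	fmt.Println(targetsFromIngress(lister, "default", "web"))
	fmt.Println(targetsFromIngress(lister, "default", "missing"))
	fmt.Println(targetsFromIngress(fakeLister(nil, errListFailed), "default", "web"))
}
```

Logging inside the function and returning a nil error also avoids the second log line at the caller for the same event.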

ivankatliarchuk · 2026-05-16T21:41:41Z

It may be worth splitting the bug fixes, as one seems straightforward.

mwhittington21 · 2026-05-18T23:48:59Z

Thanks for the review. I'll split the bugfixes, address the feedback, and then request a re-review.

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 14, 2026

k8s-ci-robot requested a review from szuecs May 14, 2026 14:08

k8s-ci-robot added source size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 14, 2026

k8s-ci-robot requested a review from vflaux May 14, 2026 14:08

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 14, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2026

ivankatliarchuk reviewed May 15, 2026

mwhittington21 wants to merge 1 commit into kubernetes-sigs:master from mwhittington21:mwhittington/fix-istio-annotation-prefix-and-crash
