Skip to content

chore: fix two bugs associated with istio ingress annotation processing#6439

Open
mwhittington21 wants to merge 1 commit into
kubernetes-sigs:masterfrom
mwhittington21:mwhittington/fix-istio-annotation-prefix-and-crash
Open

chore: fix two bugs associated with istio ingress annotation processing#6439
mwhittington21 wants to merge 1 commit into
kubernetes-sigs:masterfrom
mwhittington21:mwhittington/fix-istio-annotation-prefix-and-crash

Conversation

@mwhittington21
Copy link
Copy Markdown

What does it do ?

Discovered during development of #6438 which is unlikely to see a merge as-is, so I have separated the fixes into this PR to avoid issues with v0.22.0 when it launches.

Fix 1: fix(istio): resolve annotation prefix not applied to gateway/ingress annotation keys

IstioGatewayIngressSource and K8sGatewaySource were package-level vars evaluated at import time, before --annotation-prefix is applied. This caused Istio gateways using the /ingress or /gateway annotations to silently produce no endpoints, resulting in record deletion under sync policy.

Fix 2: fix(istio): skip gateways with bad annotations instead of crashing the controller

A nonexistent gateway or ingress reference in an annotation caused a fatal error that killed the pod. Now logs a warning and continues processing remaining resources.

Bug information

Bug 1

Supply --annotation-prefix=external-dns.alpha.kubernetes.io/ argument to pod. Start up pod in a cluster where external-dns.alpha.kubernetes.io/ingress type annotations are in use. Observe the following immediate deletions of resources using this pathway:

time="2026-05-14T13:07:59Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:08:00Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:08:00Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE argocd.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE ckkj-argocd-proxy.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE cname-argocd.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:08:00Z" level=info msg="Desired change: DELETE cname-ckkj-argocd-proxy.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
...etc

After fix, these are recreated (or never deleted):

time="2026-05-14T13:20:24Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:20:25Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:20:25Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE argocd.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE ckkj-argocd-proxy.<redacted> A" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE cname-argocd.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
time="2026-05-14T13:20:25Z" level=info msg="Desired change: CREATE cname-ckkj-argocd-proxy.<redacted> TXT" profile=default zoneID=/hostedzone/<redacted> zoneName=<redacted>
...etc

Bug 2

Before fix:

time="2026-05-14T13:07:09Z" level=info msg="GitCommitShort=unknown, GoVersion=go1.26.1, Platform=linux/amd64, UserAgent=ExternalDNS/v0.21.0-dirty"
time="2026-05-14T13:07:09Z" level=info msg="Created Kubernetes client https://10.35.128.1:443"
time="2026-05-14T13:07:09Z" level=info msg="Created GatewayAPI client https://10.35.128.1:443"
time="2026-05-14T13:07:09Z" level=info msg="Records cache provider: refreshing records list cache"
time="2026-05-14T13:07:10Z" level=error msg="ingress.networking.k8s.io \"default\" not found"
time="2026-05-14T13:07:10Z" level=fatal msg="Failed to do run once: ingress.networking.k8s.io \"default\" not found"
<pod crashes here>

After fix:

time="2026-05-14T13:20:36Z" level=error msg="ingress.networking.k8s.io \"default\" not found"
time="2026-05-14T13:20:36Z" level=warning msg="Could not generate endpoints for VirtualService 'mwhittington-test/mwhittington-test-vsvc': ingress.networking.k8s.io \"default\" not found"

Pod does not crash.

Motivation

These bugs stop us using external-dns for the purposes we need it for internally and would do so in quite a destructive way by removing records that are receiving live traffic. The crash is less severe, but provides an avenue for a denial of service attack in multi-tenant clusters.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

Fix 1: fix(istio): resolve annotation prefix not applied to gateway/ingress annotation keys

IstioGatewayIngressSource and K8sGatewaySource were package-level vars evaluated at import time, before --annotation-prefix is applied. This caused Istio gateways using the /ingress or /gateway annotations to silently produce no endpoints, resulting in record deletion under sync policy.

Fix 2: fix(istio): skip gateways with bad annotations instead of crashing the controller

A nonexistent gateway or ingress reference in an annotation caused a fatal error that killed the pod. Now logs a warning and continues processing remaining resources.
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 14, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @mwhittington21. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mloiseleur for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from szuecs May 14, 2026 14:08
@k8s-ci-robot k8s-ci-robot added source size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 14, 2026
@k8s-ci-robot k8s-ci-robot requested a review from vflaux May 14, 2026 14:08
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 14, 2026
@ivankatliarchuk
Copy link
Copy Markdown
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2026
@coveralls
Copy link
Copy Markdown

Coverage Report for CI Build 25864688313

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.003%) to 80.595%

Details

  • Coverage decreased (-0.003%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 33 coverage regressions across 3 files.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

33 previously-covered lines in 3 files lost coverage.

File Lines Losing Coverage Coverage
istio_virtualservice.go 20 87.83%
istio_gateway.go 12 90.36%
openshift_route.go 1 82.93%

Coverage Stats

Coverage Status
Relevant Lines: 21391
Covered Lines: 17240
Line Coverage: 80.59%
Coverage Strength: 1451.97 hits per line

💛 - Coveralls

Copy link
Copy Markdown
Member

@ivankatliarchuk ivankatliarchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full call chain is clear. Here's what currently happens:

  targetsFromIngress          → return nil, err
    ↓
  endpointsFromGateway        → return nil, err
    ↓
  Endpoints()                 → return nil, err
    ↓ (increments sourceErrorsTotal metric)
  RunOnce()                   → return err
    ↓ (checks errors.Is(err, provider.SoftError) — this is NOT a SoftError)
  Run()                       → return fmt.Errorf("failed to do run once: %w", err)
    ↓
  execute.go              → log.Fatal(err)   ← pod crashes

The Run() loop has two paths for errors:

  • SoftError → logs at ERROR, increments a counter, continues the loop
  • anything else → breaks the loop → log.Fatal → pod exits

Source errors (from Endpoints()) are plain Go errors - not SoftError - so they always take the fatal path. This confirms exactly the crash described in the PR.

At the moment crash is intentional, with the fix

  • Pod stays alive (no crash)
  • Reconciliation cycle completes for all other resources
  • The skipped gateway/VirtualService produces no desired endpoints
  • ApplyChanges runs and deletes the DNS records for that gateway's hostnames
  • Error is logged at WARN level only
  • No retry signal - external-dns considers the cycle successful

Net effect: trades a pod crash for silent record deletion, which is arguably worse in production.

I leave it for other maintainers to review, think of a decision.

One of the options is to introduce the mechanism that already exists in the codebase: SoftError. The provider package is already imported in source/service.go, and Run() already handles it:

// controller/controller.go Run()
  if errors.Is(err, provider.SoftError) {
      log.Errorf("Failed to do run once: %v (consecutive soft errors: %d)", err, softErrorCount)
      // continues the loop — but crucially, RunOnce() already returned before ApplyChanges
  } else {
      return err // → log.Fatal
  }

When Endpoints() returns a SoftError:

  • RunOnce() returns early, before plan.Calculate() and ApplyChanges() are ever reached
  • Records are preserved
  • Pod stays alive
  • Next cycle retries

Comment thread source/istio_gateway.go
@@ -167,7 +169,8 @@ func (sc *gatewaySource) Endpoints(_ context.Context) ([]*endpoint.Endpoint, err
},
)
if err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be reverted.

A template error is a configuration error that will fail for every resource on every reconciliation cycle, producing a warning flood without ever halting or surfacing the root cause. Every other source propagates template errors up.

Comment thread source/istio_gateway.go
// is implemented by an Ingress object instead of a standard LoadBalancer service type.
// This must be a function (not a package-level var) because the annotation prefix can
// be customized at runtime via --annotation-prefix / SetAnnotationPrefix.
func IstioGatewayIngressSource() string { return annotations.Ingress }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not clear why this wrapper is even required, we could just remove the wrapper, use annotations.Ingress directly, to satisfy split brain annotation.

The bug related to that feature #5923

We may going to have similar issues elsewhere.

Comment thread source/istio_gateway.go
gwEndpoints, err := sc.endpointsFromGateway(gwHostnames, gateway)
if err != nil {
return nil, err
log.Warnf("Could not generate endpoints for gateway '%s/%s': %v", gateway.Namespace, gateway.Name, err)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need other maintainers view on that one. From high level perspective it make sense, but very much on the edge. As there is is a behaviour change.

The PR's blanket warn+continue on endpointsFromGateway is API change for a specific reason - look at what targetsFromGateway can return errors from:

  endpointsFromGateway
    └─ targetsFromGateway
         ├─ targetsFromIngress        ← per-resource: bad annotation / ingress not found
         └─ EndpointTargetsFromServices ← SYSTEMIC: fails to list services in a namespace

EndpointTargetsFromServices failing means the informer/API is broken — that affects every gateway, not one. Swallowing that with warn+continue would silently skip all gateways each cycle with no escalation.

There's also a double-logging bug: targetsFromIngress already calls log.Error(err) before returning, and then the PR adds log.Warnf(...) at the caller - two log lines at two different levels for the same event.

The error handling belongs inside targetsFromIngress, not at the Endpoints() loop level. When the annotation references an ingress that doesn't exist, that's not an error — it means zero targets for that gateway (the same outcome as a service with no load balancer IPs, which already returns empty with no error).

This is not a call for action, but a probabal change

Both istio_gateway.go and istio_virtualservice.go have identical targetsFromIngress implementations. The fix is the same in both:

  ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
  if err != nil {
      log.Error(err)          // remove this
      return nil, err         // replace with ↓
  }

  ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
  if err != nil {
      log.Warnf("Could not get ingress %s/%s referenced by gateway %s/%s: %v", namespace, name, gateway.Namespace, gateway.Name, err)
      return endpoint.Targets{}, nil
  }

@ivankatliarchuk
Copy link
Copy Markdown
Member

ivankatliarchuk commented May 16, 2026

Maybe worth to split bug fixes, as one seems straightforward.

@mwhittington21
Copy link
Copy Markdown
Author

Thanks for the review, I'll split the bugfixes and address the feedback then request a re-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants