chore: fix two bugs associated with istio ingress annotation processing#6439
Conversation
Fix 1: fix(istio): resolve annotation prefix not applied to gateway/ingress annotation keys IstioGatewayIngressSource and K8sGatewaySource were package-level vars evaluated at import time, before --annotation-prefix is applied. This caused Istio gateways using the /ingress or /gateway annotations to silently produce no endpoints, resulting in record deletion under sync policy. Fix 2: fix(istio): skip gateways with bad annotations instead of crashing the controller A nonexistent gateway or ingress reference in an annotation caused a fatal error that killed the pod. Now logs a warning and continues processing remaining resources.
|
Hi @mwhittington21. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/ok-to-test |
Coverage Report for CI Build 25864688313Warning Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes. Coverage decreased (-0.003%) to 80.595%Details
Uncovered ChangesNo uncovered changes found. Coverage Regressions33 previously-covered lines in 3 files lost coverage.
Coverage Stats
💛 - Coveralls |
ivankatliarchuk
left a comment
There was a problem hiding this comment.
The full call chain is clear. Here's what currently happens:
targetsFromIngress → return nil, err
↓
endpointsFromGateway → return nil, err
↓
Endpoints() → return nil, err
↓ (increments sourceErrorsTotal metric)
RunOnce() → return err
↓ (checks errors.Is(err, provider.SoftError) — this is NOT a SoftError)
Run() → return fmt.Errorf("failed to do run once: %w", err)
↓
execute.go → log.Fatal(err) ← pod crashes
The Run() loop has two paths for errors:
- SoftError → logs at ERROR, increments a counter, continues the loop
- anything else → breaks the loop → log.Fatal → pod exits
Source errors (from Endpoints()) are plain Go errors - not SoftError - so they always take the fatal path. This confirms exactly the crash described in the PR.
At the moment crash is intentional, with the fix
- Pod stays alive (no crash)
- Reconciliation cycle completes for all other resources
- The skipped gateway/VirtualService produces no desired endpoints
- ApplyChanges runs and deletes the DNS records for that gateway's hostnames
- Error is logged at WARN level only
- No retry signal - external-dns considers the cycle successful
Net effect: trades a pod crash for silent record deletion, which is arguably worse in production.
I leave it for other maintainers to review, think of a decision.
One of the options is to introduce the mechanism that already exists in the codebase: SoftError. The provider package is already imported in source/service.go, and Run() already handles it:
// controller/controller.go Run()
if errors.Is(err, provider.SoftError) {
log.Errorf("Failed to do run once: %v (consecutive soft errors: %d)", err, softErrorCount)
// continues the loop — but crucially, RunOnce() already returned before ApplyChanges
} else {
return err // → log.Fatal
}
When Endpoints() returns a SoftError:
- RunOnce() returns early, before plan.Calculate() and ApplyChanges() are ever reached
- Records are preserved
- Pod stays alive
- Next cycle retries
| @@ -167,7 +169,8 @@ func (sc *gatewaySource) Endpoints(_ context.Context) ([]*endpoint.Endpoint, err | |||
| }, | |||
| ) | |||
| if err != nil { | |||
There was a problem hiding this comment.
This should be reverted.
A template error is a configuration error that will fail for every resource on every reconciliation cycle, producing a warning flood without ever halting or surfacing the root cause. Every other source propagates template errors up.
| // is implemented by an Ingress object instead of a standard LoadBalancer service type. | ||
| // This must be a function (not a package-level var) because the annotation prefix can | ||
| // be customized at runtime via --annotation-prefix / SetAnnotationPrefix. | ||
| func IstioGatewayIngressSource() string { return annotations.Ingress } |
There was a problem hiding this comment.
Technically not clear why this wrapper is even required, we could just remove the wrapper, use annotations.Ingress directly, to satisfy split brain annotation.
The bug related to that feature #5923
We may going to have similar issues elsewhere.
| gwEndpoints, err := sc.endpointsFromGateway(gwHostnames, gateway) | ||
| if err != nil { | ||
| return nil, err | ||
| log.Warnf("Could not generate endpoints for gateway '%s/%s': %v", gateway.Namespace, gateway.Name, err) |
There was a problem hiding this comment.
Need other maintainers view on that one. From high level perspective it make sense, but very much on the edge. As there is is a behaviour change.
The PR's blanket warn+continue on endpointsFromGateway is API change for a specific reason - look at what targetsFromGateway can return errors from:
endpointsFromGateway
└─ targetsFromGateway
├─ targetsFromIngress ← per-resource: bad annotation / ingress not found
└─ EndpointTargetsFromServices ← SYSTEMIC: fails to list services in a namespace
EndpointTargetsFromServices failing means the informer/API is broken — that affects every gateway, not one. Swallowing that with warn+continue would silently skip all gateways each cycle with no escalation.
There's also a double-logging bug: targetsFromIngress already calls log.Error(err) before returning, and then the PR adds log.Warnf(...) at the caller - two log lines at two different levels for the same event.
The error handling belongs inside targetsFromIngress, not at the Endpoints() loop level. When the annotation references an ingress that doesn't exist, that's not an error — it means zero targets for that gateway (the same outcome as a service with no load balancer IPs, which already returns empty with no error).
This is not a call for action, but a probabal change
Both istio_gateway.go and istio_virtualservice.go have identical targetsFromIngress implementations. The fix is the same in both:
ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
if err != nil {
log.Error(err) // remove this
return nil, err // replace with ↓
}
ingress, err := sc.ingressInformer.Lister().Ingresses(namespace).Get(name)
if err != nil {
log.Warnf("Could not get ingress %s/%s referenced by gateway %s/%s: %v", namespace, name, gateway.Namespace, gateway.Name, err)
return endpoint.Targets{}, nil
}
|
Maybe worth to split bug fixes, as one seems straightforward. |
|
Thanks for the review, I'll split the bugfixes and address the feedback then request a re-review |
What does it do ?
Discovered during development of #6438 which is unlikely to see a merge as-is, so I have separated the fixes into this PR to avoid issues with v0.22.0 when it launches.
Fix 1: fix(istio): resolve annotation prefix not applied to gateway/ingress annotation keys
IstioGatewayIngressSource and K8sGatewaySource were package-level vars evaluated at import time, before --annotation-prefix is applied. This caused Istio gateways using the /ingress or /gateway annotations to silently produce no endpoints, resulting in record deletion under sync policy.
Fix 2: fix(istio): skip gateways with bad annotations instead of crashing the controller
A nonexistent gateway or ingress reference in an annotation caused a fatal error that killed the pod. Now logs a warning and continues processing remaining resources.
Bug information
Bug 1
Supply
--annotation-prefix=external-dns.alpha.kubernetes.io/argument to pod. Start up pod in a cluster whereexternal-dns.alpha.kubernetes.io/ingresstype annotations are in use. Observe the following immediate deletions of resources using this pathway:After fix, these are recreated (or never deleted):
Bug 2
Before fix:
After fix:
Pod does not crash.
Motivation
These bugs stop us using external-dns for the purposes we need it for internally and would do so in quite a destructive way by removing records that are receiving live traffic. The crash is less severe, but provides an avenue for a denial of service attack in multi-tenant clusters.
More