fix: Prevent orphaning records due to label lengths#6431
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
|
Welcome @mmiller-sh! |
|
Hi @mmiller-sh. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
755bdef to
c9b1efb
Compare
u-kai
left a comment
There was a problem hiding this comment.
Thank you for your contribution!
Could you take a look at my comment?
| txts := im.generateTXTRecord(r) | ||
| if len(txts) == 0 { | ||
| log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName) | ||
| continue | ||
| } |
There was a problem hiding this comment.
Is my understand correct that your fix doesn't bypass the string limit, but simply output log?
There was a problem hiding this comment.
It prevents the A/AAAA/CNAME records from being created (and orphaned) in the first place when the TXT record names cross the string limit, as well as logging the event.
There was a problem hiding this comment.
In my opinion, this change breaks backward compatibility and feels more like a workaround.
I would prefer changing the TXT registry format when the record lenght limit is reached.
What do you think?
There was a problem hiding this comment.
I agree this is a workaround. I believe moving the record type into it's own label when it would overflow, ie: TXT cname.label1.label2.com should fix the immediate issue. Let me look into what might be needed to safely implement this in a non-breaking way.
There was a problem hiding this comment.
In my opinion, this change breaks backward compatibility and feels more like a workaround. I would prefer changing the TXT registry format when the record lenght limit is reached.
What do you think?
I have not read that. Actually my view is opposite #6436 (comment). Tool should not change the registry format dynamically. This is an operational hell.
Frankly speaking, the user should have a convention, and manage zone naming. When the DNS hits 50 characters, it's already a signal that it's a time to create a new label/zone
|
Reopening based on further feedback on the alternate approach. #6436 (comment) |
|
/ok-to-test |
Coverage Report for CI Build 25920478361Coverage increased (+0.01%) to 80.632%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
ivankatliarchuk
left a comment
There was a problem hiding this comment.
This change may take a while, as it changes behaviour regardless of how we are going to implement it.
Most likely we should consider a new flag, something like.
--strict-label-compliance
--strict-dns-compliance
Depends what other think about that. I know internally we had a discussion while back, to find ways to reduce number of flags. So No flag, no config, always enforced.o flag, no config, always enforced is an option too.
A new flag implies this is optional or user-configurable behavior. But if ExternalDNS can't establish ownership it most likely should never silently create the record - that's a correctness guarantee, not a preference.
| for _, r := range pendingCreate { | ||
| if r.Labels == nil { | ||
| r.Labels = make(map[string]string) | ||
| } | ||
| r.Labels[endpoint.OwnerLabelKey] = im.ownerID | ||
|
|
||
| filteredChanges.Create = append(filteredChanges.Create, im.generateTXTRecordWithFilter(r, im.existingTXTs.isAbsent)...) | ||
| // Skip records whose ownership TXT cannot be established; creating | ||
| // them would leak an unreclaimable record into the zone. | ||
| txts := im.generateTXTRecord(r) | ||
| if len(txts) == 0 { | ||
| log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName) | ||
| continue | ||
| } | ||
|
|
||
| filteredChanges.Create = append(filteredChanges.Create, r) | ||
| for _, txt := range txts { | ||
| if im.existingTXTs.isAbsent(txt) { | ||
| filteredChanges.Create = append(filteredChanges.Create, txt) | ||
| } | ||
| } |
There was a problem hiding this comment.
The change manually re-inlines the isAbsent filter after splitting from generateTXTRecordWithFilter. This is fragile if the helper ever gains new behavior. Not sure if it should be even here
| // for each created/deleted record it will also take into account TXT records for creation/deletion | ||
| func (im *TXTRegistry) ApplyChanges(ctx context.Context, changes *plan.Changes) error { | ||
| filteredChanges := &plan.Changes{ | ||
| Create: changes.Create, |
There was a problem hiding this comment.
Something like
Create: endpoint.FilterEndpointsByDNSCompliance(
func(ep *endpoint.Endpoint) string { return im.mapper.ToTXTName(ep.DNSName, ep.RecordType) },
changes.Create,
),
and
// FilterEndpointsByDNSCompliance filters out endpoints for which the name derived by
// nameFunc violates RFC 1035 label length constraints (labels must be ≤ 63 characters).
func FilterEndpointsByDNSCompliance(nameFunc func(*Endpoint) string, eps []*Endpoint) []*Endpoint {
filtered := make([]*Endpoint, 0, len(eps))
for _, ep := range eps {
valid := true
for label := range strings.SplitSeq(nameFunc(ep), ".") {
if len(label) > 63 {
log.Warnf("Skipping endpoint %s %s: label %q exceeds RFC 1035 63-char limit", ep.RecordType, ep.DNSName, label)
valid = false
break
}
}
if valid {
filtered = append(filtered, ep)
}
}
return filtered
}
To be consistent with the rest
🤔 As you said, @ivankatliarchuk
So if we were to introduce a new flag, that would be to support this corner case between 57 & 63 ? It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks. => So to me, this should be resolved with no new flags and a breaking chance with always enforced behavior where it does not create the record and log a clear message when txt registry is enabled and the domain is 57+ chars. @u-kai Wdyt ? Is there anything missed with this approach? |
|
"It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks" This is a tricky one. Not creating an orphaned record is most likely better than creating one. That's the right call - a record with no ownership TXT is worse than no record at all. The task isn't "unreachable." The A/AAAA/CNAME record itself is perfectly valid — the label is under 63 chars, and created without any problem. What's unreachable is establishing ownership via the TXT registry mechanism. The constraint is an internal implementation detail of external-dns. "fail gracefully" usually implies the failure is clearly communicated. A log.Warnf buried in controller output is not graceful from a user perspective - the Ingress has no DNS record and use must to grep logs to find out why. A truly graceful failure in a Kubernetes controller context would surface a condition or event on the object with metric, a kubernetes event, or at minimum a log.Errorf. Silent skips with warn-level logs have caused production incidents in controllers before and will cause it most likely in the future. I'm not against changing the behaviour, If we all think that orphaned records are worse than missing ones, we could have no flag for that. I've composed some pros/cons for behavior Orphaned DNS record (current behavior) Pros:
Cons:
No record at all (this PR) Pros:
Cons:
Orphaned record is the worse long-term outcome - it's hidden, permanent, and requires manual intervention (not great for controller). Missing record is painful immediately but at least it's honest and recoverable (fix the name, DNS appears) The catch is that "no record" is only acceptable if the failure is loud enough to act on. Right now the log level and lack of a Kubernetes event and metric make it too quiet for the severity of the consequence (broken service). That's the gap worth be clear about. |
|
Noting that the controller currently logs the naming violation when it attempts to reconcile the invalid TXT records, it does not log anything about orphaned records though. What would you think about increasing log level to error when a record would have been orphaned and adding a Prometheus counter that increments on skips? I'm happy to make both of these changes to make it easier to surface failures involving this DNS spec nuance. I have mild hesitations around adding a metric specific to this corner case/bug fix, but can rationalize it to myself given it would be a single counter. Let me know if that seems reasonable, thank you! |
|
I'm just sharing thoughts and ideas at the moment |
I agree with this approach for simplicity. @mmiller-sh I can see the concern about dynamically changing the registry format now. Keeping the behavior simple and narrowing the supported scope in ExternalDNS instead of introducing more operational complexity sounds like the better approach to me as well. |
What does it do ?
This prevents the controller from creating A/AAAA/CNAME records when a corresponding ownership TXT record is unable to be created for the domain.
In cases where the A/AAAA/CNAME record is under the 63 character DNS label length limit, but the generated TXT record label (with
aaaa-orcname-prefixes) exceeds the limit, these records will become orphaned by the controller given lack of established ownership.Motivation
Fixes: #6430
More