fix: Prevent orphaning records due to label lengths #6431

Open
mmiller-sh wants to merge 1 commit into kubernetes-sigs:master from mmiller-sh:fix/orphaned-records

Conversation

@mmiller-sh mmiller-sh commented May 11, 2026

What does it do?

This prevents the controller from creating A/AAAA/CNAME records when the corresponding ownership TXT record cannot be created for the domain.

When an A/AAAA/CNAME record name is within the 63-character DNS label length limit but the generated TXT record label (with the aaaa- or cname- prefix) exceeds it, the record is created without an ownership TXT record. With no established ownership, the controller treats the record as orphaned.

Motivation

Fixes: #6430

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the registry Issues or PRs related to a registry label May 11, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 11, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: mmiller-sh / name: Matt Miller (c9b1efb)

@k8s-ci-robot
Contributor

Welcome @mmiller-sh!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @mmiller-sh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 11, 2026
@mmiller-sh mmiller-sh force-pushed the fix/orphaned-records branch from 755bdef to c9b1efb Compare May 11, 2026 12:19
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 11, 2026
@mmiller-sh mmiller-sh marked this pull request as ready for review May 11, 2026 12:19
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@k8s-ci-robot k8s-ci-robot requested a review from u-kai May 11, 2026 12:19
Member

@u-kai u-kai left a comment

Thank you for your contribution!
Could you take a look at my comment?

Comment thread registry/txt/registry.go
Comment on lines +352 to +356
txts := im.generateTXTRecord(r)
if len(txts) == 0 {
    log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName)
    continue
}
Member

Is my understanding correct that your fix doesn't bypass the string limit, but simply outputs a log?

Author

@mmiller-sh mmiller-sh May 11, 2026

It prevents the A/AAAA/CNAME records from being created (and orphaned) in the first place when the TXT record names cross the string limit, as well as logging the event.

Member

In my opinion, this change breaks backward compatibility and feels more like a workaround.
I would prefer changing the TXT registry format when the record length limit is reached.

What do you think?

Author

@mmiller-sh mmiller-sh May 12, 2026

I agree this is a workaround. I believe moving the record type into its own label when it would overflow, i.e. TXT cname.label1.label2.com, should fix the immediate issue. Let me look into what might be needed to safely implement this in a non-breaking way.

Author

@u-kai I created a new PR with the alternate approach: #6436

Thanks!

Member

In my opinion, this change breaks backward compatibility and feels more like a workaround. I would prefer changing the TXT registry format when the record length limit is reached.

What do you think?

I have not read that. Actually, my view is the opposite: #6436 (comment). The tool should not change the registry format dynamically. This is operational hell.

Frankly speaking, the user should have a convention and manage zone naming. When a DNS name hits 50 characters, that is already a signal that it's time to create a new label/zone.

@mmiller-sh
Author

Reopening based on further feedback on the alternate approach. #6436 (comment)

@mmiller-sh mmiller-sh reopened this May 15, 2026
@mloiseleur
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 16, 2026
@coveralls

Coverage Report for CI Build 25920478361

Coverage increased (+0.01%) to 80.632%

Details

  • Coverage increased (+0.01%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 21401
Covered Lines: 17256
Line Coverage: 80.63%
Coverage Strength: 1451.39 hits per line

💛 - Coveralls

Member

@ivankatliarchuk ivankatliarchuk left a comment

This change may take a while, as it changes behaviour regardless of how we are going to implement it.

Most likely we should consider a new flag, something like:

--strict-label-compliance
--strict-dns-compliance

It depends on what others think about that. I know we had an internal discussion a while back about ways to reduce the number of flags. So "no flag, no config, always enforced" is an option too.

A new flag implies this is optional or user-configurable behavior. But if ExternalDNS can't establish ownership it most likely should never silently create the record - that's a correctness guarantee, not a preference.

Comment thread registry/txt/registry.go
Comment on lines +344 to +363
for _, r := range pendingCreate {
    if r.Labels == nil {
        r.Labels = make(map[string]string)
    }
    r.Labels[endpoint.OwnerLabelKey] = im.ownerID

    filteredChanges.Create = append(filteredChanges.Create, im.generateTXTRecordWithFilter(r, im.existingTXTs.isAbsent)...)
    // Skip records whose ownership TXT cannot be established; creating
    // them would leak an unreclaimable record into the zone.
    txts := im.generateTXTRecord(r)
    if len(txts) == 0 {
        log.Warnf("Skipping create of %s %s: cannot establish ownership TXT (label exceeds RFC 1035 63-char limit)", r.RecordType, r.DNSName)
        continue
    }

    filteredChanges.Create = append(filteredChanges.Create, r)
    for _, txt := range txts {
        if im.existingTXTs.isAbsent(txt) {
            filteredChanges.Create = append(filteredChanges.Create, txt)
        }
    }
Member

The change manually re-inlines the isAbsent filter after splitting from generateTXTRecordWithFilter. This is fragile if the helper ever gains new behavior. I'm not sure this logic should even be here.
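One way to read this concern is that the skip decision and the filter could live in a single helper, so callers never re-implement the filter loop. The sketch below is only an illustration under that reading: `Endpoint` is a pared-down stand-in type, and `ownedCreates`, `gen`, and `keep` are hypothetical names, not part of external-dns.

```go
package main

import "fmt"

// Endpoint is a pared-down stand-in for the project's endpoint type.
type Endpoint struct {
	DNSName    string
	RecordType string
}

// ownedCreates returns r followed by the ownership TXT records accepted by
// keep. ok is false when gen yields no TXT records, meaning ownership cannot
// be established and r should not be created at all.
func ownedCreates(r *Endpoint, gen func(*Endpoint) []*Endpoint, keep func(*Endpoint) bool) (creates []*Endpoint, ok bool) {
	txts := gen(r)
	if len(txts) == 0 {
		return nil, false
	}
	creates = append(creates, r)
	for _, txt := range txts {
		if keep(txt) {
			creates = append(creates, txt)
		}
	}
	return creates, true
}

func main() {
	// gen refuses to produce a TXT record for names longer than 20 characters
	// (an arbitrary threshold chosen only to keep the demo short).
	gen := func(ep *Endpoint) []*Endpoint {
		if len(ep.DNSName) > 20 {
			return nil
		}
		return []*Endpoint{{DNSName: "cname-" + ep.DNSName, RecordType: "TXT"}}
	}
	keepAll := func(*Endpoint) bool { return true }

	creates, ok := ownedCreates(&Endpoint{DNSName: "app.example.com", RecordType: "CNAME"}, gen, keepAll)
	fmt.Println(ok, len(creates))

	_, ok = ownedCreates(&Endpoint{DNSName: "a-much-longer-name.example.com", RecordType: "CNAME"}, gen, keepAll)
	fmt.Println(ok)
}
```

With this shape the caller only checks `ok` and appends the result, so any later change to the filtering rule stays in one place.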

Comment thread registry/txt/registry.go
// for each created/deleted record it will also take into account TXT records for creation/deletion
func (im *TXTRegistry) ApplyChanges(ctx context.Context, changes *plan.Changes) error {
    filteredChanges := &plan.Changes{
        Create: changes.Create,
Member

Something like

Create: endpoint.FilterEndpointsByDNSCompliance(
    func(ep *endpoint.Endpoint) string { return im.mapper.ToTXTName(ep.DNSName, ep.RecordType) },
    changes.Create,
),

and

// FilterEndpointsByDNSCompliance filters out endpoints for which the name derived by
// nameFunc violates RFC 1035 label length constraints (labels must be ≤ 63 characters).
func FilterEndpointsByDNSCompliance(nameFunc func(*Endpoint) string, eps []*Endpoint) []*Endpoint {
    filtered := make([]*Endpoint, 0, len(eps))
    for _, ep := range eps {
        valid := true
        for label := range strings.SplitSeq(nameFunc(ep), ".") {
            if len(label) > 63 {
                log.Warnf("Skipping endpoint %s %s: label %q exceeds RFC 1035 63-char limit", ep.RecordType, ep.DNSName, label)
                valid = false
                break
            }
        }
        if valid {
            filtered = append(filtered, ep)
        }
    }
    return filtered
}

To be consistent with the rest

@mloiseleur
Collaborator

we should consider a new flag [...] always enforced is an option too.

🤔 As you said, @ivankatliarchuk

A user hitting the limit had a label that was already 57+ chars, which is already pushing DNS convention well past reasonable hostname lengths.

So if we were to introduce a new flag, would it be only to support this corner case between 57 and 63 characters?
Is it really worth it when this usage is already past reasonable hostname lengths?

It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks.

=> So to me, this should be resolved with no new flags and a breaking change: always-enforced behavior where it does not create the record and logs a clear message when the TXT registry is enabled and the domain is 57+ chars.
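The 57-character figure follows from simple arithmetic on the label limit. The sketch below works it out, assuming the only addition to the first label is the record-type prefix named in the PR description (aaaa- or cname-) and that no extra user-configured prefix or suffix is in play. `maxFirstLabelLen` is a hypothetical helper.

```go
package main

import "fmt"

// maxLabelLen is the DNS label length limit from RFC 1035.
const maxLabelLen = 63

// maxFirstLabelLen returns the longest first label that still fits within
// the limit once the given prefix is prepended.
func maxFirstLabelLen(prefix string) int {
	return maxLabelLen - len(prefix)
}

func main() {
	for _, prefix := range []string{"aaaa-", "cname-"} {
		fmt.Printf("%-6s leaves room for %d characters\n", prefix, maxFirstLabelLen(prefix))
	}
}
```

Under these assumptions a CNAME label of exactly 57 characters still fits (57 + 6 = 63), and the overflow begins at 58 characters for CNAME and 59 for AAAA.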

@u-kai WDYT? Is there anything missed with this approach?

@ivankatliarchuk
Member

"It's a common and known pattern to fail gracefully when the user is asking the software to execute unreachable tasks"

This is a tricky one. Not creating an orphaned record is most likely better than creating one. That's the right call - a record with no ownership TXT is worse than no record at all.

The task isn't "unreachable." The A/AAAA/CNAME record itself is perfectly valid — the label is under 63 chars, and created without any problem. What's unreachable is establishing ownership via the TXT registry mechanism. The constraint is an internal implementation detail of external-dns.

"Fail gracefully" usually implies the failure is clearly communicated. A log.Warnf buried in controller output is not graceful from a user's perspective: the Ingress has no DNS record and the user must grep logs to find out why. A truly graceful failure in a Kubernetes controller context would surface a condition or event on the object, a metric, or at minimum a log.Errorf. Silent skips with warn-level logs have caused production incidents in controllers before and will most likely cause them again.

I'm not against changing the behaviour. If we all think that orphaned records are worse than missing ones, we could have no flag for that.

I've composed some pros/cons for each behavior

Orphaned DNS record (current behavior)

Pros:

  • Traffic actually reaches the destination — no service disruption at creation time

Cons:

  • external-dns can never reclaim it; it lives forever in the zone even after the Ingress/Service is deleted
  • Future reconciles see a record with no owner TXT → treat it as foreign → won't touch it
  • Zone accumulates drift; manual cleanup required
  • Could mask the label-length problem entirely - appears to "work" until deletion

No record at all (this PR)

Pros:

  • Zone stays clean, no unmanaged records accumulate
  • Failure is detectable (log, missing DNS, metric, event)
  • Controller remains authoritative -> what it creates, it can delete

Cons:

  • Service is unreachable immediately — real user-facing impact
  • Failure is only a Warnf in controller logs — easy to miss
  • User gets no signal on the Kubernetes resource itself (no event, no condition)

Orphaned record is the worse long-term outcome - it's hidden, permanent, and requires manual intervention (not great for controller). Missing record is painful immediately but at least it's honest and recoverable (fix the name, DNS appears)

The catch is that "no record" is only acceptable if the failure is loud enough to act on. Right now the log level and the lack of a Kubernetes event and metric make it too quiet for the severity of the consequence (a broken service). That's the gap worth being clear about.

@mmiller-sh
Author

mmiller-sh commented May 16, 2026

Note that the controller currently logs the naming violation when it attempts to reconcile the invalid TXT records. It does not log anything about orphaned records, though.

What would you think about increasing log level to error when a record would have been orphaned and adding a Prometheus counter that increments on skips? I'm happy to make both of these changes to make it easier to surface failures involving this DNS spec nuance. I have mild hesitations around adding a metric specific to this corner case/bug fix, but can rationalize it to myself given it would be a single counter. Let me know if that seems reasonable, thank you!
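A rough sketch of what that proposal could look like is shown below. It is deliberately dependency-free: an atomic counter stands in for the Prometheus counter and the standard `log` package stands in for the project's logger, so `skippedCreates` and `reportSkippedCreate` are hypothetical names rather than anything in external-dns.

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// skippedCreates is a stand-in for a Prometheus counter; a real change would
// register a counter with the project's metrics setup instead.
var skippedCreates atomic.Int64

// reportSkippedCreate counts one skipped record and logs it at error level.
func reportSkippedCreate(recordType, dnsName string) {
	skippedCreates.Add(1)
	log.Printf("level=error msg=%q type=%s name=%s",
		"skipping create: cannot establish ownership TXT", recordType, dnsName)
}

func main() {
	reportSkippedCreate("CNAME", "very-long-name.example.com")
	reportSkippedCreate("AAAA", "another-long-name.example.com")
	fmt.Println("skipped creates:", skippedCreates.Load())
}
```

A single counter like this would let operators alert on skips without parsing logs, which addresses the "too quiet" concern raised earlier in the thread.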

@ivankatliarchuk
Member

I'm just sharing thoughts and ideas at the moment

@u-kai
Member

u-kai commented May 17, 2026

=> So to me, this should be resolved with no new flags and a breaking change: always-enforced behavior where it does not create the record and logs a clear message when the TXT registry is enabled and the domain is 57+ chars.

I agree with this approach for simplicity.

@mmiller-sh
Sorry for the extra back-and-forth from my earlier suggestion.

I can see the concern about dynamically changing the registry format now. Keeping the behavior simple and narrowing the supported scope in ExternalDNS instead of introducing more operational complexity sounds like the better approach to me as well.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. registry Issues or PRs related to a registry size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

registry/txt: parent A/AAAA created without TXT ownership companion when registry name overflows 63 chars

6 participants