Skip to content

fix(cloudflare): cache ZoneHasPaidPlan to avoid per-change API amplification#6392

Open
AndrewCharlesHay wants to merge 1 commit into
kubernetes-sigs:masterfrom
AndrewCharlesHay:fix/cloudflare-cache-plan-lookup
Open

fix(cloudflare): cache ZoneHasPaidPlan to avoid per-change API amplification#6392
AndrewCharlesHay wants to merge 1 commit into
kubernetes-sigs:masterfrom
AndrewCharlesHay:fix/cloudflare-cache-plan-lookup

Conversation

@AndrewCharlesHay
Copy link
Copy Markdown
Contributor

Addresses the second of the two HIGH items in #6391 (the rate-limit amplification). Leaving the Custom Hostname allowlist for a separate PR once maintainer direction on that is agreed.

What's wrong today

ZoneHasPaidPlan is called from newCloudFlareChange via trimAndValidateComment whenever a record comment exceeds 100 characters. Each call fires a ListZones (paginated) plus a GetZone against the operator's Cloudflare account, with no caching.

Because the comment is set by the external-dns.alpha.kubernetes.io/cloudflare-record-comment annotation, any user with create rights in a watched namespace can set a 101-char comment on N ingresses and cause 2N Cloudflare API calls per sync cycle. Once the operator's account gets rate-limited, real DNS changes come back as SoftError and the cluster's DNS stops converging until Cloudflare cools off.

The fix

Add a small TTL-based in-memory cache keyed by zone name (TLD+1), defaulting to 15 minutes. Plan changes are billing actions by the account owner, so a generous TTL introduces no meaningful staleness and eliminates the per-change fan-out entirely.

  • zone_plan_cache.go — the cache itself, with a now field overridable for deterministic TTL tests.
  • cloudflare.goZoneHasPaidPlan checks the cache first, populates it on miss.
  • The cache is optional: if the zonePlanCache field is nil (hand-constructed provider in existing tests), ZoneHasPaidPlan falls back to the original behavior. This keeps the existing TestZoneHasPaidPlan and other hand-constructed provider tests working unmodified.

Tests

  • TestZonePlanCache_HitAndMiss — basic get/set.
  • TestZonePlanCache_TTLEvicts — uses an injectable clock to advance past the TTL and confirm eviction.
  • TestZoneHasPaidPlan_Cached — populates the cache, then breaks the underlying API and confirms repeated calls still return the cached true (proves the cache actually short-circuits).
  • TestZoneHasPaidPlan_NilCache — confirms the nil-cache path still works end-to-end.

Existing TestZoneHasPaidPlan passes without modification.

Test plan

  • go test ./provider/cloudflare/...
  • go test -race ./provider/cloudflare/...
  • gofmt -l provider/cloudflare/*.go
  • go build ./...

Relates to #6391.

…ication

ZoneHasPaidPlan is called from the long-comment truncation path in
newCloudFlareChange, which runs once per record change. Each call fired
a ListZones + GetZone against the operator's Cloudflare account, so a
user able to set the cloudflare-record-comment annotation to a 101+
char string on N ingresses caused 2N API calls per sync cycle. Once
Cloudflare rate-limited the operator's account, real DNS changes came
back as SoftError and convergence stalled until the throttle lifted.

Add a simple TTL cache keyed by zone (15 min default). Plan changes
are billing actions, so a generous TTL is fine and it kills the
amplification entirely. The cache field is optional: ZoneHasPaidPlan
still works against a hand-constructed provider (nil cache) so
existing tests keep compiling without modification.

Fixes the rate-limit half of kubernetes-sigs#6391. The Custom Hostname allowlist
(the other half) is tracked in the same issue and will follow up
once the maintainer direction is agreed.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 24, 2026
@k8s-ci-robot k8s-ci-robot added the provider Issues or PRs related to a provider label Apr 24, 2026
@k8s-ci-robot k8s-ci-robot requested a review from mloiseleur April 24, 2026 02:01
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2026
@coveralls
Copy link
Copy Markdown

Coverage Report for CI Build 24868400870

Coverage increased (+0.03%) to 80.494%

Details

  • Coverage increased (+0.03%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • 31 coverage regressions across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

31 previously-covered lines in 1 file lost coverage.

File Lines Losing Coverage Coverage
cloudflare/cloudflare.go 31 93.01%

Coverage Stats

Coverage Status
Relevant Lines: 21409
Covered Lines: 17233
Line Coverage: 80.49%
Coverage Strength: 1464.54 hits per line

💛 - Coveralls

@ivankatliarchuk
Copy link
Copy Markdown
Member

we have a blueprint for cache https://github.com/kubernetes-sigs/external-dns/tree/master/provider/blueprint

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. provider Issues or PRs related to a provider size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants