Skip to content

🌱 Sync with upstream open-cluster-management-io/cluster-proxy 20260427#568

Merged
openshift-merge-bot[bot] merged 3 commits into
stolostron:mainfrom
tesshuflower:stolostron-sync
Apr 28, 2026
Merged

🌱 Sync with upstream open-cluster-management-io/cluster-proxy 20260427#568
openshift-merge-bot[bot] merged 3 commits into
stolostron:mainfrom
tesshuflower:stolostron-sync

Conversation

@tesshuflower
Copy link
Copy Markdown

@tesshuflower tesshuflower commented Apr 27, 2026

Summary

Merge upstream open-cluster-management-io/cluster-proxy main to bring in 2 commits:

  • Add TokenReview result caching to service-proxy (upstream #282) — LRU+TTL cache for TokenReview results, raises default client-go QPS/burst to 50/100 to prevent client-side throttling under high concurrency
  • Support setting TLS profile (upstream #285) — Adds TLS profile ConfigMap watcher to all HTTPS-serving components (addon-manager, user-server, service-proxy). Related Jira: ACM-31249

Conflicts in go.mod, go.sum, and vendor resolved by taking upstream versions.

Test plan

  • Verify upstream CI passed on both PRs
  • Verify downstream-only files (.tekton/, renovate.json, sonar-project.properties, cmd/Dockerfile.rhtap, OWNERS) are intact
  • Presubmit CI passes

xuezhaojun and others added 3 commits April 2, 2026 02:05
…ent-io#282)

* Add TokenReview result caching to reduce API server load

The service-proxy performs TokenReview API calls for every incoming request
without any caching. Under high concurrency (e.g., ArgoCD push model traffic
routed through cluster-proxy), this creates excessive load on both the managed
cluster and hub API servers, leading to significant latency degradation.

This change introduces a TTL-based in-memory cache for TokenReview results:
- Caches both authenticated and unauthenticated results
- Default TTL of 5 minutes, configurable via --token-review-cache-ttl flag
- Set TTL to 0 to disable caching entirely
- Token hashed (SHA-256) before use as cache key for security
- Thread-safe with automatic expired entry eviction

Signed-off-by: Zhao Xue <zxue@redhat.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Upgrade TokenReview cache from pure TTL to LRU+TTL

Add LRU eviction to cap memory usage under high cardinality token
scenarios. When the cache exceeds maxSize (default 1000), the least
recently used entry is evicted. TTL expiration is preserved as before.

Signed-off-by: Zhao Xue <zxue@redhat.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Fix code review issues: context lifecycle, cache pollution, deep copy

Address issues found during code review:

1. Goroutine leak: eviction goroutine now accepts context.Context and
   stops when context is cancelled, preventing leaks on shutdown or
   cache re-creation.

2. Cache pollution: no longer cache unauthenticated (negative) results.
   Previously an attacker could flood the cache with invalid tokens to
   evict legitimate entries via LRU. Now only authenticated results are
   cached, so the cache cannot be polluted with arbitrary tokens.

3. Deep copy on Get: return user.DeepCopy() from Get to prevent callers
   from accidentally mutating cached state.

4. Fix misleading comment in eviction loop: LRU order != TTL order.

Signed-off-by: Zhao Xue <zxue@redhat.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Fix misspell lint: cancelled -> canceled

Signed-off-by: Zhao Xue <zxue@redhat.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Increase default kube client QPS/burst for service-proxy

The default client-go rate limit (QPS=5, burst=10) causes client-side
throttling under high-concurrency workloads, resulting in request delays
of 1min+ as TokenReview calls queue up internally. Raise the defaults
to QPS=50 and burst=100, and expose --kube-api-qps / --kube-api-burst
flags for tuning.

Signed-off-by: Zhao Xue <zxue@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Replace custom TokenReview cache with official k8s.io/apiserver token cache

Switch from our custom LRU+TTL cache to the battle-tested
k8s.io/apiserver/pkg/authentication/token/cache package used by
kube-apiserver itself. Key improvements:

- singleflight: concurrent requests for the same token share one API
  call (vs our cache which only helped after the first response)
- Striped cache with HMAC-SHA256 key derivation
- Reduce default TTL from 5min to 10s (singleflight handles burst dedup)
- Log managed cluster authenticated identity for debugging parity

Signed-off-by: Zhao Xue <zxue@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Cache TokenReview failure results to eliminate singleflight cold-start spikes

Set failureTTL = successTTL, matching kube-apiserver best practice
(see k8s.io/apiserver/pkg/authentication/authenticatorfactory/delegating.go).

Without failure caching, hub tokens that always fail managed cluster
auth trigger a new singleflight group after each completion, causing
8s+ latency spikes at c=100. With failure caching, the "not authenticated"
result is cached and subsequent requests skip the API call entirely.

Signed-off-by: Zhao Xue <zxue@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

* Address CodeRabbit review: handle TokenReview Status.Error and make token reader injectable

- Return error when TokenReview Status.Error is set instead of treating
  it as unauthenticated, so managed cluster backend failures are not
  silently swallowed and don't trigger incorrect hub auth fallback.
- Make getImpersonateToken injectable (getImpersonateTokenFunc) so tests
  don't depend on the host's service-account token file mount.
- Add tests for Status.Error propagation and token reader error path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>

---------

Signed-off-by: Zhao Xue <zxue@redhat.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…io#285)

* feat: add TLS profile support to user-server

- update sdk-go in go.mod to include pkg/tls library
- add TLS profile ConfigMap watcher to user-server
- add POD_NAMESPACE env var to user-server deployment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Tesshu Flower <tflower@redhat.com>

* health probes pass tlsconfig for consistency

- health probes are using HTTP however the setup of the server
sets a default TLSConfig anyway - to reduce confusion, pass through
our custom TLSConfig rather than hardcoding a default.

Signed-off-by: Tesshu Flower <tflower@redhat.com>

* feat: add TLS profile support to service-proxy

- Add TLS ConfigMap watcher to service-proxy using sdk-go pkg/tls
- Add POD_NAMESPACE env var to service-proxy container via downward API
- Add configmap get/list/watch permissions to addon-agent role

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tesshu Flower <tflower@redhat.com>

* feat: add TLS profile support to addon-manager (proxy-server)

- Watch the ocm-tls-profile ConfigMap via StartTLSConfigMapWatcher and
pass cipher suites to the ANP proxy-server deployment. The addon-manager
restarts on ConfigMap changes, consistent with user-server and
service-proxy.
- TLS min version support for anp-server is prepared but commented out
pending upstream ANP support for --tls-min-version.
- unit test for args to anp-server - including new tlsconfig args (will
  also need to be updated once ANP has support for tls-min-version)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tesshu Flower <tflower@redhat.com>

* Update ensure to propagate TLS config changes to proxy-server deployment

The ensure function only updated resources when the ManagedProxyConfiguration
CR generation bumped, so TLS config changes from the ocm-tls-profile ConfigMap
were never applied to the proxy-server deployment. Add a TLS config hash
annotation to the deployment and trigger updates when the hash differs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tesshu Flower <tflower@redhat.com>

* address some nitpick comments

Signed-off-by: Tesshu Flower <tflower@redhat.com>

---------

Signed-off-by: Tesshu Flower <tflower@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@tesshuflower tesshuflower changed the title Sync with upstream open-cluster-management-io/cluster-proxy 🌱 Sync with upstream open-cluster-management-io/cluster-proxy Apr 27, 2026
@tesshuflower tesshuflower changed the title 🌱 Sync with upstream open-cluster-management-io/cluster-proxy 🌱 Sync with upstream open-cluster-management-io/cluster-proxy 20260427 Apr 27, 2026
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 70%)
D Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

Copy link
Copy Markdown

@mikeshng mikeshng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

/lgtm

@tesshuflower
Copy link
Copy Markdown
Author

/override sonar-pre-submit

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 28, 2026

@tesshuflower: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • sonar-pre-submit

Only the following failed contexts/checkruns were expected:

  • ci/prow/build
  • ci/prow/images
  • ci/prow/integration
  • ci/prow/sonar-pre-submit
  • ci/prow/unit-test
  • pull-ci-stolostron-cluster-proxy-backplane-2.1-build
  • pull-ci-stolostron-cluster-proxy-backplane-2.1-images
  • pull-ci-stolostron-cluster-proxy-backplane-2.1-integration
  • pull-ci-stolostron-cluster-proxy-backplane-2.1-unit-test
  • pull-ci-stolostron-cluster-proxy-main-sonar-pre-submit
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

Details

In response to this:

/override sonar-pre-submit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Copy Markdown

@mikeshng mikeshng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/override ci/prow/sonar-pre-submit

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 28, 2026

@mikeshng: Overrode contexts on behalf of mikeshng: ci/prow/sonar-pre-submit

Details

In response to this:

/override ci/prow/sonar-pre-submit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikeshng, tesshuflower

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [mikeshng,tesshuflower]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 44d015f into stolostron:main Apr 28, 2026
13 checks passed
@tesshuflower tesshuflower deleted the stolostron-sync branch April 28, 2026 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants