backend: k8cache: Fix runWatcher goroutines leak when clusters are removed #5585

Open
Joshna907 wants to merge 2 commits into kubernetes-sigs:main from Joshna907:fix/backend-leak-goroutines

Conversation

@Joshna907
Contributor

Summary: This PR fixes a resource leak in the backend: the runWatcher goroutines that watch Kubernetes events were never stopped when clusters were removed from the configuration. The ContextStore now notifies registered listeners whenever the configuration changes, and a listener runs a synchronization function that cancels the contexts of watchers whose clusters are no longer configured. This stops those goroutines from running indefinitely and releases their network connections.

Fixes: #5415

Changes:

backend/pkg/k8cache/cacheInvalidation.go:

Added a SyncWatchers function that compares the active cluster list against the contextCancel registry.
Implemented logic to selectively invoke cancel() functions for context keys that are no longer present in the active configuration.

backend/pkg/kubeconfig/contextStore.go:

Implemented an observer pattern (AddListener and notifyListeners) for the ContextStore to emit events upon configuration changes.
Updated AddContext, RemoveContext, and AddContextWithKeyAndTTL to safely notify listeners using a mutex without introducing circular dependencies.
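
The listener mechanism could look roughly like the following sketch. It is a simplified assumption, not the PR's implementation: the `store` type and its lowercase methods are hypothetical, notifications here are delivered synchronously, and listeners are called after the lock is released so a listener can safely call back into the store.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// store is a hypothetical, much-reduced stand-in for ContextStore.
type store struct {
	mu        sync.RWMutex
	contexts  map[string]struct{}
	listeners []func(activeKeys []string)
}

func newStore() *store {
	return &store{contexts: make(map[string]struct{})}
}

// addListener registers fn; nil listeners are ignored.
func (s *store) addListener(fn func(activeKeys []string)) {
	if fn == nil {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.listeners = append(s.listeners, fn)
}

// snapshot copies the keys and listeners while holding the read lock.
func (s *store) snapshot() ([]string, []func([]string)) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	keys := make([]string, 0, len(s.contexts))
	for k := range s.contexts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	listeners := append([]func([]string){}, s.listeners...)
	return keys, listeners
}

// notifyListeners calls every listener outside the lock, each with its own
// copy of the active keys.
func (s *store) notifyListeners() {
	keys, listeners := s.snapshot()
	for _, fn := range listeners {
		fn(append([]string(nil), keys...))
	}
}

func (s *store) addContext(key string) {
	s.mu.Lock()
	s.contexts[key] = struct{}{}
	s.mu.Unlock()
	s.notifyListeners()
}

func (s *store) removeContext(key string) {
	s.mu.Lock()
	delete(s.contexts, key)
	s.mu.Unlock()
	s.notifyListeners()
}

func main() {
	s := newStore()
	s.addListener(func(keys []string) { fmt.Println("active:", keys) })
	s.addContext("cluster-a")
	s.addContext("cluster-b")
	s.removeContext("cluster-a")
}
```

Because the store only knows about plain callback functions, the kubeconfig package does not need to import k8cache, which is one way to avoid the circular dependency mentioned above.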

backend/cmd/server.go:

Wired k8cache.SyncWatchers to the kubeConfigStore listener during initialization, ensuring the cleanup of orphaned goroutines is triggered exactly when cluster configurations change.

backend/pkg/k8cache/authorization_test.go:

Removed unused sync and sync/atomic imports to maintain clean builds and test parity.

Steps to Test:

  1. Navigate to the backend directory.
  2. Run the tests for the modified packages: go test -v ./pkg/kubeconfig/... ./pkg/k8cache/...
  3. Verify that the server still builds correctly: go build ./...
  4. (Optional) Start the Headlamp backend with multiple clusters, remove or rename a cluster in the configuration, and observe pprof to verify that the corresponding watcher goroutine and network connections are cleanly shut down.

Validation Results:

  1. Unit Tests: ok for all backend packages (Tests passed successfully).
  2. Build: Success.
  3. Lint: Code compiled without unused imports or linting errors.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Joshna907
Once this PR has been reviewed and has the lgtm label, please assign yolossn for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2026
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from 8cb1298 to 76309bf Compare May 12, 2026 12:11
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2026
@illume illume requested a review from Copilot May 12, 2026 13:52
Contributor

@illume illume left a comment

Thanks for this PR.

A few of the commits don't quite follow the project guidelines. We use Linux kernel style for git commits — have a look at the contributing guide and previous commits with git log.

Commits that need attention
  • fix(backend): clean up watcher goroutines when clusters are removed — Missing area: description prefix — e.g. frontend: HomeButton: Fix so it navigates to home or backend: config: Add enable-dynamic-clusters flag.
Commit guidelines
  • Use atomic commits focused on a single change.
  • Use the title format <area>: <Description of changes> — description must start with a capital letter.
  • Keep the title under 72 characters (soft requirement).
  • Explain the intention and why the change is needed.
  • Make commit titles meaningful and describe what changed.
  • Do not add code that a later commit rewrites; squash or reorder commits instead.
  • Do not include Fixes #NN in commit messages.

Good examples:

  • frontend: HomeButton: Fix so it navigates to home
  • backend: config: Add enable-dynamic-clusters flag
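
As a rough illustration of the title rules above, the following sketch checks a commit title locally. It is not a project tool: `checkTitle` is a hypothetical helper, and it only covers the `<area>: <Description>` prefix, the capital first letter of the description, and the soft length limit.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// checkTitle returns the problems found in a commit title. The description is
// taken to be everything after the last ": " separator.
func checkTitle(title string) []string {
	var problems []string

	idx := strings.LastIndex(title, ": ")
	if idx <= 0 {
		return append(problems, "missing '<area>: <Description>' prefix")
	}

	desc := title[idx+2:]
	first, _ := utf8.DecodeRuneInString(desc)
	if desc == "" || !unicode.IsUpper(first) {
		problems = append(problems, "description must start with a capital letter")
	}

	// The length rule is a soft requirement, so callers may treat it as a warning.
	if utf8.RuneCountInString(title) >= 72 {
		problems = append(problems, "title should be under 72 characters")
	}
	return problems
}

func main() {
	titles := []string{
		"frontend: HomeButton: Fix so it navigates to home",
		"fix(backend): clean up watcher goroutines when clusters are removed",
	}
	for _, t := range titles {
		fmt.Printf("%q -> %v\n", t, checkTitle(t))
	}
}
```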

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to prevent backend runWatcher goroutine/network-connection leaks by canceling watcher contexts when clusters (contexts) are removed from the active kubeconfig configuration, using a new ContextStore change-listener hook to trigger a watcher sync.

Changes:

  • Added a listener mechanism to ContextStore to notify on context add/remove/set-with-ttl.
  • Introduced k8cache.SyncWatchers(activeContexts) to cancel watchers whose context keys are no longer active.
  • Wired SyncWatchers to run on ContextStore changes during backend initialization.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| backend/pkg/kubeconfig/contextStore.go | Adds listener registration + async notifications on context mutations. |
| backend/pkg/k8cache/cacheInvalidation.go | Adds SyncWatchers to cancel orphaned watcher contexts. |
| backend/cmd/server.go | Registers a ContextStore listener to call SyncWatchers when contexts change. |

Comment thread backend/cmd/server.go
Comment thread backend/cmd/server.go Outdated
Comment thread backend/pkg/kubeconfig/contextStore.go Outdated
Comment thread backend/pkg/kubeconfig/contextStore.go
Comment thread backend/pkg/k8cache/cacheInvalidation.go
Comment thread backend/pkg/k8cache/cacheInvalidation.go
Comment thread backend/pkg/kubeconfig/contextStore.go Outdated
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from 86c2e0f to 1537a66 Compare May 12, 2026 14:01
@illume illume requested a review from Copilot May 12, 2026 15:29
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comment thread backend/cmd/server.go Outdated
Comment thread backend/cmd/server.go Outdated
Comment thread backend/pkg/kubeconfig/contextStore.go Outdated
Comment thread backend/pkg/kubeconfig/contextStore.go
Comment thread backend/pkg/k8cache/cacheInvalidation.go
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from 1537a66 to 9a1b8d1 Compare May 12, 2026 16:21
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 12, 2026
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from cdacc74 to d517f57 Compare May 12, 2026 17:05
@Joshna907
Contributor Author

ptal now @illume; from here on Copilot will mostly open nitpicking issues

@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from d517f57 to 82b28fb Compare May 12, 2026 17:13
@illume illume requested a review from Copilot May 12, 2026 18:22
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread backend/pkg/k8cache/concurrency_test.go Outdated
Comment thread backend/pkg/k8cache/cacheInvalidation.go
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from 82b28fb to 701f085 Compare May 13, 2026 00:53
@Joshna907
Contributor Author

ptal @illume

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment thread backend/pkg/kubeconfig/contextStore.go
Comment thread backend/pkg/kubeconfig/contextStore.go
Comment thread backend/cmd/server.go Outdated
Comment thread backend/pkg/k8cache/cacheInvalidation.go
Comment thread backend/pkg/k8cache/concurrency_test.go Outdated
@Joshna907
Contributor Author

ptal @illume

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comment thread backend/pkg/cache/cache_internal_test.go Outdated
Comment thread backend/pkg/k8cache/authorization.go
Comment thread frontend/src/components/node/Details.tsx
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from d7bdea9 to 79a9f07 Compare May 17, 2026 12:13
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 17, 2026
@Joshna907
Contributor Author

ptal @illume

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread backend/pkg/cache/cache_internal_test.go
Comment thread backend/pkg/cache/cache.go Outdated
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch 2 times, most recently from 1a04e99 to b4ee6df Compare May 18, 2026 06:20
@Joshna907
Contributor Author

ptal @illume

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment thread backend/pkg/kubeconfig/contextStore.go
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from b4ee6df to b58d7da Compare May 18, 2026 07:23
@Joshna907
Contributor Author

ptal @illume

@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from b58d7da to dd1ba68 Compare May 18, 2026 08:12
@illume illume requested a review from Copilot May 18, 2026 16:03
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Comment thread backend/pkg/k8cache/cacheInvalidation.go
Comment thread backend/pkg/kubeconfig/contextStore.go
…moved

This commit addresses a critical resource leak in the backend where
runWatcher goroutines responsible for watching Kubernetes events were
never stopped when clusters were removed from the configuration.

By implementing an observable listener pattern in the ContextStore,
the backend now triggers a synchronization function to cancel invalid
watcher contexts, preventing indefinite goroutine persistence and
network connection leaks.

Key Changes:
1. ContextStore Change Notifications:
   - Added listener registration (AddListener) and change notifications
     to kubeconfig.ContextStore.
   - Triggered notification upon context addition, removal, and TTL eviction.
   - Handled nil listeners safely during registration.

2. Watcher Synchronization:
   - Added SyncWatchers function to backend/pkg/k8cache/cacheInvalidation.go
     which compares active contexts against current watchers and invokes the
     corresponding cancel() function for removed contexts.
   - Wired SyncWatchers to run automatically when the ContextStore changes.
   - Redacted user-specific details from logged messages when canceling
     watchers to protect sensitive/PII data.

3. Robust Cache Cleanup & Panic Protection:
   - Wrapped TTL cache eviction callbacks with recover in backend/pkg/cache/cache.go
     to protect the janitor goroutine from unexpected callback panics.

4. Test Coverage:
   - Added comprehensive unit tests in pkg/kubeconfig/contextStore_test.go
     to verify event listener triggers and TTL eviction notifications.
   - Added panic recovery unit tests in pkg/cache/cache_internal_test.go.
   - Fixed potential data race in TestCacheEvictionPanicRecovery by
     replacing the un-synchronized panicked boolean flag with a clean
     channel-based synchronization mechanism.
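
The panic protection and the channel-based synchronization described in items 3 and 4 can be sketched together as follows. This is an assumption-laden illustration, not the code in backend/pkg/cache/cache.go: `safeEvict` is a hypothetical wrapper, and the anonymous goroutine stands in for the janitor.

```go
package main

import "fmt"

// safeEvict runs an eviction callback and converts a panic into an error, so
// the calling goroutine (for example a cache janitor) keeps running.
func safeEvict(key string, onEvict func(key string)) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("eviction callback for %q panicked: %v", key, r)
		}
	}()
	onEvict(key)
	return nil
}

func main() {
	// A buffered channel carries the result back to the caller, which avoids
	// sharing an unsynchronized boolean flag between goroutines.
	result := make(chan error, 1)

	go func() {
		result <- safeEvict("cluster-a", func(string) { panic("boom") })
	}()

	fmt.Println(<-result)
}
```

Receiving from the channel both waits for the goroutine and establishes the happens-before relationship that a plain shared flag would lack.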
@Joshna907 Joshna907 force-pushed the fix/backend-leak-goroutines branch from dd1ba68 to d384bb3 Compare May 19, 2026 03:44
@Joshna907
Contributor Author

ptal @illume

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Comment thread backend/pkg/cache/cache.go
Comment thread backend/pkg/k8cache/cacheStore.go
Contributor

@illume illume left a comment

Thanks for working on this.

Would you mind addressing the open Copilot review comments? Please mark each comment as resolved after addressing it.

@Joshna907
Contributor Author

ptal @illume

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants