
feat(controller): emit events for MCP handshake failures #186

Merged: k8s-ci-robot merged 7 commits into kubernetes-sigs:main from ibm-adarsh:emit-events-handshake on May 21, 2026

Conversation

@ibm-adarsh
Contributor

Summary

Part of #109 (Kubernetes event coverage).

Adds two Warning events for the MCP handshake lifecycle:

| Event | Type | Reason | When |
|---|---|---|---|
| MCP handshake failed | Warning | MCPEndpointUnavailable | Handshake fails while deployment is healthy |
| MCP handshake retries exhausted | Warning | MCPEndpointUnavailable | handshakeRetryCount reaches maxMCPHandshakeRetries (10) and the controller stops requeuing |

Builds on #118 (configuration events) and #184 (server ready event).

Changes

  • Emit MCPHandshakeFailed with the Ready condition message; the event is deduplicated when reason and message are unchanged (same pattern as the configuration-invalid event in #118, "Wire EventRecorder into MCPServer reconciler").
  • Emit MCPHandshakeRetriesExhausted once when retry count crosses the max threshold.
  • Add duplicateHandshakeUnavailable() helper and unit/envtest coverage.
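
The dedupe rule described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the `condition` type is a hypothetical stand-in for `metav1.Condition`, and the helper's signature is an assumption, since the description only names `duplicateHandshakeUnavailable()`.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down stand-in for metav1.Condition.
type condition struct {
	Type    string
	Reason  string
	Message string
}

// duplicateHandshakeUnavailable reports whether the new Ready condition
// repeats the previous one (same reason and message). When it does, the
// caller would skip emitting another Warning event.
// The signature is assumed for illustration only.
func duplicateHandshakeUnavailable(prev *condition, next condition) bool {
	if prev == nil {
		// No previous condition recorded, so this cannot be a duplicate.
		return false
	}
	return prev.Reason == next.Reason && prev.Message == next.Message
}

func main() {
	prev := &condition{Type: "Ready", Reason: "MCPEndpointUnavailable", Message: "connection refused"}

	same := *prev
	changed := *prev
	changed.Message = "session not found"

	fmt.Println(duplicateHandshakeUnavailable(prev, same))    // true
	fmt.Println(duplicateHandshakeUnavailable(prev, changed)) // false
	fmt.Println(duplicateHandshakeUnavailable(nil, same))     // false
}
```

Comparing only reason and message means a changed error text produces a fresh event, while repeated reconciles with an identical failure stay quiet.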

Test plan

  • go test ./internal/controller/... — all specs pass
  • Kind: wrong MCP path → handshake failed Warning(s); one retries-exhausted Warning after 10 attempts
  • Kind: valid MCP server → Valid + Available events; no retries-exhausted event
  • Re-reconcile does not emit duplicate exhausted event

Manual verification

[Screenshots: two images]

Emit Warning events when the MCP handshake fails (deduplicated by
condition message) and when handshake retries are exhausted. Part of kubernetes-sigs#109.
@netlify

netlify Bot commented May 18, 2026

Deploy Preview for mcp-lifecycle-operator ready!

Name Link
🔨 Latest commit 627be9c
🔍 Latest deploy log https://app.netlify.com/projects/mcp-lifecycle-operator/deploys/6a0f177ead5f740008e084ca
😎 Deploy Preview https://deploy-preview-186--mcp-lifecycle-operator.netlify.app

@k8s-ci-robot k8s-ci-robot requested review from matzew and soltysh May 18, 2026 13:47
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 18, 2026
@k8s-ci-robot
Contributor

Hi @ibm-adarsh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 18, 2026
Extract reconcileHandshakeEventsAndRetryCount to keep Reconcile under
complexity 30; use range over int in handshake test.
@ibm-adarsh
Contributor Author

ibm-adarsh commented May 18, 2026

Notes (expected behavior)

  • Two “handshake failed” warnings can appear if the error message changes (e.g. connection refused → session not found). This matches the invalid-config dedupe pattern and is not a bug.
  • handshakeRetryCount in status can grow past 10 on extra reconciles; the exhausted event still fires only once when crossing 10.
  • Happy path (valid MCP): you get Valid + Available, no exhausted event.
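
The "fires only once when crossing 10" behavior in these notes can be sketched as a threshold-crossing check. The function name `shouldEmitRetriesExhausted` and the `int32` counter type are assumptions for illustration; only `maxMCPHandshakeRetries` and its value of 10 come from the PR description.

```go
package main

import "fmt"

// maxMCPHandshakeRetries mirrors the limit stated in the PR description.
const maxMCPHandshakeRetries = 10

// shouldEmitRetriesExhausted returns true only on the reconcile where the
// retry count moves from below the limit to the limit or beyond. Later
// reconciles, where the count keeps growing, return false.
// The name and signature are hypothetical.
func shouldEmitRetriesExhausted(prevCount, newCount int32) bool {
	return prevCount < maxMCPHandshakeRetries && newCount >= maxMCPHandshakeRetries
}

func main() {
	var count int32
	emitted := 0

	// Simulate 15 failed reconciles; the count grows past the limit.
	for i := 0; i < 15; i++ {
		next := count + 1
		if shouldEmitRetriesExhausted(count, next) {
			emitted++
		}
		count = next
	}

	fmt.Println(count, emitted) // 15 1
}
```

Checking the transition rather than the absolute value is what lets the status counter exceed 10 without producing a second exhausted event.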

@ibm-adarsh
Contributor Author

Hi @matzew, could I get the ok-to-test label and any suggestions here, please?

if r.Recorder == nil {
	return
}
r.Recorder.Eventf(mcpServer, nil, corev1.EventTypeWarning, ReasonMCPEndpointUnavailable, eventActionMCPHandshakeFailed, "%s", message)
Member

emitMCPHandshakeRetriesExhausted includes the name of the MCP server in its message; perhaps we can do the same here for consistency?

Contributor Author

Thank you @matzew for pointing it out.
Done in 5fca5f7 — emitMCPHandshakeFailed now includes the MCPServer name. Updated configuration accepted/invalid messages in the same PR for consistency across all emitted events.
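
As a rough sketch of what including the MCPServer name in the event note could look like: the `eventRecorder` interface below is a simplified, hypothetical stand-in for the real Kubernetes recorder (it drops the object arguments), and the action string and message format are assumptions, since the thread does not show the final wording.

```go
package main

import "fmt"

// eventRecorder is a hypothetical, simplified recorder interface used only
// for this sketch; the real recorder also takes the regarding/related objects.
type eventRecorder interface {
	Eventf(eventType, reason, action, note string, args ...any)
}

// captureRecorder stores formatted events in memory so they can be inspected.
type captureRecorder struct {
	lines []string
}

func (c *captureRecorder) Eventf(eventType, reason, action, note string, args ...any) {
	prefix := fmt.Sprintf("%s %s %s: ", eventType, reason, action)
	c.lines = append(c.lines, prefix+fmt.Sprintf(note, args...))
}

// emitHandshakeFailed includes the server name in the note so the message
// reads consistently with the other emitted events. The message format and
// the action string are assumed, not taken from the PR.
func emitHandshakeFailed(rec eventRecorder, serverName, message string) {
	if rec == nil {
		return
	}
	rec.Eventf("Warning", "MCPEndpointUnavailable", "MCPHandshakeFailed",
		"MCPServer %q: %s", serverName, message)
}

func main() {
	rec := &captureRecorder{}
	emitHandshakeFailed(rec, "demo", "connection refused")
	fmt.Println(rec.lines[0])
}
```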

Address review feedback on kubernetes-sigs#186 and align event messages with
emitServerReady and emitMCPHandshakeRetriesExhausted.
return
}
r.Recorder.Eventf(mcpServer, nil, corev1.EventTypeNormal, ReasonValid, eventActionConfigurationAccepted, "%s", "MCPServer configuration is valid; Accepted=True")
r.Recorder.Eventf(mcpServer, nil, corev1.EventTypeNormal, ReasonValid, eventActionConfigurationAccepted,
Contributor Author

This keeps the messages consistent across all the events we emit.

@Cali0707
Contributor

@ibm-adarsh I think my last PR caused conflicts here (splitting up the mcpserver_controller.go file) 😓

Would you mind fixing those and moving your handshake logic into the new `mcpserver_controller_handshake.go` file?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 20, 2026
Resolve conflict with kubernetes-sigs#190 by keeping the refactored controller layout
and restoring MCP handshake failed/retries-exhausted events plus
MCPServer name in configuration event messages.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026
@aliok aliok requested a review from Copilot May 21, 2026 06:56
Copilot AI left a comment

Pull request overview

This PR extends the MCPServer controller’s Kubernetes event coverage by emitting Warning events when the MCP handshake fails and when handshake retries are exhausted, with deduplication to avoid event spam.

Changes:

  • Add Warning events for MCP handshake failures and for “retries exhausted” when the max retry threshold is reached.
  • Refactor handshake retry counting + event emission into a helper (reconcileHandshakeEventsAndRetryCount) and add a dedupe helper (duplicateHandshakeUnavailable).
  • Update and extend unit/envtest coverage around handshake event emission and include MCPServer name in configuration accepted event assertions.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| internal/controller/mcpserver_controller.go | Adds handshake-related event actions and emitters; factors retry/event logic into a helper call; adjusts configuration event messages. |
| internal/controller/mcpserver_controller_test.go | Updates assertions to reflect the updated event message content (includes MCPServer name). |
| internal/controller/mcpserver_controller_handshake.go | Implements reconcileHandshakeEventsAndRetryCount to emit handshake events and compute retry count. |
| internal/controller/mcpserver_controller_handshake_test.go | Adds tests for handshake-failed event deduping and retries-exhausted event emission. |
| internal/controller/mcpserver_controller_conditions.go | Adds duplicateHandshakeUnavailable helper for deduping handshake-failed events. |
| internal/controller/mcpserver_controller_conditions_test.go | Adds unit test coverage for duplicateHandshakeUnavailable. |


Comment thread internal/controller/mcpserver_controller_handshake_test.go
@ibm-adarsh
Contributor Author

> @ibm-adarsh I think my last PR caused conflicts here (splitting up the mcpserver_controller.go file) 😓
>
> Would you mind fixing those and moving your handshake logic into the new `mcpserver_controller_handshake.go` file?

Done!

Extend the MCPHandshakeFailed dedupe test to cover both no-duplicate
when the error is unchanged and a new event when the message changes,
matching the test name and Copilot review feedback on kubernetes-sigs#186.
@ibm-adarsh
Contributor Author

cc: @matzew @aliok

	case ev := <-fr.Events:
		collected = append(collected, ev)
	default:
		goto check
Member

To get rid of this goto, we can use something like this:

    func drainEvents(ch <-chan string) []string {
        var events []string
        for {
            select {
            case ev := <-ch:
                events = append(events, ev)
            default:
                return events
            }
        }
    }

Then the test becomes:

    Eventually(func(g Gomega) {
        collected = drainEvents(fr.Events)
        exhausted := 0
        for _, ev := range collected {
            if strings.Contains(ev, "retries exhausted") {
                exhausted++
            }
        }
        g.Expect(exhausted).To(Equal(1))
    }).Should(Succeed())

The alternative is a labeled break, which is also not preferred.
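
For comparison, here is a small runnable sketch that puts the helper approach next to the labeled-break alternative mentioned above. The buffered channel stands in for the fake recorder's event channel, and the event strings and the `countContaining` helper are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// drainEvents returns every event currently buffered in ch without blocking.
// It assumes ch is open; on a closed channel the receive case would keep
// firing with zero values and the loop would never end.
func drainEvents(ch <-chan string) []string {
	var events []string
	for {
		select {
		case ev := <-ch:
			events = append(events, ev)
		default:
			return events
		}
	}
}

// drainWithLabeledBreak does the same work inline using a labeled break,
// the alternative mentioned in the review.
func drainWithLabeledBreak(ch <-chan string) []string {
	var events []string
loop:
	for {
		select {
		case ev := <-ch:
			events = append(events, ev)
		default:
			break loop
		}
	}
	return events
}

// countContaining counts events whose text contains substr.
func countContaining(events []string, substr string) int {
	n := 0
	for _, ev := range events {
		if strings.Contains(ev, substr) {
			n++
		}
	}
	return n
}

func main() {
	ch := make(chan string, 4)
	ch <- "Warning MCPEndpointUnavailable MCP handshake failed"
	ch <- "Warning MCPEndpointUnavailable MCP handshake retries exhausted"

	got := drainEvents(ch)
	fmt.Println(len(got), countContaining(got, "retries exhausted")) // 2 1

	// The channel is now empty, so a second drain returns nothing.
	fmt.Println(len(drainWithLabeledBreak(ch))) // 0
}
```

Both variants behave the same; the helper keeps the control flow out of the test body, which is the main readability argument for it.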

Contributor Author

DONE!

@aliok
Member

aliok commented May 21, 2026

@ibm-adarsh I have a single comment above, which I think is blocking for this PR.
Otherwise LGTM

Address aliok review on kubernetes-sigs#186 by draining fake recorder events via a
shared helper instead of goto in the retries-exhausted assertion.
Rename local slice to avoid shadowing k8s.io/client-go/tools/events.
@ibm-adarsh
Contributor Author

> @ibm-adarsh I have a single comment above, which I think is blocking for this PR. Otherwise LGTM

Done. Please take a look when you have some time.

Member

@matzew matzew left a comment

/lgtm
/approve

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ibm-adarsh, matzew


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2026
@k8s-ci-robot k8s-ci-robot merged commit 629fe37 into kubernetes-sigs:main May 21, 2026
11 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


6 participants