Skip to content

feat(interceptor): add SSE streaming callback for cold-start scaling#1592

Open
guimou wants to merge 1 commit into
kedacore:mainfrom
rh-aiservices-bu:feat/llm-callback
Open

feat(interceptor): add SSE streaming callback for cold-start scaling#1592
guimou wants to merge 1 commit into
kedacore:mainfrom
rh-aiservices-bu:feat/llm-callback

Conversation

@guimou
Copy link
Copy Markdown

@guimou guimou commented Apr 20, 2026

Summary

When serving LLM inference workloads behind KEDA HTTP Add-on, scaling from zero can take a significant amount of time (30s–several minutes) while the model loads into GPU memory. During this cold start, streaming clients (e.g. chat UIs, CLI tools using the OpenAI-compatible API) hold an open connection with no feedback, leading to:

  • User confusion: no indication that the request is being processed.
  • Premature timeouts: load balancers, proxies, or clients may drop the connection before the backend is ready.
  • Poor developer experience: no way to distinguish "waiting for scale-up" from "request lost."

This PR adds an optional SSE streaming callback to the interceptor's cold-start path. When configured and the incoming request is a streaming request ("stream": true in the JSON body), the interceptor sends OpenAI-compatible chat.completion.chunk SSE keepalive events to the client while waiting for the backend to become ready. Once the backend is available, it inserts a visual separator and proxies the real upstream response through the same SSE stream.

Configuration

  • HTTPScaledObject (v1alpha1): new coldStartStreamingCallback field with message, intervalSeconds, and keepaliveMessage.
  • InterceptorRoute (v1beta1): new streamingCallback field inside coldStart with message, interval, and keepaliveMessage.
  • The coldStart validation is relaxed from requiring fallback to requiring at least one of fallback or streamingCallback.

Example output (CLI client)

Model is waking up, hold on.........

---
Hello! How can I help you today?

Changes

  • Interceptor middleware (interceptor/middleware/endpoint_resolver.go): new serveStreamingCallback method, isStreamingRequest body parser, writeSSEEvent SSE formatter, and headerSuppressingWriter to avoid duplicate WriteHeader from the upstream proxy.
  • CRD types: ColdStartStreamingCallbackSpec (v1alpha1), StreamingCallbackSpec (v1beta1), plus generated DeepCopy methods.
  • CRD manifests: updated via controller-gen.
  • Routing table (pkg/routing/table.go): converts v1alpha1 coldStartStreamingCallback to v1beta1 streamingCallback during refreshMemory.
  • Tests: unit tests covering cold-start streaming, non-streaming requests, timeout, and unconfigured paths.

Test plan

  • make test passes (unit tests including new streaming callback tests)
  • make lint passes
  • Deploy to a cluster with an LLM workload scaled to zero, send a streaming chat completion request, and verify keepalive SSE events are received until the backend is ready
  • Verify non-streaming requests still follow the normal cold-start/fallback path
  • Verify that when streamingCallback is not configured, behavior is unchanged
  • Verify the separator and model response appear on separate lines in CLI clients

When a backend is scaling from zero and the incoming request is a
streaming request (JSON body with "stream": true), the interceptor can
now send OpenAI-compatible SSE keepalive events to keep the client
connection alive until the backend becomes ready.

This is configured via a new `coldStartStreamingCallback` field on
HTTPScaledObject (v1alpha1) and a `streamingCallback` field inside
`coldStart` on InterceptorRoute (v1beta1). The spec includes the
initial message, keepalive interval, and optional keepalive message
text. Once the backend is ready, the interceptor inserts a visual
separator and proxies the real upstream response through the same
SSE stream.

Signed-off-by: Guillaume Moutier <guimou@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 20, 2026 21:41
@guimou guimou requested a review from a team as a code owner April 20, 2026 21:41
@snyk-io
Copy link
Copy Markdown

snyk-io Bot commented Apr 20, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional Server-Sent Events (SSE) “streaming callback” during cold starts so OpenAI-style streaming clients receive periodic chat.completion.chunk keepalive events while the backend scales from zero, then continues streaming the real upstream response once ready.

Changes:

  • Add interceptor support for detecting streaming requests and emitting SSE keepalive/separator events during cold-start waits.
  • Extend CRD APIs/manifests (v1alpha1 HTTPScaledObject and v1beta1 InterceptorRoute) to configure the streaming callback and relax cold-start validation to allow either fallback or streamingCallback.
  • Convert v1alpha1 coldStartStreamingCallback into v1beta1 coldStart.streamingCallback in the routing table refresh logic, plus unit tests for the new behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
interceptor/middleware/endpoint_resolver.go Adds SSE callback flow for streaming requests during cold starts, plus request-body streaming detection and a ResponseWriter wrapper.
interceptor/middleware/endpoint_resolver_test.go Adds unit tests for streaming callback behavior and request-body detection.
operator/apis/http/v1beta1/interceptorroute_types.go Introduces StreamingCallbackSpec and relaxes coldStart validation.
operator/apis/http/v1beta1/zz_generated.deepcopy.go Generated DeepCopy updates for the new v1beta1 types.
operator/apis/http/v1alpha1/httpscaledobject_types.go Adds ColdStartStreamingCallbackSpec and a new HTTPSO spec field.
operator/apis/http/v1alpha1/zz_generated.deepcopy.go Generated DeepCopy updates for the new v1alpha1 types.
pkg/routing/table.go Converts HTTPSO coldStartStreamingCallback into IR coldStart.streamingCallback during refresh.
config/crd/bases/http.keda.sh_interceptorroutes.yaml CRD schema update for streamingCallback and relaxed coldStart validation.
config/crd/bases/http.keda.sh_httpscaledobjects.yaml CRD schema update for coldStartStreamingCallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +409 to +451
next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
nextCalled = true
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("data: {\"id\":\"real\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"hello\"}}]}\n\n"))
})

ir := defaultIR()
ir.Spec.ColdStart = &httpv1beta1.ColdStartSpec{
StreamingCallback: &httpv1beta1.StreamingCallbackSpec{
Message: "Model is waking up...",
Interval: metav1.Duration{Duration: 100 * time.Millisecond},
},
}

mw := NewEndpointResolver(next, cache, EndpointResolverConfig{
ReadinessTimeout: 5 * time.Second,
EnableColdStartHeader: true,
})

rec := httptest.NewRecorder()
req := newStreamingRequest(t, ir, `{"stream":true,"messages":[]}`)

// Mark ready after a short delay to simulate cold start.
go func() {
time.Sleep(150 * time.Millisecond)
addReadyEndpoint(cache)
}()

mw.ServeHTTP(rec, req)

if !nextCalled {
t.Fatal("expected next handler to be called after cold start")
}
if got, want := rec.Header().Get("Content-Type"), "text/event-stream"; got != want {
t.Fatalf("Content-Type = %q, want %q", got, want)
}
body := rec.Body.String()
if !strings.Contains(body, "Model is waking up...") {
t.Fatalf("response body should contain callback message, got:\n%s", body)
}
if !strings.Contains(body, `"id":"real"`) {
t.Fatalf("response body should contain upstream response, got:\n%s", body)
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests rely on real time.Sleep with tight deadlines/intervals; this can be flaky under load/CI scheduling. The file already uses testing/synctest for deterministic time—consider wrapping the streaming-callback tests in synctest.Test and advancing the fake clock instead of sleeping.

Suggested change
next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
nextCalled = true
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("data: {\"id\":\"real\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"hello\"}}]}\n\n"))
})
ir := defaultIR()
ir.Spec.ColdStart = &httpv1beta1.ColdStartSpec{
StreamingCallback: &httpv1beta1.StreamingCallbackSpec{
Message: "Model is waking up...",
Interval: metav1.Duration{Duration: 100 * time.Millisecond},
},
}
mw := NewEndpointResolver(next, cache, EndpointResolverConfig{
ReadinessTimeout: 5 * time.Second,
EnableColdStartHeader: true,
})
rec := httptest.NewRecorder()
req := newStreamingRequest(t, ir, `{"stream":true,"messages":[]}`)
// Mark ready after a short delay to simulate cold start.
go func() {
time.Sleep(150 * time.Millisecond)
addReadyEndpoint(cache)
}()
mw.ServeHTTP(rec, req)
if !nextCalled {
t.Fatal("expected next handler to be called after cold start")
}
if got, want := rec.Header().Get("Content-Type"), "text/event-stream"; got != want {
t.Fatalf("Content-Type = %q, want %q", got, want)
}
body := rec.Body.String()
if !strings.Contains(body, "Model is waking up...") {
t.Fatalf("response body should contain callback message, got:\n%s", body)
}
if !strings.Contains(body, `"id":"real"`) {
t.Fatalf("response body should contain upstream response, got:\n%s", body)
}
synctest.Test(t, func(t *testing.T) {
next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
nextCalled = true
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("data: {\"id\":\"real\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"hello\"}}]}\n\n"))
})
ir := defaultIR()
ir.Spec.ColdStart = &httpv1beta1.ColdStartSpec{
StreamingCallback: &httpv1beta1.StreamingCallbackSpec{
Message: "Model is waking up...",
Interval: metav1.Duration{Duration: 100 * time.Millisecond},
},
}
mw := NewEndpointResolver(next, cache, EndpointResolverConfig{
ReadinessTimeout: 5 * time.Second,
EnableColdStartHeader: true,
})
rec := httptest.NewRecorder()
req := newStreamingRequest(t, ir, `{"stream":true,"messages":[]}`)
// Mark ready after a short delay to simulate cold start.
go func() {
time.Sleep(150 * time.Millisecond)
addReadyEndpoint(cache)
}()
mw.ServeHTTP(rec, req)
if !nextCalled {
t.Fatal("expected next handler to be called after cold start")
}
if got, want := rec.Header().Get("Content-Type"), "text/event-stream"; got != want {
t.Fatalf("Content-Type = %q, want %q", got, want)
}
body := rec.Body.String()
if !strings.Contains(body, "Model is waking up...") {
t.Fatalf("response body should contain callback message, got:\n%s", body)
}
if !strings.Contains(body, `"id":"real"`) {
t.Fatalf("response body should contain upstream response, got:\n%s", body)
}
})

Copilot uses AI. Check for mistakes.
// ColdStartStreamingCallbackSpec configures SSE streaming callback messages
// sent to clients during cold starts to keep the connection alive.
type ColdStartStreamingCallbackSpec struct {
// The message to send as SSE event content during cold start
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColdStartStreamingCallbackSpec.Message is required but has no MinLength validation, so an empty string is still accepted and results in empty SSE content. Consider adding +kubebuilder:validation:MinLength=1 (matching v1beta1 StreamingCallbackSpec) to prevent misconfiguration.

Suggested change
// The message to send as SSE event content during cold start
// The message to send as SSE event content during cold start
// +kubebuilder:validation:MinLength=1

Copilot uses AI. Check for mistakes.
}

if er.cfg.EnableColdStartHeader {
w.Header().Set(kedahttp.HeaderColdStart, strconv.FormatBool(isColdStart))
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the streaming-callback path the response is already committed with WriteHeader(200), so setting X-KEDA-HTTP-Cold-Start here will not reach real clients (only httptest.ResponseRecorder). If this header is intended for clients, set it before WriteHeader (e.g., set it to "true" up-front for this path) or expose the value via an SSE event/trailer instead.

Suggested change
w.Header().Set(kedahttp.HeaderColdStart, strconv.FormatBool(isColdStart))
// The streaming callback response has already been committed, so a
// normal response header set here would not reach real clients.
// Expose the cold-start value through the SSE stream instead.
if err := writeSSEEvent(w, fmt.Sprintf("%s=%s", kedahttp.HeaderColdStart, strconv.FormatBool(isColdStart))); err != nil {
logger.Error(err, "failed to write cold start streaming callback event")
} else {
_ = rc.Flush()
}

Copilot uses AI. Check for mistakes.
Comment on lines +266 to +270
func (w *headerSuppressingWriter) WriteHeader(code int) {
if w.headerWritten {
return
}
w.headerWritten = true
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

headerSuppressingWriter currently drops all subsequent WriteHeader calls, which means any non-200 status from the upstream proxy (4xx/5xx) will be masked as 200 once SSE headers are committed. Consider capturing the upstream status and translating it into an SSE error/[DONE] (and aborting proxying), or restrict suppression to redundant 200-only calls and log/emit an SSE error for other codes.

Copilot uses AI. Check for mistakes.
Comment on lines +229 to +233
body, err := io.ReadAll(r.Body)
r.Body = io.NopCloser(bytes.NewReader(body))
if err != nil {
return false, err
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isStreamingRequest buffers the entire request body via io.ReadAll, which can be very large for chat/completions and changes request handling from streaming to full-buffering (memory/latency impact) even though you only need to detect the "stream" flag. Consider bounding the read (e.g., MaxBytesReader / Content-Length guard) and/or using a streaming JSON tokenizer/tee approach so large requests don’t require full buffering just to detect streaming.

Copilot uses AI. Check for mistakes.
@linkvt
Copy link
Copy Markdown
Member

linkvt commented Apr 21, 2026

Hi @guimou , thanks for the PR!
This is a very specific request with a very specific implementation, we have to consider how that fits into the HTTP Add-on.
There is a related topic of placeholder responses that I started working on a few days ago based on #1303 , I will have to check if that would fit in there. It might also be a usecase for a "placeholderService" where a user-defined service starts with the implementation. I don't have enough time to dig into that topic this week and am on vacation next week though, just a headsup.

@guimou
Copy link
Copy Markdown
Author

guimou commented Apr 21, 2026

Hey @linkvt !
No worries, and no hurries. You're right, it's very specific, especially the implementation with a continuous SSE response. And it would definitely fit within #1303 scope. At least we can consider this like a PoC that proved to work, it's used in this demo already, something that will be part of Red Hat Summit's presentation.
I'll let you think about it, and happy to help in any ways!

@linkvt linkvt added the on hold label Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants