feat(interceptor): Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow.#1585
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
There was a problem hiding this comment.
Pull request overview
Adds an opt-in “direct-to-pod” routing mode during cold starts to avoid initial request failures caused by delayed kube-proxy/ClusterIP rule sync, and reworks the ready-endpoints cache to retain richer EndpointSlice-derived state (ready pod IPs + port name resolution).
Changes:
- Introduce
KEDA_HTTP_DIRECT_POD_ON_COLD_STARTand plumb it through the interceptor handler chain. - Rework
ReadyEndpointsCacheto store an immutable per-service snapshot (ready IPs +portName → portmap) and addPickReadyEndpoint. - Add/update unit tests for port-name resolution and cold-start direct-to-pod URL rewriting.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/k8s/ready_endpoints_cache.go |
Stores per-service snapshot state and adds PickReadyEndpoint selection logic. |
pkg/k8s/ready_endpoints_cache_test.go |
Expands tests to cover endpoint picking and port extraction/dedup behavior. |
interceptor/middleware/endpoint_resolver.go |
Rewrites upstream URL to pod IP:port on cold start when enabled. |
interceptor/middleware/endpoint_resolver_test.go |
Adds middleware tests validating direct-to-pod behavior and port-name selection. |
interceptor/proxy.go |
Wires the new config flag into the middleware config. |
interceptor/config/serving.go |
Adds the new env-var-backed serving configuration flag. |
Comments suppressed due to low confidence (1)
pkg/k8s/ready_endpoints_cache.go:40
- The notifyCh comment says it is closed "on any change", but Update now only calls broadcast() when readiness transitions (or when slices are deleted). Updating the comment to match the actual broadcast semantics would prevent future readers from assuming every EndpointSlice update will wake waiters.
// Broadcast mechanism: the channel is closed on any change,
// then replaced with a fresh one. Waiters select on the channel.
mu sync.Mutex
notifyCh chan struct{}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
07a578c to
fa6696a
Compare
| // TLSUpstreamDisabled disables TLS for upstream connections even when the proxy server | ||
| // itself has TLS enabled. When true, requests are forwarded to upstream pods over plain | ||
| // HTTP regardless of the proxy TLS setting. | ||
| TLSUpstreamDisabled bool `env:"KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED" envDefault:"false"` |
There was a problem hiding this comment.
Continuing here from our chat in Slack:
As far as I understand whether TLS is used is not relevant, we just change the target the traffic is sent to. When TLS is used we can if I understand it correctly set the tlsConfig.ServerName to the service name while still sending our request to the Endpoint IP, see https://pkg.go.dev/crypto/tls#Config
// ServerName is used to verify the hostname on the returned
// certificates unless InsecureSkipVerify is given. It is also included
// in the client's handshake to support virtual hosting unless it is
// an IP address.
ServerName string
There was a problem hiding this comment.
@linkvt
Correct, we can use ServerName (it will be the SNI) which can make the direct to pod IP TLS handshake successful.
I have a few implementations in mind. Can you help me figure out which one I should proceed with?
A. The base transport is built at startup, and the upstream handler uses it to seed a pool keyed by responseHeaderTimeout. Should we extend that same pool to key by (timeout, serverName) combinations? This increases pool cardinality, and since serverName-aware transports are only needed during cold start — is that trade-off worth it?
B. Rather than extending the existing pool, have a dedicated cold start transport pool keyed by (timeout, serverName). This keeps the hot path unchanged and scopes the cardinality growth to cold start scenarios only.
C. Use a temporary, non-pooled transport for cold start requests — created on demand, not reused. This leaves the existing pool entirely untouched, but introduces transport selection logic in the upstream handler to switch between pooled (normal) and ephemeral (cold start) transports.
There was a problem hiding this comment.
Thanks for looking into this, I didn't consider that it would be more complicated 😅
So I think that means we need to reconsider the current TransportPool implementation, A sounds IMO good, we also don't need it just for the cold-start but for all connections I think?
The direct pod routing is useful for cold-starts where syncing the Services ClusterIP might take some time, but can also be used later as well for non cold-start requests.
Pool size is something we could consider now or later in a follow-up PR, the pool will definitely be larger but it's just a few KBs anyway IIRC.
There was a problem hiding this comment.
we also don't need it just for the cold-start but for all connections I think?
For this PR, IMO we should only optimize direct-to-pod for cold start request.
But once we have seperate PR which detect readiness based by probing, like I mentioned in this issue comment then we can consider this optimization for all requests.
Additionally, i will also drop this KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED flag. I think we also need a PR that give ability to configure the Upstream TLS config . Because currently we use the same TLS config for upstream. Having this can also give user ability for mTLs with upstream pods.
There was a problem hiding this comment.
Couldn't we just use/modify the current
isColdStart, err := er.readyCache.WaitForReady(waitCtx, serviceKey)
to also return an endpoint IP?
We would not do any extra work at all then and could use the endpoint IP in both cases?
There was a problem hiding this comment.
Using WaitForReady makes sense only when we want to also do direct-to-pod for non cold start request as well.
@linkvt Are you proposing that we do direct-to-pod for all request in this PR itself ?
There was a problem hiding this comment.
Yes I thought that would be easier, no special handling for cold start, always take the endpoint IP from our readyCache. Add a flag to opt-out (nobody does opt-in usually) to fallback to the non endpoint upstream.
WaitForReady is IIRC called in both cold and non coldstart scenarios and would be an ideal place to change?
also wdyt @Fedosin ?
Fedosin
left a comment
There was a problem hiding this comment.
Review
Design-Level Feedback
1. KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED should be dropped (agree with @linkvt)
This flag globally disables TLS to upstream for all requests, not just cold-start direct-pod ones. That's a blunt instrument with security implications. The proper solution is to set tlsConfig.ServerName to the original service DNS name when dialing a pod IP directly. That way TLS works correctly without needing a kill-switch. I'd suggest either:
- Handling TLS via
ServerNamein this PR, or - Deferring TLS+direct-pod to a follow-up and only enabling direct-pod for non-TLS setups here.
2. Scope: cold-start-only vs. all requests
@linkvt asked for my opinion. I think starting with cold-start-only is the right incremental approach: it has a clear problem statement (kube-proxy lag) and limits blast radius. However, the API surface should be designed with the eventual "all requests" path in mind. Consider whether the flag name KEDA_HTTP_DIRECT_POD_ON_COLD_START locks us in, or if a more general name (e.g. KEDA_HTTP_DIRECT_POD_ROUTING with mode values like cold-start-only / always / disabled) would be better.
That said, keeping the current flag and adding a separate one later is acceptable too.
3. WaitForReady could return endpoint info (agree with @linkvt)
Since WaitForReady already blocks until the service has ready endpoints, it's the natural point where we know a pod is ready. Having it also return a candidate endpoint would:
- Eliminate the TOCTOU gap between "became ready" and
PickReadyEndpoint - Simplify the endpoint resolver middleware (no separate
PickReadyEndpointcall) - Enable future "always direct-pod" mode trivially
This doesn't need to happen in this PR, but would be a cleaner design for the next iteration.
Code-Level Issues
4. crypto/rand is overkill for load-balancing selection
idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(candidates))))
if err != nil {
return "", 0, false
}Using crypto/rand + big.Int for a simple random index in a hot path adds unnecessary overhead (syscall + heap allocation per call). math/rand/v2 (already used in this repo) with its auto-seeded global generator is sufficient for load-balancing randomness. The crypto/rand error path also silently falls back to ClusterIP, which could mask a real system issue (exhausted entropy).
5. The Update hot path has a subtle ordering concern
if v, ok := c.states.Load(serviceKey); ok {
ptr := v.(*atomic.Pointer[serviceState])
oldState := ptr.Load()
ptr.Store(newState)
...
}If two concurrent Update calls race on the same key, the non-atomic Load-then-Store means the second writer's state could silently overwrite the first's. In practice informer event handlers are serialized per-resource, so this is likely safe today, but it's fragile. A CompareAndSwap loop or brief per-key lock would be more robust and future-proof.
6. PickReadyEndpoint returns ok=false when portName="" and all ports are named
This means users who configure their InterceptorRoute with a numeric Port (resulting in PortName="") will never get direct-pod routing even when the feature is enabled. This limitation should be documented clearly so users know they must use named ports for direct-pod routing to activate.
7. schemeHTTPS / schemeHTTP constants placement
These constants are defined in routing.go but referenced from endpoint_resolver.go (same package, so it compiles). Consider placing shared constants in a dedicated file or at the package level to make the dependency explicit and avoid confusion when reading either file in isolation.
Minor Nits
- The startup validation in
main.gobecomes unnecessary onceTLSUpstreamDisabledis dropped. If you keep it, note thatruntime.Goexit()after logging is consistent with the rest ofmain.gobut a brief comment explaining why (deferred cleanup) would help future readers. - Missing changelog entry and documentation (per the unchecked checklist items).
What's Good
- The
serviceStateimmutable snapshot approach is well-designed: atomic pointer swap gives readers lock-free access without contention. - The
(ip, port)pairing per-slice incollectServiceStatecorrectly handles rolling deploys where different slices expose different port numbers for the same port name. - Deduplication logic prevents the same
(ip, port)pair from appearing multiple times across slices. - Test coverage is thorough: multi-port, unnamed port, empty port name, heterogeneous ports, and the rolling-deploy scenario are all exercised.
- Graceful fallback to ClusterIP when
PickReadyEndpointreturnsok=falseis safe and correct.
Summary
The core idea and cache refactoring are solid. Main blockers before merge:
- Resolve the TLS approach (drop
TLSUpstreamDisabled, useServerNameor scope to non-TLS only). - Switch
crypto/randtomath/rand/v2. - Add changelog + document the named-port requirement.
- Adds KEDA_HTTP_DIRECT_POD_ROUTING (disabled|cold-start-only); when set, rewrites upstream URL to a ready pod's ip:port after scale-from-zero - ReadyEndpointsCache now tracks (ip, port) pairs keyed by named port via EndpointSlice updates; WaitForReady returns a podHost alongside isColdStart - Empty podHost (no portName match) leaves upstream URL unchanged, falling back to ClusterIP as before - TransportPool keyed on (responseHeaderTimeout, serverName) with per-transport TLSClientConfig.ServerName for correct SNI per pod - Routing middleware stores intended TLS hostname in context before any URL rewrite so SNI always points to the original service Signed-off-by: Atharva Pakade <pakade310@gmail.com>
Fedosin
left a comment
There was a problem hiding this comment.
Review
The code looks good overall and I don't see any reason not to merge it except the IPv6 incompatibility in pickHost. All previous review feedback has been addressed.
Issues
1. IPv6 compatibility in pickHost (medium)
pickHost formats the host as:
return fmt.Sprintf("%s:%d", ep.ip, ep.port)For IPv6 pod addresses (e.g. fd00::1), this would produce fd00::1:8080 instead of the correct [fd00::1]:8080. Please use net.JoinHostPort(ep.ip, strconv.Itoa(int(ep.port))) instead, which correctly handles both IPv4 and IPv6. Kubernetes dual-stack clusters assign IPv6 pod IPs, so this is a real (if uncommon) scenario.
2. Truncated comment in context.go (nit)
// ContextWithUpstreamServerName stores the intended TLS server name (SNI) for the upstream.
// This must be set before any middleware rewrites the upstream URL (e.g. direct-to-pod),The second line ends with a comma, suggesting a truncated sentence. Should be completed or end with a period.
3. Broadcast channel comment accuracy (nit)
The comment on the notifyCh field says "closed on any change", but swapState now only broadcasts on readiness transitions (oldState.hasReady() != newState.hasReady()). The comment should reflect this narrower contract.
linkvt
left a comment
There was a problem hiding this comment.
The initially trivial idea of "just connecting to endpoints instead of cluster ips" really showed to be way more tricky...
I tried to understand all changes but didn't finish the whole PR yet as some findings will again change things quite a bit IMO. I understand this is quite complex, let me know if I should give it a try as well based on these changes to show what I'm thinking about if my comments are a bit cryptic.
| switch s.DirectPodRouting { | ||
| case DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly: | ||
| // valid | ||
| default: | ||
| panic(fmt.Sprintf("invalid KEDA_HTTP_DIRECT_POD_ROUTING value %q: must be %q or %q", | ||
| s.DirectPodRouting, DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly)) | ||
| } |
There was a problem hiding this comment.
This could be implemented directly close to the struct by implementing the TextMarshaller interface so that MustParseServing stays clean:
func (m *DirectPodRoutingMode) UnmarshalText(text []byte) error {
switch DirectPodRoutingMode(text) {
case DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly:
*m = DirectPodRoutingMode(text)
return nil
default:
return fmt.Errorf("invalid value %q: must be %q or %q",
text, DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly)
}
}| LogRequests bool `env:"KEDA_HTTP_LOG_REQUESTS" envDefault:"false"` | ||
| // DirectPodRouting controls when the interceptor routes directly to a pod IP | ||
| // instead of the ClusterIP service. Valid values: "disabled", "cold-start-only". | ||
| DirectPodRouting DirectPodRoutingMode `env:"KEDA_HTTP_DIRECT_POD_ROUTING" envDefault:"disabled"` |
There was a problem hiding this comment.
Couldnt this be a simple boolean? Internally we later only use a boolean as well.
Is there a reason for why we would use direct pod routing only on cold starts according to this config (didnt check the full PR yet)?
There was a problem hiding this comment.
@Fedosin Suggested in the review that we should give option for different modes comment link section 2. Scope: cold-start-only vs. all requests
| // Note: direct-pod routing (when enabled on the interceptor) requires portName; | ||
| // routes using a numeric port will always be forwarded via the Service ClusterIP. |
There was a problem hiding this comment.
This seems like a major restriction, I don't see a reason for it yet to be honest:
I see these cases:
- single port service, unnamed port
- endpointslice contains only one port with
name: "" - IR uses port number: resolve port name from Service, match by name in EndpointSlice
- IR uses port name: not possible (service port is unnamed, nothing to reference)
- endpointslice contains only one port with
- single port service, named port
- endpointslice contains only one port where
nameis equal to the name of the port in the service - IR uses port number: resolve port name from Service, match by name in EndpointSlice
- IR uses port name: exact match by name
- endpointslice contains only one port where
- multi port service, unnamed ports
- not possible, Kubernetes requires names when multiple ports are defined: https://kubernetes.io/docs/concepts/services-networking/service/#multi-port-services
- multi port service, named ports
- endpointslice contains multiple ports, each named after the corresponding service port
- IR uses port number: resolve port name from Service, match by name in EndpointSlice
- IR uses port name: exact match by name
In all cases, we can resolve to the correct EndpointSlice port without requiring portName in the IR.
The "IR uses port number" cases all work the same way: resolve the numeric port to its port name via the Service object.
The routing middleware (resolveUpstreamURL) already does that Service lookup to build the upstream URL, so it could pass the resolved port name down via context at the same time.
There was a problem hiding this comment.
In the multi port service, named ports case, there is a possibility that the user specifies port names on the Service but does not provide portName at the time of configuring the InterceptorRoute. The note was targeted at this case to avoid misconfigurations.
Based on your suggestion, I feel this case can also be resolved by resolving the portName in the Routing middleware and passing it via context, by adding a separate resolvePortName method that does the reverse lookup (numeric port → port name via Service.Spec.Ports). In the EndpointResolver middleware, we can then refer to the resolved portName from context instead of ir.Spec.Target.PortName directly.
| // | ||
| // NOTE: Transports are never evicted, we expect a low cardinality of timeouts | ||
| // NOTE: Transports are never evicted; we expect a low cardinality of (timeout, serverName) combinations. | ||
| type TransportPool struct { |
There was a problem hiding this comment.
I think all of these transport pool changes could be avoided when we would be using transport.TLSDialContext and resolve the ServerName while dialing.
This would be a smaller change than what we're doing right and IMO cleaner, but a bit more complicated.
The base transport is currently created in proxy.go outside of this whole middleware chain, not sure if it should still be placed there after this change.
There was a problem hiding this comment.
Agreed, we discussed this in a previous comment and based on your recommendation i opted this approach.
IMO we should have a seperate PR/Issue addressing the ability to configure TLSDialContext, DialContext and moving base transport outside the whole middleware chain.
| // "namespace/service" -> *atomic.Bool | ||
| ready sync.Map | ||
| // "namespace/service" -> *atomic.Pointer[serviceState] | ||
| states sync.Map |
There was a problem hiding this comment.
I read a bit more into this and noticed trhat sync.Map offers already Swap etc, we don't need atomic wrappers at all as it seems, see https://pkg.go.dev/sync#Map.Swap
There was a problem hiding this comment.
With sync.Map.Swap, every update would call Swap on the map itself, which acquires a mutex internally even for existing keys (lock on every update).
The atomic wrapper is the entire reason we avoid that — the map value is a stable pointer that never changes, only what it points to does (lock only on first insertion).
|
|
||
| hasReady := slices.ContainsFunc(endpointSlices, hasAnyReadyEndpoint) | ||
| newState := collectServiceState(endpointSlices) | ||
|
|
There was a problem hiding this comment.
I dont understand the new code completely yet but dont understand, why this hot and cold path would be needed?
Can't we just swap and broadcast if ready changes?
There was a problem hiding this comment.
The hot/cold split exists precisely to keep the hot path lock-free — as mentioned in comment, if we just called sync.Map.Swap directly, we'd acquire a mutex on every endpoint update, defeating the entire point of the atomic wrapper.
The cold path pays the mutex cost once via LoadOrStore (first insertion). After that, every update stays on the hot path: states.Load + ptr.CompareAndSwap — no map write, no mutex.
Key changes:
New
KEDA_HTTP_DIRECT_POD_ROUTINGenvironment variabledisabled(default) andcold-start-only.ReadyEndpointsCachenow tracks (ip, port) pairsWaitForReadynow returns a podHost (ip:port string) in addition to the isColdStart bool. Stores ready endpoint state as a map ofserviceKey → {ip, ports-by-name}backed by EndpointSlice updates, replacing the previous bool-only ready flag.Cold-start URL rewrite in EndpointResolver
TLS SNI preservation
TransportPool.Getnow takes a serverName argument and keys transports on(responseHeaderTimeout, serverName), applyingtls.Config.ServerNameper transport.util.UpstreamServerNameFromContexttotransportPool.Get.Checklist
README.mddocs/directoryFixes #1473