Skip to content

feat(interceptor): Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow.#1585

Open
AtharvaPakade wants to merge 1 commit into
kedacore:mainfrom
AtharvaPakade:main
Open

feat(interceptor): Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow.#1585
AtharvaPakade wants to merge 1 commit into
kedacore:mainfrom
AtharvaPakade:main

Conversation

@AtharvaPakade
Copy link
Copy Markdown
Contributor

@AtharvaPakade AtharvaPakade commented Apr 18, 2026

Key changes:

New KEDA_HTTP_DIRECT_POD_ROUTING environment variable

  • Adds DirectPodRoutingMode type with two values: disabled (default) and cold-start-only.
  • On cold start, the interceptor waits for a pod to become ready then rewrites the upstream URL directly to that pod's ip:port, bypassing the service ClusterIP and avoiding latency from kube-proxy rules that are slow to propagate after scale-from-zero.

ReadyEndpointsCache now tracks (ip, port) pairs

  • WaitForReady now returns a podHost (ip:port string) in addition to the isColdStart bool. Stores ready endpoint state as a map of serviceKey → {ip, ports-by-name} backed by EndpointSlice updates, replacing the previous bool-only ready flag.
  • Port lookup is keyed on the named port from the InterceptorRoute's PortName field, so multi-port pods route to thecorrect container port.
  • Returns podHost="" when PortName has no match, leaving the upstream URL unchanged.

Cold-start URL rewrite in EndpointResolver

  • When DirectPodOnColdStart=true, isColdStart=true, and podHost!="", the upstream URL's Host is replaced with the pod ip:port.
  • The rewrite is skipped when the request falls back to an alternate upstream — the fallback URL is not overwritten.
  • For HTTPS fallback upstreams, the TLS server name context is updated to the fallback hostname.

TLS SNI preservation

  • Routing middleware now stores the intended TLS server name in context before any downstream middleware can rewrite the upstream URL.
  • TransportPool.Get now takes a serverName argument and keys transports on (responseHeaderTimeout, serverName), applying tls.Config.ServerName per transport.
  • Upstream handler passes util.UpstreamServerNameFromContext to transportPool.Get.

Checklist

Fixes #1473

@AtharvaPakade AtharvaPakade requested a review from a team as a code owner April 18, 2026 20:36
Copilot AI review requested due to automatic review settings April 18, 2026 20:36
@snyk-io
Copy link
Copy Markdown

snyk-io Bot commented Apr 18, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in “direct-to-pod” routing mode during cold starts to avoid initial request failures caused by delayed kube-proxy/ClusterIP rule sync, and reworks the ready-endpoints cache to retain richer EndpointSlice-derived state (ready pod IPs + port name resolution).

Changes:

  • Introduce KEDA_HTTP_DIRECT_POD_ON_COLD_START and plumb it through the interceptor handler chain.
  • Rework ReadyEndpointsCache to store an immutable per-service snapshot (ready IPs + portName → port map) and add PickReadyEndpoint.
  • Add/update unit tests for port-name resolution and cold-start direct-to-pod URL rewriting.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/k8s/ready_endpoints_cache.go Stores per-service snapshot state and adds PickReadyEndpoint selection logic.
pkg/k8s/ready_endpoints_cache_test.go Expands tests to cover endpoint picking and port extraction/dedup behavior.
interceptor/middleware/endpoint_resolver.go Rewrites upstream URL to pod IP:port on cold start when enabled.
interceptor/middleware/endpoint_resolver_test.go Adds middleware tests validating direct-to-pod behavior and port-name selection.
interceptor/proxy.go Wires the new config flag into the middleware config.
interceptor/config/serving.go Adds the new env-var-backed serving configuration flag.
Comments suppressed due to low confidence (1)

pkg/k8s/ready_endpoints_cache.go:40

  • The notifyCh comment says it is closed "on any change", but Update now only calls broadcast() when readiness transitions (or when slices are deleted). Updating the comment to match the actual broadcast semantics would prevent future readers from assuming every EndpointSlice update will wake waiters.
	// Broadcast mechanism: the channel is closed on any change,
	// then replaced with a fresh one. Waiters select on the channel.
	mu       sync.Mutex
	notifyCh chan struct{}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/k8s/ready_endpoints_cache.go Outdated
Comment thread pkg/k8s/ready_endpoints_cache.go Outdated
Comment thread interceptor/config/serving.go Outdated
Comment thread pkg/k8s/ready_endpoints_cache.go
Comment thread interceptor/middleware/endpoint_resolver.go Outdated
Comment thread interceptor/config/serving.go Outdated
// TLSUpstreamDisabled disables TLS for upstream connections even when the proxy server
// itself has TLS enabled. When true, requests are forwarded to upstream pods over plain
// HTTP regardless of the proxy TLS setting.
TLSUpstreamDisabled bool `env:"KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED" envDefault:"false"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing here from our chat in Slack:

As far as I understand whether TLS is used is not relevant, we just change the target the traffic is sent to. When TLS is used we can if I understand it correctly set the tlsConfig.ServerName to the service name while still sending our request to the Endpoint IP, see https://pkg.go.dev/crypto/tls#Config

	// ServerName is used to verify the hostname on the returned
	// certificates unless InsecureSkipVerify is given. It is also included
	// in the client's handshake to support virtual hosting unless it is
	// an IP address.
	ServerName string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linkvt
Correct, we can use ServerName (it will be the SNI) which can make the direct to pod IP TLS handshake successful.

I have a few implementations in mind. Can you help me figure out which one I should proceed with?

A. The base transport is built at startup, and the upstream handler uses it to seed a pool keyed by responseHeaderTimeout. Should we extend that same pool to key by (timeout, serverName) combinations? This increases pool cardinality, and since serverName-aware transports are only needed during cold start — is that trade-off worth it?

B. Rather than extending the existing pool, have a dedicated cold start transport pool keyed by (timeout, serverName). This keeps the hot path unchanged and scopes the cardinality growth to cold start scenarios only.

C. Use a temporary, non-pooled transport for cold start requests — created on demand, not reused. This leaves the existing pool entirely untouched, but introduces transport selection logic in the upstream handler to switch between pooled (normal) and ephemeral (cold start) transports.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this, I didn't consider that it would be more complicated 😅
So I think that means we need to reconsider the current TransportPool implementation, A sounds IMO good, we also don't need it just for the cold-start but for all connections I think?
The direct pod routing is useful for cold-starts where syncing the Services ClusterIP might take some time, but can also be used later as well for non cold-start requests.

Pool size is something we could consider now or later in a follow-up PR, the pool will definitely be larger but it's just a few KBs anyway IIRC.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also don't need it just for the cold-start but for all connections I think?

For this PR, IMO we should only optimize direct-to-pod for cold start request.
But once we have seperate PR which detect readiness based by probing, like I mentioned in this issue comment then we can consider this optimization for all requests.

Additionally, i will also drop this KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED flag. I think we also need a PR that give ability to configure the Upstream TLS config . Because currently we use the same TLS config for upstream. Having this can also give user ability for mTLs with upstream pods.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we just use/modify the current

	isColdStart, err := er.readyCache.WaitForReady(waitCtx, serviceKey)

to also return an endpoint IP?
We would not do any extra work at all then and could use the endpoint IP in both cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using WaitForReady makes sense only when we want to also do direct-to-pod for non cold start request as well.
@linkvt Are you proposing that we do direct-to-pod for all request in this PR itself ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought that would be easier, no special handling for cold start, always take the endpoint IP from our readyCache. Add a flag to opt-out (nobody does opt-in usually) to fallback to the non endpoint upstream.
WaitForReady is IIRC called in both cold and non coldstart scenarios and would be an ideal place to change?

also wdyt @Fedosin ?

Copy link
Copy Markdown
Contributor

@Fedosin Fedosin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Design-Level Feedback

1. KEDA_HTTP_PROXY_TLS_UPSTREAM_DISABLED should be dropped (agree with @linkvt)

This flag globally disables TLS to upstream for all requests, not just cold-start direct-pod ones. That's a blunt instrument with security implications. The proper solution is to set tlsConfig.ServerName to the original service DNS name when dialing a pod IP directly. That way TLS works correctly without needing a kill-switch. I'd suggest either:

  • Handling TLS via ServerName in this PR, or
  • Deferring TLS+direct-pod to a follow-up and only enabling direct-pod for non-TLS setups here.

2. Scope: cold-start-only vs. all requests

@linkvt asked for my opinion. I think starting with cold-start-only is the right incremental approach: it has a clear problem statement (kube-proxy lag) and limits blast radius. However, the API surface should be designed with the eventual "all requests" path in mind. Consider whether the flag name KEDA_HTTP_DIRECT_POD_ON_COLD_START locks us in, or if a more general name (e.g. KEDA_HTTP_DIRECT_POD_ROUTING with mode values like cold-start-only / always / disabled) would be better.

That said, keeping the current flag and adding a separate one later is acceptable too.

3. WaitForReady could return endpoint info (agree with @linkvt)

Since WaitForReady already blocks until the service has ready endpoints, it's the natural point where we know a pod is ready. Having it also return a candidate endpoint would:

  • Eliminate the TOCTOU gap between "became ready" and PickReadyEndpoint
  • Simplify the endpoint resolver middleware (no separate PickReadyEndpoint call)
  • Enable future "always direct-pod" mode trivially

This doesn't need to happen in this PR, but would be a cleaner design for the next iteration.


Code-Level Issues

4. crypto/rand is overkill for load-balancing selection

idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(candidates))))
if err != nil {
    return "", 0, false
}

Using crypto/rand + big.Int for a simple random index in a hot path adds unnecessary overhead (syscall + heap allocation per call). math/rand/v2 (already used in this repo) with its auto-seeded global generator is sufficient for load-balancing randomness. The crypto/rand error path also silently falls back to ClusterIP, which could mask a real system issue (exhausted entropy).

5. The Update hot path has a subtle ordering concern

if v, ok := c.states.Load(serviceKey); ok {
    ptr := v.(*atomic.Pointer[serviceState])
    oldState := ptr.Load()
    ptr.Store(newState)
    ...
}

If two concurrent Update calls race on the same key, the non-atomic Load-then-Store means the second writer's state could silently overwrite the first's. In practice informer event handlers are serialized per-resource, so this is likely safe today, but it's fragile. A CompareAndSwap loop or brief per-key lock would be more robust and future-proof.

6. PickReadyEndpoint returns ok=false when portName="" and all ports are named

This means users who configure their InterceptorRoute with a numeric Port (resulting in PortName="") will never get direct-pod routing even when the feature is enabled. This limitation should be documented clearly so users know they must use named ports for direct-pod routing to activate.

7. schemeHTTPS / schemeHTTP constants placement

These constants are defined in routing.go but referenced from endpoint_resolver.go (same package, so it compiles). Consider placing shared constants in a dedicated file or at the package level to make the dependency explicit and avoid confusion when reading either file in isolation.


Minor Nits

  • The startup validation in main.go becomes unnecessary once TLSUpstreamDisabled is dropped. If you keep it, note that runtime.Goexit() after logging is consistent with the rest of main.go but a brief comment explaining why (deferred cleanup) would help future readers.
  • Missing changelog entry and documentation (per the unchecked checklist items).

What's Good

  • The serviceState immutable snapshot approach is well-designed: atomic pointer swap gives readers lock-free access without contention.
  • The (ip, port) pairing per-slice in collectServiceState correctly handles rolling deploys where different slices expose different port numbers for the same port name.
  • Deduplication logic prevents the same (ip, port) pair from appearing multiple times across slices.
  • Test coverage is thorough: multi-port, unnamed port, empty port name, heterogeneous ports, and the rolling-deploy scenario are all exercised.
  • Graceful fallback to ClusterIP when PickReadyEndpoint returns ok=false is safe and correct.

Summary

The core idea and cache refactoring are solid. Main blockers before merge:

  1. Resolve the TLS approach (drop TLSUpstreamDisabled, use ServerName or scope to non-TLS only).
  2. Switch crypto/rand to math/rand/v2.
  3. Add changelog + document the named-port requirement.

@AtharvaPakade AtharvaPakade changed the title feat: add direct-pod routing on cold start with per-port-name resolution Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow. May 3, 2026
@AtharvaPakade AtharvaPakade changed the title Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow. feat(interceptor): Adds direct-pod routing for cold starts to reduce latency when kube-proxy rule propagation is slow. May 3, 2026
@keda-automation keda-automation requested a review from a team May 3, 2026 18:38
- Adds KEDA_HTTP_DIRECT_POD_ROUTING (disabled|cold-start-only); when set,
  rewrites upstream URL to a ready pod's ip:port after scale-from-zero
- ReadyEndpointsCache now tracks (ip, port) pairs keyed by named port via
  EndpointSlice updates; WaitForReady returns a podHost alongside isColdStart
- Empty podHost (no portName match) leaves upstream URL unchanged, falling
  back to ClusterIP as before
- TransportPool keyed on (responseHeaderTimeout, serverName) with per-transport
  TLSClientConfig.ServerName for correct SNI per pod
- Routing middleware stores intended TLS hostname in context before any URL
  rewrite so SNI always points to the original service

Signed-off-by: Atharva Pakade <pakade310@gmail.com>
@AtharvaPakade
Copy link
Copy Markdown
Contributor Author

Hi @linkvt, @Fedosin, thanks for having a look at the PR! I've pushed the changes based on your suggestions. Please let me know if there's anything else to address.

@AtharvaPakade AtharvaPakade requested review from Fedosin and linkvt May 4, 2026 05:57
Copy link
Copy Markdown
Contributor

@Fedosin Fedosin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The code looks good overall and I don't see any reason not to merge it except the IPv6 incompatibility in pickHost. All previous review feedback has been addressed.

Issues

1. IPv6 compatibility in pickHost (medium)

pickHost formats the host as:

return fmt.Sprintf("%s:%d", ep.ip, ep.port)

For IPv6 pod addresses (e.g. fd00::1), this would produce fd00::1:8080 instead of the correct [fd00::1]:8080. Please use net.JoinHostPort(ep.ip, strconv.Itoa(int(ep.port))) instead, which correctly handles both IPv4 and IPv6. Kubernetes dual-stack clusters assign IPv6 pod IPs, so this is a real (if uncommon) scenario.

2. Truncated comment in context.go (nit)

// ContextWithUpstreamServerName stores the intended TLS server name (SNI) for the upstream.
// This must be set before any middleware rewrites the upstream URL (e.g. direct-to-pod),

The second line ends with a comma, suggesting a truncated sentence. Should be completed or end with a period.

3. Broadcast channel comment accuracy (nit)

The comment on the notifyCh field says "closed on any change", but swapState now only broadcasts on readiness transitions (oldState.hasReady() != newState.hasReady()). The comment should reflect this narrower contract.

Copy link
Copy Markdown
Member

@linkvt linkvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initially trivial idea of "just connecting to endpoints instead of cluster ips" really showed to be way more tricky...
I tried to understand all changes but didn't finish the whole PR yet as some findings will again change things quite a bit IMO. I understand this is quite complex, let me know if I should give it a try as well based on these changes to show what I'm thinking about if my comments are a bit cryptic.

Comment on lines +78 to +84
switch s.DirectPodRouting {
case DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly:
// valid
default:
panic(fmt.Sprintf("invalid KEDA_HTTP_DIRECT_POD_ROUTING value %q: must be %q or %q",
s.DirectPodRouting, DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly))
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be implemented directly close to the struct by implementing the TextMarshaller interface so that MustParseServing stays clean:

func (m *DirectPodRoutingMode) UnmarshalText(text []byte) error {
    switch DirectPodRoutingMode(text) {
    case DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly:
        *m = DirectPodRoutingMode(text)
        return nil
    default:
        return fmt.Errorf("invalid value %q: must be %q or %q",
            text, DirectPodRoutingDisabled, DirectPodRoutingColdStartOnly)
    }
}

See https://github.com/caarlos0/env/blob/a72d89a8930fc800372a6a338a1acf33e5cc3a56/example_test.go#L212-L218

LogRequests bool `env:"KEDA_HTTP_LOG_REQUESTS" envDefault:"false"`
// DirectPodRouting controls when the interceptor routes directly to a pod IP
// instead of the ClusterIP service. Valid values: "disabled", "cold-start-only".
DirectPodRouting DirectPodRoutingMode `env:"KEDA_HTTP_DIRECT_POD_ROUTING" envDefault:"disabled"`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldnt this be a simple boolean? Internally we later only use a boolean as well.
Is there a reason for why we would use direct pod routing only on cold starts according to this config (didnt check the full PR yet)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fedosin Suggested in the review that we should give option for different modes comment link section 2. Scope: cold-start-only vs. all requests

Comment on lines +107 to +108
// Note: direct-pod routing (when enabled on the interceptor) requires portName;
// routes using a numeric port will always be forwarded via the Service ClusterIP.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a major restriction, I don't see a reason for it yet to be honest:

I see these cases:

  • single port service, unnamed port
    • endpointslice contains only one port with name: ""
    • IR uses port number: resolve port name from Service, match by name in EndpointSlice
    • IR uses port name: not possible (service port is unnamed, nothing to reference)
  • single port service, named port
    • endpointslice contains only one port where name is equal to the name of the port in the service
    • IR uses port number: resolve port name from Service, match by name in EndpointSlice
    • IR uses port name: exact match by name
  • multi port service, unnamed ports
  • multi port service, named ports
    • endpointslice contains multiple ports, each named after the corresponding service port
    • IR uses port number: resolve port name from Service, match by name in EndpointSlice
    • IR uses port name: exact match by name

In all cases, we can resolve to the correct EndpointSlice port without requiring portName in the IR.
The "IR uses port number" cases all work the same way: resolve the numeric port to its port name via the Service object.

The routing middleware (resolveUpstreamURL) already does that Service lookup to build the upstream URL, so it could pass the resolved port name down via context at the same time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the multi port service, named ports case, there is a possibility that the user specifies port names on the Service but does not provide portName at the time of configuring the InterceptorRoute. The note was targeted at this case to avoid misconfigurations.

Based on your suggestion, I feel this case can also be resolved by resolving the portName in the Routing middleware and passing it via context, by adding a separate resolvePortName method that does the reverse lookup (numeric port → port name via Service.Spec.Ports). In the EndpointResolver middleware, we can then refer to the resolved portName from context instead of ir.Spec.Target.PortName directly.

//
// NOTE: Transports are never evicted, we expect a low cardinality of timeouts
// NOTE: Transports are never evicted; we expect a low cardinality of (timeout, serverName) combinations.
type TransportPool struct {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of these transport pool changes could be avoided when we would be using transport.TLSDialContext and resolve the ServerName while dialing.
This would be a smaller change than what we're doing right and IMO cleaner, but a bit more complicated.
The base transport is currently created in proxy.go outside of this whole middleware chain, not sure if it should still be placed there after this change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we discussed this in a previous comment and based on your recommendation i opted this approach.

IMO we should have a seperate PR/Issue addressing the ability to configure TLSDialContext, DialContext and moving base transport outside the whole middleware chain.

// "namespace/service" -> *atomic.Bool
ready sync.Map
// "namespace/service" -> *atomic.Pointer[serviceState]
states sync.Map
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read a bit more into this and noticed trhat sync.Map offers already Swap etc, we don't need atomic wrappers at all as it seems, see https://pkg.go.dev/sync#Map.Swap

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With sync.Map.Swap, every update would call Swap on the map itself, which acquires a mutex internally even for existing keys (lock on every update).

The atomic wrapper is the entire reason we avoid that — the map value is a stable pointer that never changes, only what it points to does (lock only on first insertion).


hasReady := slices.ContainsFunc(endpointSlices, hasAnyReadyEndpoint)
newState := collectServiceState(endpointSlices)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand the new code completely yet but dont understand, why this hot and cold path would be needed?
Can't we just swap and broadcast if ready changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hot/cold split exists precisely to keep the hot path lock-free — as mentioned in comment, if we just called sync.Map.Swap directly, we'd acquire a mutex on every endpoint update, defeating the entire point of the atomic wrapper.

The cold path pays the mutex cost once via LoadOrStore (first insertion). After that, every update stays on the hot path: states.Load + ptr.CompareAndSwap — no map write, no mutex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluate sending requests to Endpoint IPs directly

4 participants