fix: handle HTTPScaledObject not found gracefully in scaler#1488
fix: handle HTTPScaledObject not found gracefully in scaler#1488Fedosin wants to merge 1 commit into
Conversation
Signed-off-by: Mikhail Fedosin <mfedosin@redhat.com>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
| metricName := MetricName(namespacedName) | ||
|
|
||
| httpso := &httpv1alpha1.HTTPScaledObject{} | ||
| if err := e.reader.Get(ctx, types.NamespacedName{Namespace: sor.Namespace, Name: httpScaledObjectName}, httpso); err != nil { |
There was a problem hiding this comment.
Minor nitpicks:
- we have
types.NamespacedName{Namespace: sor.Namespace, Name: httpScaledObjectName}here and above thek8s.NamespacedNameFromNameAndNamespace(...)which also returns aNamespacedName not yet available in cacheis maybe a bit specific?
There was a problem hiding this comment.
Pull request overview
This PR addresses a race condition in the external scaler where controller-runtime’s cache can temporarily return NotFound for a newly created HTTPScaledObject, causing GetMetrics errors that terminate the StreamIsActive gRPC stream and trigger reconnection/error storms.
Changes:
- Handle
apierrors.IsNotFoundfrome.reader.Get()inGetMetricsby returning a metric value of0(inactive) and logging at a verbose/debug level. - Refactor metric name generation to happen before the
Get()call so it can be reused in theNotFoundearly-return path. - Add an Unreleased changelog entry documenting the fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scaler/handlers.go | Adds IsNotFound handling in GetMetrics to keep the gRPC stream alive during cache sync lag. |
| CHANGELOG.md | Documents the scaler fix in the Unreleased “Fixes” section. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }, | ||
| }, nil | ||
| } | ||
| lggr.Error(err, "unable to get HTTPScaledObject", "name", httpScaledObjectName, "namespace", sor.Namespace, "httpScaledObjectName", httpScaledObjectName) |
There was a problem hiding this comment.
The error log fields here are inconsistent and redundant: the key "name" is set to httpScaledObjectName, and "httpScaledObjectName" is also provided with the same value. Elsewhere in this file (e.g., GetMetricSpec) "name" refers to the ScaledObject name (sor.Name). Consider using a distinct key for the ScaledObject name (or set "name" to sor.Name) and keep only one field for the HTTPScaledObject name to avoid confusing logs.
| lggr.Error(err, "unable to get HTTPScaledObject", "name", httpScaledObjectName, "namespace", sor.Namespace, "httpScaledObjectName", httpScaledObjectName) | |
| lggr.Error(err, "unable to get HTTPScaledObject", "name", sor.Name, "namespace", sor.Namespace, "httpScaledObjectName", httpScaledObjectName) |
| if err := e.reader.Get(ctx, types.NamespacedName{Namespace: sor.Namespace, Name: httpScaledObjectName}, httpso); err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| lggr.V(1).Info("HTTPScaledObject not yet available in cache, returning 0", "httpScaledObjectName", httpScaledObjectName, "namespace", sor.Namespace) | ||
| return &externalscaler.GetMetricsResponse{ | ||
| MetricValues: []*externalscaler.MetricValue{ | ||
| { | ||
| MetricName: metricName, | ||
| MetricValue: 0, | ||
| }, | ||
| }, | ||
| }, nil | ||
| } |
There was a problem hiding this comment.
The new NotFound handling path (returning a 0 metric value with nil error) changes behavior in a way that's easy to regress. Please add a unit test that exercises e.reader.Get returning a NotFound (e.g., fake client with no HTTPScaledObject) and asserts GetMetrics returns MetricValue=0 with no error (and ideally that StreamIsActive/IsActive remains healthy).
| // generated the metric name for HTTPScaledObject | ||
| namespacedName := k8s.NamespacedNameFromNameAndNamespace(httpScaledObjectName, sor.Namespace) | ||
| metricName := MetricName(namespacedName) |
There was a problem hiding this comment.
Nit: the comment reads "generated the metric name..." but this code is generating it now; consider changing to "generate the metric name..." for grammatical correctness (same phrasing appears in a few places in this file).
wozniakjan
left a comment
There was a problem hiding this comment.
This PR fixes the race condition by handling the IsNotFound case in GetMetrics gracefully: instead of returning an error, it returns metric value 0 (not active) and logs at debug level. Once the cache syncs (typically within a second), subsequent calls return real values. The gRPC stream stays alive, avoiding the reconnection storm.
won't this risk scaling the app prematurely and kind of randomly to zero?
scenario:
- an autoscaled app configured with minReplicas = 0
- heavy traffic going through interceptor
- scaler restarted, caches not yet synced
- KEDA asks for metric, scaler can't find HSO => returns 0 => KEDA happily scales to zero
- meanwhile current implementation - errors are covered there by fallback and failure threshold so app won't scale to 0
linkvt
left a comment
There was a problem hiding this comment.
Gave it another look after Jans review request here, only found the other location where the IsNotFound check might possibly missing.
@wozniakjan from what I understand the scaler will not respond until the caches are synced:
Lines 125 to 137 in 4dd989f
I guess the fallback scenario would only make sense then if the scaler can't access the HTTPSO/IR for some reason while the interceptor is able and continues to serve traffic or am I missing something?
| } | ||
|
|
||
| httpso := &httpv1alpha1.HTTPScaledObject{} | ||
| if err := e.reader.Get(ctx, types.NamespacedName{Namespace: sor.Namespace, Name: httpScaledObjectName}, httpso); err != nil { |
There was a problem hiding this comment.
I'm currently not sure if the fix below was enough for the errors to disappear as GetMetricSpec is IIRC called before GetMetrics and would probably have errors as well?
Or should we add the same check there as well?
This PR fixes the race condition by handling the
IsNotFoundcase inGetMetricsgracefully: instead of returning an error, it returns metric value0(not active) and logs at debug level. Once the cache syncs (typically within a second), subsequent calls return real values. The gRPC stream stays alive, avoiding the reconnection storm.Checklist
README.mddocs/directoryFixes #1487