Skip to content

[metrics] Emit concurrency_slot counters as aggregates#4738

Open
AhmedSoliman wants to merge 2 commits into
mainfrom
pr4738
Open

[metrics] Emit concurrency_slot counters as aggregates#4738
AhmedSoliman wants to merge 2 commits into
mainfrom
pr4738

Conversation

@AhmedSoliman
Copy link
Copy Markdown
Contributor

@AhmedSoliman AhmedSoliman commented May 14, 2026

  • With VQueues enabled, the concurrency limit is per-node level so we remove the extra label (invoker_id)
  • Removing invoker_id from concurrency_slot.acquired/released metrics to reduce cardinality and to be forward compatible with vqueues

Stack created with Sapling. Best reviewed with ReviewStack.

@AhmedSoliman AhmedSoliman force-pushed the pr4738 branch 2 times, most recently from a7b410f to 00c49f1 Compare May 14, 2026 09:10
@AhmedSoliman AhmedSoliman marked this pull request as ready for review May 14, 2026 09:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00c49f1e56

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +518 to +519
&entry_id.to_invocation_id(qid.partition_key())
.expect("This version does not support yielding vqueues entries other than invocations"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid panicking on yielded state-mutation entries

SchedulerAction::Yield now unconditionally converts yield_action.key.entry_id() to an InvocationId with expect(...), but the scheduler can yield any running head, including EntryKind::StateMutation entries (see attempt_to_run handling both invocation and state mutation kinds). In that case to_invocation_id(...) returns None and this expect will panic the partition state machine, turning a recoverable scheduling condition into a crash.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not true in this version. State mutations will never be observed as running (we transition to finished in the same write batch as we start running them). Also, invoker never sees them.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Test Results

 5 files   -   3   5 suites   - 3   2m 5s ⏱️ - 2m 44s
35 tests  -  25  35 ✅  -  25  0 💤 ±0  0 ❌ ±0 
56 runs   - 211  56 ✅  - 211  0 💤 ±0  0 ❌ ±0 

Results for commit a50ccee. ± Comparison against base commit 5ea7c39.

This pull request removes 60 and adds 35 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[1]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[2]
dev.restate.sdktesting.tests.CallOrdering ‑ ordering(boolean[], Client)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromAdminAPI(BlockingOperation, Client, URI)[3]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[1]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[2]
dev.restate.sdktesting.tests.Cancellation ‑ cancelFromContext(BlockingOperation, Client)[3]
dev.restate.sdktesting.tests.Combinators ‑ awakeableOrTimeoutUsingAwaitAny(Client)
…
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithFailure(Client)
dev.restate.sdktesting.tests.AwakeableIngressEndpointTest ‑ completeWithSuccess(Client)
dev.restate.sdktesting.tests.AwakeableLeaderTransferTest ‑ awakeableCompletionsAreNotLostDuringLeaderTransfer(Client, ContainerHandle)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ completeAwakeable(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ completeRetryableOperation(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ proxyCallShouldBeDone(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$NewVersion ‑ proxyOneWayCallShouldBeDone(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ createAwakeable(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ startOneWayProxyCall(Client)
dev.restate.sdktesting.tests.BackwardCompatibilityTest$OldVersion ‑ startProxyCall(Client)
…

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Contributor

@muhamadazmy muhamadazmy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

Copy link
Copy Markdown
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for future proofing the invoker metrics @AhmedSoliman. I had one question regarding keeping the per invoker concurrency limit in the presence of aggregate acquire and release metrics. Apart from that, +1 for merging :-)


These counters make it easy to derive:
- Rate of slot acquisition and release
- Available slots: `restate.invoker.concurrency_limit - (restate.invoker.concurrency_slots.acquired - restate.invoker.concurrency_slots.released)`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably makes no longer sense if vqueues are disabled (limits are being reported per invoker id) and we are exposing the acquired and released concurrency slots metric as an aggregate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, users won't be able to calculate the available slots per invoker if vqueues are disabled.

Comment on lines +67 to +68
gauge!(INVOKER_CONCURRENCY_LIMIT, "invoker_id" => invoker_id)
.set(available_slots.get() as f64);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering whether this metric makes sense to keep if users can no longer calculate the available slots per invoker.

Full implementation for yielding invocations from the invoker back to the vqueues scheduler. This unifies with the existing support for memory-budget-based invocation yielding.
- With VQueues enabled, the concurrency limit is per-node level so we remove the extra label (invoker_id)
- Removing invoker_id from concurrency_slot.acquired/released metrics to reduce cardinality and to be forward compatible with vqueues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants