[metrics] Emit concurrency_slot counters as aggregates#4738
[metrics] Emit concurrency_slot counters as aggregates#4738AhmedSoliman wants to merge 2 commits into
Conversation
a7b410f to
00c49f1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00c49f1e56
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| &entry_id.to_invocation_id(qid.partition_key()) | ||
| .expect("This version does not support yielding vqueues entries other than invocations"), |
There was a problem hiding this comment.
Avoid panicking on yielded state-mutation entries
SchedulerAction::Yield now unconditionally converts yield_action.key.entry_id() to an InvocationId with expect(...), but the scheduler can yield any running head, including EntryKind::StateMutation entries (see attempt_to_run handling both invocation and state mutation kinds). In that case to_invocation_id(...) returns None and this expect will panic the partition state machine, turning a recoverable scheduling condition into a crash.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Not true in this version. State mutations will never be observed as running (we transition to finished in the same write batch as we start running them). Also, invoker never sees them.
Test Results 5 files - 3 5 suites - 3 2m 5s ⏱️ - 2m 44s Results for commit a50ccee. ± Comparison against base commit 5ea7c39. This pull request removes 60 and adds 35 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
4fa9c1b to
f3d79db
Compare
5f322cd to
686642c
Compare
tillrohrmann
left a comment
There was a problem hiding this comment.
Thanks for future proofing the invoker metrics @AhmedSoliman. I had one question regarding keeping the per invoker concurrency limit in the presence of aggregate acquire and release metrics. Apart from that, +1 for merging :-)
|
|
||
| These counters make it easy to derive: | ||
| - Rate of slot acquisition and release | ||
| - Available slots: `restate.invoker.concurrency_limit - (restate.invoker.concurrency_slots.acquired - restate.invoker.concurrency_slots.released)` |
There was a problem hiding this comment.
This probably makes no longer sense if vqueues are disabled (limits are being reported per invoker id) and we are exposing the acquired and released concurrency slots metric as an aggregate.
There was a problem hiding this comment.
Right now, users won't be able to calculate the available slots per invoker if vqueues are disabled.
| gauge!(INVOKER_CONCURRENCY_LIMIT, "invoker_id" => invoker_id) | ||
| .set(available_slots.get() as f64); |
There was a problem hiding this comment.
Wondering whether this metric makes sense to keep if users can no longer calculate the available slots per invoker.
Full implementation for yielding invocations from the invoker back to the vqueues scheduler. This unifies with the existing support for memory-budget-based invocation yielding.
- With VQueues enabled, the concurrency limit is per-node level so we remove the extra label (invoker_id) - Removing invoker_id from concurrency_slot.acquired/released metrics to reduce cardinality and to be forward compatible with vqueues
Stack created with Sapling. Best reviewed with ReviewStack.