Add stream topic label to realtime ingestion delay metrics#18175
Add stream topic label to realtime ingestion delay metrics#18175rsrkpatwari1234 wants to merge 56 commits into
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18175 +/- ##
============================================
- Coverage 63.73% 63.72% -0.01%
Complexity 1932 1932
============================================
Files 3292 3292
Lines 201503 201517 +14
Branches 31320 31321 +1
============================================
- Hits 128429 128426 -3
- Misses 62794 62808 +14
- Partials 10280 10283 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Will request to re-look the suggestion I made above. Lets remove the everything new added to Abstract Metrics. |
|
@rsrkpatwari1234 what's the eta for this? I need it for logging. |
| public static String getStreamIngestionMetricTableKey(String tableNameWithType, String topicName, | ||
| int streamPartitionId, String consumerClientIdSuffix) { | ||
| if (StringUtils.isNotBlank(consumerClientIdSuffix)) { | ||
| return tableNameWithType + "-" + topicName + "-" + streamPartitionId + "-" + consumerClientIdSuffix; |
There was a problem hiding this comment.
(MAJOR) Is this a behavior change? This is not the same way how AbstractMetrics connecting parts
There was a problem hiding this comment.
There is no behavior change. Realtime consumption works in same manner. _clientId in RealtimeSegmentDataManager is already built with the same rule: append consumer.client.id.suffix only when StringUtils.isNotBlank(...). getStreamIngestionMetricTableKey mirrors that so ingestion-delay metric keys stay consistent with the consumer’s metric table key.
AbstractMetrics does not define how topic, partition, or suffix are encoded. It only receives a fully formed first segment and composes names like gaugeName + "." + getTableName(thatString). That is why we added the function here
| List<StreamConfig> streamConfigs = IngestionConfigUtils.getStreamConfigs(tableConfig); | ||
| StreamConfig streamConfig = IngestionConfigUtils.getStreamConfigFromPinotPartitionId(streamConfigs, | ||
| pinotPartitionId); | ||
| int streamPartitionId = |
There was a problem hiding this comment.
(MAJOR) Is this behavior change? What happens to single-topic stream?
There was a problem hiding this comment.
For single-topic tables, IngestionConfigUtils.getStreamPartitionIdFromPinotPartitionId(tableConfig, pinotPartitionId) returns pinotPartitionId as-is (no % 10000 remap). That matches RealtimeSegmentDataManager, which uses _streamPartitionId = _partitionGroupId when there is only one stream. So which physical partition the delay refers to, is unchanged for single-topic.
We only change the metric identity from “table + Pinot partition only” to the same tableNameWithType-topic-streamPartition shape as stream consumer metrics, so a topic label (and consistent partition labeling) appears. Hence users would have to update the dashboards and alerts which rely on these metrices.
…nfigUtils.java Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
Yes. For single-topic tables as well, the underlying metric name string changes, not only for multi-topic. Before: gauges looked like After: the “table” segment is |
This is a big backward incompatible change, which will break all monitoring system right? |
Problem
With multi-topic ingestion, ingestion delay gauges only keyed by table and partition group id made it hard to tell which Kafka (or other stream) topic was behind, often requiring indirect mapping from partition to topic in config (apache/pinot#18099).
Approach
Reuse the same metric table key pattern as stream consumers (
tableNameWithType-topic-streamPartitionId, plus optional instance consumer client id suffix when set), matchingRealtimeSegmentDataManager’s_clientIdshape. Register gauges withsetOrUpdateTableGauge/removeTableGaugeinstead ofsetOrUpdatePartitionGauge/removePartitionGauge, so existing JMX → Prometheus rules can attach topic and partition labels.Changes
IngestionConfigUtils.getStreamIngestionMetricTableKey(...)in pinot-spi.IngestionDelayTrackerbuilds that key per Pinot partition (using existing multi-stream helpers) and applies it to delay, end-to-end delay, reporting status, and offset-related gauges when supported.ServerPrometheusMetricsTesttreats those ingestion gauges like other client-id stream metrics for export assertions.IngestionConfigUtilsTestcovers the new helperNote : Prometheus series names / label sets for these gauges change vs the old “table + partition id only” shape; dashboards and alerts that targeted the old layout need updating.
Fixes #18099