Add stream topic label to realtime ingestion delay metrics#18175

Open
rsrkpatwari1234 wants to merge 56 commits into
apache:masterfrom
rsrkpatwari1234:rsrkpatwari1234-issue-18099

@rsrkpatwari1234
Contributor

@rsrkpatwari1234 rsrkpatwari1234 commented Apr 12, 2026

Problem

With multi-topic ingestion, ingestion delay gauges were keyed only by table and partition group id, which made it hard to tell which Kafka (or other stream) topic was behind; finding out often required mapping partitions back to topics through the table config (apache/pinot#18099).

Approach

Reuse the same metric table key pattern as stream consumers (tableNameWithType-topic-streamPartitionId, plus optional instance consumer client id suffix when set), matching RealtimeSegmentDataManager’s _clientId shape. Register gauges with setOrUpdateTableGauge / removeTableGauge instead of setOrUpdatePartitionGauge / removePartitionGauge, so existing JMX → Prometheus rules can attach topic and partition labels.
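
As a rough illustration of the key shape described above, here is a minimal standalone sketch. The class name `StreamMetricKeySketch`, the method name `metricTableKey`, and the sample table and topic names are hypothetical; the blank check is a plain-Java stand-in for `StringUtils.isNotBlank`, and this is not the actual `IngestionConfigUtils` implementation.

```java
final class StreamMetricKeySketch {
  private StreamMetricKeySketch() {
  }

  /**
   * Builds tableNameWithType-topic-streamPartitionId and appends "-suffix"
   * only when the consumer client id suffix is non-blank.
   */
  static String metricTableKey(String tableNameWithType, String topicName, int streamPartitionId,
      String consumerClientIdSuffix) {
    String key = tableNameWithType + "-" + topicName + "-" + streamPartitionId;
    // Stand-in for StringUtils.isNotBlank (assumption: null or whitespace-only means "not set").
    boolean hasSuffix = consumerClientIdSuffix != null && !consumerClientIdSuffix.trim().isEmpty();
    return hasSuffix ? key + "-" + consumerClientIdSuffix : key;
  }

  public static void main(String[] args) {
    // Sample values are made up for illustration.
    System.out.println(metricTableKey("orders_REALTIME", "ordersTopic", 3, null));
    System.out.println(metricTableKey("orders_REALTIME", "ordersTopic", 3, "serverA"));
  }
}
```

Keeping the suffix optional means tables that do not configure a consumer client id suffix get the shorter three-part key.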

Changes

  • Added IngestionConfigUtils.getStreamIngestionMetricTableKey(...) in pinot-spi.
  • IngestionDelayTracker builds that key per Pinot partition (using existing multi-stream helpers) and applies it to delay, end-to-end delay, reporting status, and offset-related gauges when supported.
  • ServerPrometheusMetricsTest treats those ingestion gauges like other client-id stream metrics for export assertions.
  • IngestionConfigUtilsTest covers the new helper.

Note: Prometheus series names and label sets for these gauges change from the old “table + partition id only” shape; dashboards and alerts that target the old layout need updating.

Fixes #18099

@codecov-commenter

codecov-commenter commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 63.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.72%. Comparing base (e76da0e) to head (0743774).
⚠️ Report is 1 commits behind head on master.

Files with missing lines                                Patch %   Lines
...e/data/manager/realtime/IngestionDelayTracker.java   59.25%    11 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18175      +/-   ##
============================================
- Coverage     63.73%   63.72%   -0.01%     
  Complexity     1932     1932              
============================================
  Files          3292     3292              
  Lines        201503   201517      +14     
  Branches      31320    31321       +1     
============================================
- Hits         128429   128426       -3     
- Misses        62794    62808      +14     
- Partials      10280    10283       +3     
Flag                  Coverage Δ
custom-integration1   100.00% <ø> (ø)
integration           100.00% <ø> (ø)
integration1          100.00% <ø> (ø)
integration2          0.00% <ø> (ø)
java-21               63.72% <63.33%> (-0.01%) ⬇️
temurin               63.72% <63.33%> (-0.01%) ⬇️
unittests             63.72% <63.33%> (-0.01%) ⬇️
unittests1            55.77% <63.33%> (-0.01%) ⬇️
unittests2            35.24% <6.66%> (+<0.01%) ⬆️


@noob-se7en noob-se7en added ingestion Related to data ingestion pipeline metrics Related to metrics emission and collection real-time Related to realtime table ingestion and serving labels Apr 13, 2026
@noob-se7en
Contributor

Here the clientId is used as the table name, which is kind of a hack, but we can do the same. Let's add a new method to IngestionConfigUtils which returns the clientId without the suffix. The value it returns can be used as the new metric name in createMetrics.
This should work; we just need to double-check against the Prometheus JMX rules in the server.yml file that the existing metric name does not break and that the topic label gets added to the metric.

Adopted suggestion: added IngestionConfigUtils.getStreamConsumerClientIdWithoutSuffix and introduced _streamConsumerMetricBaseKey in RealtimeSegmentDataManager for all ServerMetrics table keys, while keeping full _clientId for stream consumer APIs and logging.

AbstractMetrics.composeStreamTopicPartitionKey now calls the same helper to avoid duplication. Checked server.yml: the table+topic+partition gauge/meter patterns still apply, and topic/partition labels stay correct (especially when a consumer client id suffix is configured).
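
To make the idea of recovering topic and partition labels from the hyphenated key concrete, here is a hedged sketch. The regular expression below is hypothetical and is not the pattern used in server.yml; it assumes the table name ends in `_REALTIME` or `_OFFLINE` and that no consumer client id suffix is configured. The class and method names are also made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetricKeyLabelSketch {
  // Hypothetical pattern: <table>_<TYPE>-<topic>-<partition>. The topic group is greedy,
  // so hyphens inside the topic name stay in the topic and the last "-<digits>" is the partition.
  private static final Pattern KEY_PATTERN =
      Pattern.compile("^(.+_(?:REALTIME|OFFLINE))-(.+)-(\\d+)$");

  private MetricKeyLabelSketch() {
  }

  /** Returns table/topic/partition labels, or an empty map when the key does not match. */
  static Map<String, String> parseLabels(String metricTableKey) {
    Map<String, String> labels = new LinkedHashMap<>();
    Matcher matcher = KEY_PATTERN.matcher(metricTableKey);
    if (matcher.matches()) {
      labels.put("table", matcher.group(1));
      labels.put("topic", matcher.group(2));
      labels.put("partition", matcher.group(3));
    }
    return labels;
  }

  public static void main(String[] args) {
    System.out.println(parseLabels("orders_REALTIME-orders-v2-3"));
  }
}
```

With a client id suffix appended, the trailing `-<digits>` assumption no longer holds, which is one reason a suffix-free key is easier to map to labels.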

I'd request a second look at the suggestion I made above. Let's remove everything newly added to AbstractMetrics.
We should use the method _serverMetrics.setValueOfTableGauge(_clientId, meterName, value); (where clientId includes the table name, topic name, and partition).

@rseetham
Contributor

@rsrkpatwari1234 what's the eta for this? I need it for logging.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Will this change the emitted metric name? Even for single topic ingestion?

Comment thread pinot-spi/src/main/java/org/apache/pinot/spi/utils/IngestionConfigUtils.java Outdated
public static String getStreamIngestionMetricTableKey(String tableNameWithType, String topicName,
    int streamPartitionId, String consumerClientIdSuffix) {
  if (StringUtils.isNotBlank(consumerClientIdSuffix)) {
    return tableNameWithType + "-" + topicName + "-" + streamPartitionId + "-" + consumerClientIdSuffix;
Contributor

(MAJOR) Is this a behavior change? This is not how AbstractMetrics joins the parts.

Contributor Author

There is no behavior change. Realtime consumption works in the same manner. _clientId in RealtimeSegmentDataManager is already built with the same rule: append consumer.client.id.suffix only when StringUtils.isNotBlank(...). getStreamIngestionMetricTableKey mirrors that so ingestion-delay metric keys stay consistent with the consumer’s metric table key.

AbstractMetrics does not define how topic, partition, or suffix are encoded. It only receives a fully formed first segment and composes names like gaugeName + "." + getTableName(thatString). That is why we added the function here.

List<StreamConfig> streamConfigs = IngestionConfigUtils.getStreamConfigs(tableConfig);
StreamConfig streamConfig = IngestionConfigUtils.getStreamConfigFromPinotPartitionId(streamConfigs,
    pinotPartitionId);
int streamPartitionId =
Contributor

(MAJOR) Is this a behavior change? What happens to a single-topic stream?

Contributor Author

@rsrkpatwari1234 rsrkpatwari1234 May 19, 2026

For single-topic tables, IngestionConfigUtils.getStreamPartitionIdFromPinotPartitionId(tableConfig, pinotPartitionId) returns pinotPartitionId as-is (no % 10000 remap). That matches RealtimeSegmentDataManager, which uses _streamPartitionId = _partitionGroupId when there is only one stream. So the physical partition that the delay refers to is unchanged for single-topic tables.

We only change the metric identity from “table + Pinot partition only” to the same tableNameWithType-topic-streamPartition shape as stream consumer metrics, so a topic label (and consistent partition labeling) appears. Hence users would have to update the dashboards and alerts that rely on these metrics.
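
The single-topic versus multi-topic mapping described above can be sketched as follows. This is an illustrative stand-in, not the real `IngestionConfigUtils` code: the constant name, the `numStreams` parameter, and the use of integer division to recover the stream index are assumptions inferred from the "% 10000 remap" wording.

```java
final class PartitionIdSketch {
  // Assumption: multi-topic tables pad Pinot partition ids in blocks of 10000 per stream.
  static final int PARTITION_PADDING_OFFSET = 10000;

  private PartitionIdSketch() {
  }

  /** Single stream: the id passes through unchanged. Multiple streams: strip the padding. */
  static int streamPartitionId(int pinotPartitionId, int numStreams) {
    return numStreams > 1 ? pinotPartitionId % PARTITION_PADDING_OFFSET : pinotPartitionId;
  }

  /** Index of the stream config the partition belongs to (assumed to be the padding block). */
  static int streamIndex(int pinotPartitionId, int numStreams) {
    return numStreams > 1 ? pinotPartitionId / PARTITION_PADDING_OFFSET : 0;
  }

  public static void main(String[] args) {
    System.out.println(streamPartitionId(7, 1));      // single-topic: unchanged
    System.out.println(streamPartitionId(10003, 2));  // multi-topic: padding removed
    System.out.println(streamIndex(10003, 2));        // multi-topic: which stream config
  }
}
```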

rsrkpatwari1234 and others added 3 commits May 20, 2026 00:12
…nfigUtils.java

Co-authored-by: Xiaotian (Jackie) Jiang <17555551+Jackie-Jiang@users.noreply.github.com>
@rsrkpatwari1234
Contributor Author

Will this change the emitted metric name? Even for single topic ingestion?

Yes. For single-topic tables as well, the underlying metric name string changes, not only for multi-topic.

Before: gauges looked like
…realtimeIngestionDelayMs.<tableNameWithType>.<pinotPartitionGroupId>
(table + numeric partition as a separate dot-separated segment).

After: the “table” segment is
<tableNameWithType>-<topic>-<streamPartition>
(same pattern as stream consumer metrics), so the full name includes topic and partition in that hyphenated block, and registration uses table gauges (no extra .partition segment the same way as before).
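
A small sketch of the before and after name shapes may help when auditing dashboards. It builds only the part of the name after the elided prefix; the class name, method names, and sample table and topic values are hypothetical.

```java
final class GaugeNameSketch {
  private GaugeNameSketch() {
  }

  /** Old shape: gauge.table.partition, with the Pinot partition as its own dot-separated segment. */
  static String oldName(String gaugeName, String tableNameWithType, int pinotPartitionGroupId) {
    return gaugeName + "." + tableNameWithType + "." + pinotPartitionGroupId;
  }

  /** New shape: gauge.table-topic-streamPartition, with topic and partition inside the table segment. */
  static String newName(String gaugeName, String tableNameWithType, String topic, int streamPartition) {
    return gaugeName + "." + tableNameWithType + "-" + topic + "-" + streamPartition;
  }

  public static void main(String[] args) {
    // Sample values are made up; the real names carry a prefix that is elided in the discussion.
    System.out.println(oldName("realtimeIngestionDelayMs", "orders_REALTIME", 3));
    System.out.println(newName("realtimeIngestionDelayMs", "orders_REALTIME", "ordersTopic", 3));
  }
}
```

Because the two strings differ even when the table has a single topic, any rule or query that matches the old dot-separated partition segment would need to be revisited.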

@Jackie-Jiang
Contributor

Will this change the emitted metric name? Even for single topic ingestion?

Yes. For single-topic tables as well, the underlying metric name string changes, not only for multi-topic.

Before: gauges looked like …realtimeIngestionDelayMs.<tableNameWithType>.<pinotPartitionGroupId> (table + numeric partition as a separate dot-separated segment).

After: the “table” segment is <tableNameWithType>-<topic>-<streamPartition> (same pattern as stream consumer metrics), so the full name includes topic and partition in that hyphenated block, and registration uses table gauges (no extra .partition segment the same way as before).

This is a big backward-incompatible change, which will break all monitoring systems, right?

Labels

ingestion: Related to data ingestion pipeline
metrics: Related to metrics emission and collection
real-time: Related to realtime table ingestion and serving

Development

Successfully merging this pull request may close these issues.

Ingestion delay metrics should output a tag for the topic name

5 participants