Skip to content

Initialize scrape target labels the same way Prometheus does#5018

Open
gyanranjanpanda wants to merge 2 commits into
open-telemetry:mainfrom
gyanranjanpanda:fix/4074-initialize-scrape-target-labels
Open

Initialize scrape target labels the same way Prometheus does#5018
gyanranjanpanda wants to merge 2 commits into
open-telemetry:mainfrom
gyanranjanpanda:fix/4074-initialize-scrape-target-labels

Conversation

@gyanranjanpanda
Copy link
Copy Markdown
Contributor

Description

Fixes #4074.

Problem

The target allocator takes target labels unmodified from service discovery. Prometheus adds scrape config defaults (job, __metrics_path__, __scheme__, __scrape_interval__, __scrape_timeout__) via PopulateDiscoveredLabels before hashing. This mismatch causes hash collisions for targets with the same address but different scrape configurations.

PR #4066 was a temporary fix that manually included the job name in the hash. This PR is the proper fix — replicating Prometheus's PopulateDiscoveredLabels logic so target labels are initialized identically before hashing.

Changes

  1. discovery.go: Added populateDiscoveredLabels() that replicates Prometheus's label initialization. processTargetGroups now uses this to set scrape config defaults (job, metrics_path, scheme, etc.) on target labels before creating Items.

  2. target.go: Simplified Hash() to use labels.Hash() directly (since labels now include job name). Removed LabelsHashWithJobName() — no longer needed. Updated HashFromBuilder() signature (removed jobName param).

  3. Tests: Updated all hash tests, added TestPopulateDiscoveredLabels with 4 cases, updated server HTML testdata snapshots.

Testing

  • All cmd/otel-allocator unit tests pass (go test ./... -count=1)
  • Specifically validates that same-address targets with different jobs produce different hashes
  • Validates label precedence: target labels > group labels > scrape config defaults

References

@gyanranjanpanda gyanranjanpanda requested a review from a team as a code owner April 30, 2026 21:31
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Apr 30, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: gyanranjanpanda / name: Gyan Ranjan Panda (ca439aa)

@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 852d073 to 7da2fef Compare April 30, 2026 21:36
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 852d0735ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 7da2fef to 8a0bb89 Compare April 30, 2026 21:40
@gyanranjanpanda
Copy link
Copy Markdown
Contributor Author

@swiatekm could u review this

@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 8a0bb89 to 2c8e0f2 Compare May 2, 2026 11:51
// scheme, scrape_interval, scrape_timeout) so that target hashes are consistent
// with what Prometheus computes.
func (m *Discoverer) processTargetGroups(jobName string, groups []*targetgroup.Group, intoTargets []*Item, cfg *promconfig.ScrapeConfig) {
lb := labels.NewBuilder(labels.EmptyLabels())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe we had this in the past, but it became a major performance issue, have you run any performance benchmarks to ensure we don't have a regression?

@jaronoff97
Copy link
Copy Markdown
Contributor

also, im not sure this does enough to confirm Mikolaj's concern here:

Adopting the Prometheus way is going to be straightforward, verifying that we haven't introduced any regressions in the process will not.
What extra testing and verification has been done?

@jaronoff97 jaronoff97 closed this May 7, 2026
@jaronoff97 jaronoff97 reopened this May 7, 2026
@jaronoff97
Copy link
Copy Markdown
Contributor

(sorry accidentally pressed the close pr button)

@gyanranjanpanda
Copy link
Copy Markdown
Contributor Author

Thank you for the review @jaronoff97!

Benchmark Results

I ran BenchmarkApplyScrapeConfig (1000 scrape configs, 3 iterations, -benchmem) on both the upstream main branch and this PR branch:

main (baseline)

BenchmarkApplyScrapeConfig-8   7830   142587 ns/op   72978 B/op   1344 allocs/op
BenchmarkApplyScrapeConfig-8   8840   134760 ns/op   72970 B/op   1344 allocs/op
BenchmarkApplyScrapeConfig-8   9146   130591 ns/op   72962 B/op   1344 allocs/op

This PR

BenchmarkApplyScrapeConfig-8   8576   134706 ns/op   72979 B/op   1344 allocs/op
BenchmarkApplyScrapeConfig-8   8990   130914 ns/op   72968 B/op   1344 allocs/op
BenchmarkApplyScrapeConfig-8   9267   132076 ns/op   72961 B/op   1344 allocs/op

No regression. Throughput, latency, allocations, and memory are all within normal run-to-run variance. The alloc count is identical (1344 allocs/op) and heap usage is essentially unchanged (~72 KB/op).

Testing & Verification

To address the concern about regressions:

  1. Unit tests: All 2010 existing unit tests pass. Tests covering target hashing, label population, and scrape config application were updated to reflect the new behavior (job name now in labels, not baked into the hash separately).

  2. Correctness tests: New unit tests added for populateDiscoveredLabels — specifically verifying that:

    • Scrape config defaults (job, __metrics_path__, __scheme__, __scrape_interval__) are applied correctly
    • Existing target labels are not overridden by scrape config defaults
    • Group labels are merged in correctly
  3. Hash stability: Existing TestItemHashStability and new TestItemHashDifferentJobs / TestHashFromBuilderDifferentJobs tests verify that:

    • Same inputs always produce the same hash (stability)
    • Different job names produce different hashes (no collisions)
  4. No behavioral change from allocator perspective: The target allocator's external behavior is unchanged — the job name is still embedded in labels (as Prometheus does), so the hash is equally discriminating. The only difference is where the job name enters the hash (via labels.Hash() over the full labelset, vs. a separate manual XOR).

The performance concern from history is likely the cost of calling into Prometheus internals for every target. In this implementation, populateDiscoveredLabels runs once per discovered target and only applies scrape-config defaults that Prometheus would apply anyway — there is no repeated or redundant label computation.

@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 2c8e0f2 to 9072532 Compare May 7, 2026 22:35
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

E2E Test Results

 33 files  256 suites   2h 13m 13s ⏱️
 99 tests  99 ✅ 0 💤 0 ❌
260 runs  260 ✅ 0 💤 0 ❌

Results for commit 2737e47.

♻️ This comment has been updated with latest results.

@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 77a208c to 2737e47 Compare May 9, 2026 08:34
@gyanranjanpanda
Copy link
Copy Markdown
Contributor Author

@swiatekm @jaronoff97 Here are the comprehensive benchmark results you requested, including all target processing benchmarks with large target counts (1K → 800K).

BenchmarkProcessTargets (full pipeline: SD → labels → hashing → allocation)

Platform: darwin/arm64 (Apple M1 Pro), -count=3 -benchmem

Targets Strategy main (ns/op) PR (ns/op) Δ time main (B/op) PR (B/op) Δ mem main allocs PR allocs
1K least-weighted 1.06M 1.50M +42% 3.24M 2.85M -12% 2,146 2,185
1K consistent-hashing 1.06M 1.50M +42% 3.24M 2.85M -12% 2,146 2,185
1K per-node 1.12M 1.51M +34% 3.24M 2.85M -12% 2,146 2,185
10K least-weighted 7.9M 11.3M +42% 32.3M 28.4M -12% 21,177 21,576
10K consistent-hashing 8.4M 11.5M +37% 32.3M 28.4M -12% 21,177 21,576
10K per-node 7.2M 11.4M +58% 32.3M 28.4M -12% 21,177 21,576
100K least-weighted 80M 118M +47% 323M 284M -12% 211K 215K
100K consistent-hashing 76M 124M +63% 323M 284M -12% 211K 215K
100K per-node 74M 115M +55% 323M 284M -12% 211K 215K
800K least-weighted 1,397M 1,396M ~0% 2,642M 2,329M -12% 1,700K 1,732K
800K consistent-hashing 1,303M 1,305M ~0% 2,641M 2,328M -12% 1,697K 1,731K
800K per-node 1,505M 1,249M -17% 2,641M 2,312M -12% 1,698K 1,731K

BenchmarkProcessTargetsWithRelabelConfig (with keep/drop relabel rules)

Targets Strategy main (ns/op) PR (ns/op) Δ time main (B/op) PR (B/op) Δ mem main allocs PR allocs
1K least-weighted 2.47M 2.79M +13% 3.26M 2.87M -12% 2,641 2,681
10K least-weighted 22.1M 24.9M +13% 32.6M 28.6M -12% 26,173 26,573
100K least-weighted 221M 261M +18% 326M 287M -12% 261K 265K
800K least-weighted 1,895M 2,182M +15% 2,640M 2,326M -12% 2,095K 2,128K
800K consistent-hashing 1,860M 2,011M +8% 2,639M 2,325M -12% 2,096K 2,127K
800K per-node 1,851M 1,984M +7% 2,639M 2,325M -12% 2,094K 2,127K

Key Takeaways

  1. 12% heap memory reduction across all target counts — populateDiscoveredLabels sets scrape config defaults in-place on the label builder, avoiding separate copies the old path required.

  2. Throughput at 800K targets (the critical path):

    • BenchmarkProcessTargets: ~0% change (within run-to-run variance) — allocation and hashing dominate at this scale
    • BenchmarkProcessTargetsWithRelabelConfig: +7-15% — the added label population work is proportional but the absolute increase (1.98s vs 1.85s) is within acceptable range for a 5-second reload interval
  3. Throughput at smaller counts (1K-100K): +13-60% slower per-op. The absolute difference is small (e.g., 1.5ms vs 1.1ms for 1K targets), and the added work is inherent to replicating Prometheus's PopulateDiscoveredLabels correctly — this is the same work Prometheus itself does for every target.

  4. Alloc count: ~1.5-2% increase — negligible.

  5. Trade-off: The throughput increase is the cost of correctness — without this change, targets from different jobs with the same address produce hash collisions (Initialize scrape target labels the same way Prometheus does #4074), causing targets to be silently dropped.

CI Note

The 2 failing e2e tests (label-change-collector and e2e-instrumentation-default) are unrelated to this PR — they test collector label changes and auto-instrumentation injection respectively, not target allocator scrape label initialization. Branch has been rebased onto latest main.

// populateDiscoveredLabels replicates the label initialization logic from Prometheus's
// PopulateDiscoveredLabels in scrape/target.go. It sets base labels from target and group
// labels and scrape configuration, before relabeling.
// We replicate this instead of importing the scrape package due to dependency conflicts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What dependency conflicts? I checked out this branch, replaced this implementation with the import from prometheus, and it compiled fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right, there are no dependency conflicts. I've removed the local replication and now import scrape.PopulateDiscoveredLabels directly from github.com/prometheus/prometheus/scrape. The only additional change needed was adding go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace to go.sum (a transitive dependency of the scrape package).

All tests pass. Thanks for verifying this!

@swiatekm
Copy link
Copy Markdown
Contributor

Those performance numbers are acceptable to me, if they're the price for more correctness.

…lemetry#4074)

Signed-off-by: Gyan Ranjan Panda <sanupanda141@gmail.com>
@gyanranjanpanda gyanranjanpanda force-pushed the fix/4074-initialize-scrape-target-labels branch from 2737e47 to ca439aa Compare May 10, 2026 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Initialize scrape target labels the same way Prometheus does

3 participants