Skip to content

fix(testutil): replace fixed sleep in WaitForSpanFlush with span polling#463

Open
mohanakatari119-bit wants to merge 6 commits into
open-telemetry:mainfrom
mohanakatari119-bit:fix/flaky-span-flush-wait
Open

fix(testutil): replace fixed sleep in WaitForSpanFlush with span polling#463
mohanakatari119-bit wants to merge 6 commits into
open-telemetry:mainfrom
mohanakatari119-bit:fix/flaky-span-flush-wait

Conversation

@mohanakatari119-bit
Copy link
Copy Markdown

@mohanakatari119-bit mohanakatari119-bit commented May 5, 2026

The WaitForSpanFlush helper used a hardcoded 200ms sleep to let spans arrive at the in-process collector before test assertions ran. On a loaded CI runner 200ms is not enough, which causes spurious failures where the collector holds zero spans at assertion time. On fast machines it is just dead time.

WaitForSpanFlush is replaced by WaitForSpans(t, c, minSpans) which polls collector.GetTraces() every 25ms and returns as soon as the expected number of spans are present, with a 3s deadline. A convenience method (f *TestFixture) WaitForSpans(minSpans int) is added so tests that already hold a fixture do not need to reach into the collector directly.

All four callers are migrated: grpc_server_test.go, grpc_shutdown_test.go, http_server_test.go each wait for 1 span; e2e/grpc_test.go waits for 4 spans (2 traces x 2 spans per trace, matching its existing RequireTraceCount(2) and RequireSpansPerTrace(2) assertions).

Two unit tests are added in readiness_test.go: one confirms the function returns immediately when spans are already present, and one confirms it successfully polls until a span arrives after a short delay.

Tests passing
image

Signed-off-by: mohanakatari119-bit <mohana.katari119@gmail.com>
@mohanakatari119-bit mohanakatari119-bit requested a review from a team as a code owner May 5, 2026 17:13
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 5, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Copilot / name: Copilot Autofix powered by AI (2176e6a)
  • ✅ login: mohanakatari119-bit / name: Mohana Katari (2176e6a)
  • ✅ login: mohanakatari119-bit / name: mohanakatari119-bit (87d4645, a446480, f5a529a)

@github-actions github-actions Bot added the scope:fix A bug that is being fixed label May 5, 2026
@y1yang0 y1yang0 requested a review from Copilot May 6, 2026 08:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves test reliability by replacing a fixed sleep-based “span flush” helper with a polling-based helper that waits until the in-process collector has received a minimum number of spans (or times out), and updates existing tests to use it.

Changes:

  • Replace WaitForSpanFlush (fixed 200ms sleep) with WaitForSpans(t, collector, minSpans) that polls for spans with a timeout.
  • Add (*TestFixture).WaitForSpans(minSpans int) convenience wrapper.
  • Migrate integration/e2e tests to wait for an explicit minimum span count and add unit tests for the new helper.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/testutil/readiness.go Replaces fixed sleep span flush helper with polling-based WaitForSpans.
test/testutil/readiness_test.go Adds unit tests covering immediate-return and delayed-arrival polling behavior.
test/testutil/fixture.go Adds a fixture method wrapper for WaitForSpans.
test/integration/http_server_test.go Uses fixture span polling instead of fixed sleep.
test/integration/grpc_shutdown_test.go Uses fixture span polling instead of fixed sleep.
test/integration/grpc_server_test.go Uses fixture span polling instead of fixed sleep.
test/e2e/grpc_test.go Uses fixture span polling with a higher minimum span count.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/testutil/readiness.go Outdated
Comment thread test/testutil/readiness_test.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Member

@NameHaibinZhang NameHaibinZhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! The polling approach is a clear improvement over the fixed sleep — it's both faster and more reliable. A few comments below:

Critical: Race condition in polling loop

WaitForSpans calls AllSpans(c.GetTraces()). GetTraces() releases the mutex before returning, but AllSpans then iterates the underlying trace data without holding the lock. Meanwhile, the collector's HTTP handler goroutine may concurrently append new spans via MoveAndAppendTo.

While this is a pre-existing pattern in the codebase (e.g., RequireTraceCount, RequireSingleSpan), the old code read traces only once after a 200ms sleep — by which time spans had typically all arrived. The new tight polling loop (every 25ms) reads repeatedly while spans are still arriving, significantly widening the race window.

Since this PR's goal is to improve test reliability, introducing a data race risk works against that goal. A simple fix would be to add a SpanCount() method on Collector that counts spans under the lock:

func (c *Collector) SpanCount() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return countSpans(c.traces)
}

Then WaitForSpans only checks the count rather than iterating unlocked trace data.

Suggestion: Missing timeout path test

The tests cover "spans already present" and "spans arrive after delay", but there's no test for the timeout/failure path. Consider adding one (possibly as a follow-up if it requires making the timeout configurable).

Suggestion: Test couples to Collector internals

TestWaitForSpans_pollsUntilSpansArrive directly manipulates c.mu and c.traces. This creates coupling to internal structure. Consider using the HTTP interface to simulate span arrival for a more realistic test. (Acceptable for a unit test though — just a suggestion for robustness.)

Nit: Magic number in e2e test

f.WaitForSpans(4) — consider adding a comment explaining where 4 comes from, e.g. // 2 traces (unary + stream) × 2 spans (client + server).

Add SpanCount() method on Collector that counts spans under the mutex,
replacing the unlocked AllSpans(GetTraces()) calls in the polling loop
that widened the race window during repeated 25ms polls. Also clarify
the magic span count in the e2e test.
Extract pollForSpans helper with a configurable timeout so the failure
path can be tested without a 3-second wait. Add TestWaitForSpans_timesOut
to cover the timeout branch. Rewrite the existing polling test to use
the collector's HTTP interface via sendSpan, removing direct access to
c.mu and c.traces.
@mohanakatari119-bit mohanakatari119-bit force-pushed the fix/flaky-span-flush-wait branch from 1cd0cbd to f5a529a Compare May 11, 2026 19:04
Copy link
Copy Markdown
Contributor

@txabman42 txabman42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! TY for this improvement ❤️

Comment thread test/testutil/readiness.go Outdated
Rename for naming consistency with defaultSpanPollInterval, as suggested
by txabman42 in code review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scope:fix A bug that is being fixed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants