fix(opentelemetry): shut exporter down in processor terminate (fixes #868 shutdown crash)#987
Open
dl-alexandre wants to merge 1 commit into
Conversation
When a span processor's `gen_statem` exits, its linked transport resources (e.g. `grpcbox_channel` started via `start_link` in the OTLP gRPC exporter) terminate concurrently with the `grpcbox` application itself. Their `terminate/3` calls into `gproc`, which may already be gone, producing the recurring crash reported in open-telemetry#868: :gen_statem {:n, :l, {:grpcbox_channel, …}} terminating ** (ArgumentError) errors were found at the given arguments: * 1st argument: the table identifier does not refer to an existing ETS table (stdlib) :ets.select(:gproc, ...) (gproc) gproc.erl:1464: :gproc.select/2 (grpcbox) grpcbox_channel.erl:141: :grpcbox_channel.terminate/3 `otel_exporter:shutdown/1` already exists and is invoked when the exporter is replaced at runtime (`set_exporter`), but neither processor calls it on `terminate/3`. Add the call so the exporter performs its synchronous cleanup — for OTLP gRPC, that's `grpcbox_channel:stop/1` — while the `grpcbox` application and its `gproc` ETS table are still alive, eliminating the race. Refs: open-telemetry#868
|
|
3f375de to
4c6ce0e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
otel_batch_processor:terminate/3andotel_simple_processor:terminate/3now callotel_exporter:shutdown(Exporter)before returning, so the exporter performs its synchronous cleanup while its transport's dependencies are still alive.Why
Issue #868 has two distinct failure modes tangled in the same thread; this PR addresses the second — the recurring
ArgumentError: the table identifier does not refer to an existing ETS tablereported by @bernardo-martinez (on collector restart) and @ckampfecars (on SIGTERM):```
:gen_statem {:n, :l, {:grpcbox_channel, …}} terminating
** (ArgumentError) errors were found at the given arguments:
(stdlib) :ets.select(:gproc, ...)
(gproc) gproc.erl:1464: :gproc.select/2
(grpcbox) grpcbox_channel.erl:141: :grpcbox_channel.terminate/3
```
Root cause. The OTLP gRPC exporter calls
grpcbox_channel:start_link(self(), …)inotel_exporter_otlp:init/1, linking the channel to the exporter (which lives inside a span processorgen_statem). On shutdown the processor'sterminate/3returned without callingotel_exporter:shutdown/1, so the linkedgrpcbox_channelonly received the link-EXIT signal and ran its ownterminate/3concurrently with thegrpcboxapplication's shutdown.grpcbox_channel:terminate/3calls intogproc, which may already be torn down — hence the crash on the:gprocETS table.otel_exporter:shutdown/1already exists and is the exporter's documented cleanup hook; it's wired up for runtime exporter swaps (set_exporter) but was missing from process termination.otel_exporter_traces_otlp:shutdown/1performs the synchronousgrpcbox_channel:stop/1, which lets the channel run itsgproccleanup while everything it depends on is still alive.@tsloughter — this matches the supervisor-ordering hypothesis in your 2025-07-09 comment, but lifts the fix one layer up (into the processor's terminate, where the exporter is already known) instead of restructuring the supervisor tree. Happy to follow whatever shape you prefer if you'd rather see this in
opentelemetry_exporteror via an extra supervisor child.What this does NOT fix
The original memory-leak report from @bernardo-martinez (collector becomes reachable again after a
:timeoutwindow → exporter keeps timing out for 40 min → 1 GB OOM) is a separate failure mode in the gRPC channel's state recovery and is not addressed here. That one needs deeper investigation — likely ingrpcboxitself — and a clean reproducer.Scope
apps/opentelemetry/src/otel_batch_processor.erl— callotel_exporter:shutdown(Exporter)after the final blocking export interminate/3. Comment cites Exporter: timeout leak #868.apps/opentelemetry/src/otel_simple_processor.erl—terminate/3previously discardedData; now destructures#data{exporter=Exporter}and shuts it down.Both calls use
_ =to swallow the return value, matching the existing_ = export(...)pattern.Verification
apps/opentelemetry/test/should continue to pass; CI here covers more OTP/Elixir matrix combinations than I can hit locally.Refs: #868