fix(opentelemetry): shut exporter down in processor terminate (fixes #868 shutdown crash) #987

Open
dl-alexandre wants to merge 1 commit into open-telemetry:main from dl-alexandre:fix/processor-terminate-exporter-shutdown

@dl-alexandre

What

otel_batch_processor:terminate/3 and otel_simple_processor:terminate/3 now call otel_exporter:shutdown(Exporter) before returning, so the exporter performs its synchronous cleanup while its transport's dependencies are still alive.

Why

Issue #868 has two distinct failure modes tangled in the same thread; this PR addresses the second — the recurring ArgumentError: the table identifier does not refer to an existing ETS table reported by @bernardo-martinez (on collector restart) and @ckampfecars (on SIGTERM):

```
:gen_statem {:n, :l, {:grpcbox_channel, …}} terminating
** (ArgumentError) errors were found at the given arguments:

  • 1st argument: the table identifier does not refer to an existing ETS table
    (stdlib) :ets.select(:gproc, ...)
    (gproc) gproc.erl:1464: :gproc.select/2
    (grpcbox) grpcbox_channel.erl:141: :grpcbox_channel.terminate/3
```

Root cause. The OTLP gRPC exporter calls grpcbox_channel:start_link(self(), …) in otel_exporter_otlp:init/1, linking the channel to the exporter (which lives inside a span processor gen_statem). On shutdown the processor's terminate/3 returned without calling otel_exporter:shutdown/1, so the linked grpcbox_channel only received the link-EXIT signal and ran its own terminate/3 concurrently with the grpcbox application's shutdown. grpcbox_channel:terminate/3 calls into gproc, which may already be torn down — hence the crash on the :gproc ETS table.

otel_exporter:shutdown/1 already exists and is the exporter's documented cleanup hook; it's wired up for runtime exporter swaps (set_exporter) but was missing from process termination. otel_exporter_traces_otlp:shutdown/1 performs the synchronous grpcbox_channel:stop/1, which lets the channel run its gproc cleanup while everything it depends on is still alive.

@tsloughter — this matches the supervisor-ordering hypothesis in your 2025-07-09 comment, but lifts the fix one layer up (into the processor's terminate, where the exporter is already known) instead of restructuring the supervisor tree. Happy to follow whatever shape you prefer if you'd rather see this in opentelemetry_exporter or via an extra supervisor child.

What this does NOT fix

The original memory-leak report from @bernardo-martinez (collector becomes reachable again after a :timeout window → exporter keeps timing out for 40 min → 1 GB OOM) is a separate failure mode in the gRPC channel's state recovery and is not addressed here. That one needs deeper investigation — likely in grpcbox itself — and a clean reproducer.

Scope

  • apps/opentelemetry/src/otel_batch_processor.erl: terminate/3 now calls otel_exporter:shutdown(Exporter) after the final blocking export. The accompanying comment cites #868 (Exporter: timeout leak).
  • apps/opentelemetry/src/otel_simple_processor.erl: terminate/3 previously discarded Data; it now destructures #data{exporter=Exporter} and shuts the exporter down.

Both calls use _ = to swallow the return value, matching the existing _ = export(...) pattern.
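
For readers who want to see the shape of the change in isolation, here is a minimal, self-contained sketch of the pattern. It is not the code from this PR: the module name `terminate_shutdown_sketch`, the `{notify, Pid}` exporter shape, and `shutdown_exporter/1` are hypothetical stand-ins for the real processor and for `otel_exporter:shutdown/1`. Only the `#data{exporter = Exporter}` destructuring and the `_ =` discard mirror what is described above.

```erlang
%% Illustrative sketch only: NOT the code from this PR. The module name,
%% the {notify, Pid} exporter shape, and shutdown_exporter/1 are
%% hypothetical stand-ins for the processor and otel_exporter:shutdown/1.
-module(terminate_shutdown_sketch).
-behaviour(gen_statem).

-export([start_link/1, run/0]).
-export([init/1, callback_mode/0, handle_event/4, terminate/3]).

-record(data, {exporter}).

start_link(Exporter) ->
    gen_statem:start_link(?MODULE, Exporter, []).

callback_mode() ->
    handle_event_function.

init(Exporter) ->
    {ok, idle, #data{exporter = Exporter}}.

handle_event(_EventType, _Event, _State, _Data) ->
    keep_state_and_data.

%% Destructure the exporter out of the state data and shut it down
%% synchronously before terminate/3 returns. The return value is
%% deliberately discarded with `_ =`.
terminate(_Reason, _State, #data{exporter = Exporter}) ->
    _ = shutdown_exporter(Exporter),
    ok.

%% Hypothetical stand-in for the exporter's cleanup hook: it only tells
%% the observer that cleanup ran, and from which process.
shutdown_exporter({notify, Observer}) ->
    Observer ! {exporter_shutdown, self()},
    ok.

%% Start the process, stop it, and check that the cleanup message was
%% sent from inside terminate/3 before gen_statem:stop/1 returned.
run() ->
    {ok, Pid} = start_link({notify, self()}),
    ok = gen_statem:stop(Pid),
    receive
        {exporter_shutdown, Pid} -> ok
    after 1000 ->
        not_called
    end.
```

Calling `terminate_shutdown_sketch:run()` from a shell exercises the stop path. The point of the sketch is the ordering: the cleanup runs inside `terminate/3`, so it completes before the process exits rather than being left to link-EXIT propagation.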

Verification

  • The fix is minimal (~14 lines) and behavior-preserving for all non-shutdown paths.
  • Existing test suites under apps/opentelemetry/test/ should continue to pass; CI here covers more OTP/Elixir matrix combinations than I can hit locally.
  • A regression test that reliably reproduces the crash needs a fake gRPC collector and process-shutdown plumbing; happy to add one in a follow-up if you'd like guidance on the preferred style.

Refs: #868

When a span processor's `gen_statem` exits, its linked transport
resources (e.g. `grpcbox_channel` started via `start_link` in the
OTLP gRPC exporter) terminate concurrently with the `grpcbox`
application itself. Their `terminate/3` calls into `gproc`, which
may already be gone, producing the recurring crash reported in open-telemetry#868:

    :gen_statem {:n, :l, {:grpcbox_channel, …}} terminating
    ** (ArgumentError) errors were found at the given arguments:
      * 1st argument: the table identifier does not refer to an
                      existing ETS table
        (stdlib) :ets.select(:gproc, ...)
        (gproc) gproc.erl:1464: :gproc.select/2
        (grpcbox) grpcbox_channel.erl:141: :grpcbox_channel.terminate/3

`otel_exporter:shutdown/1` already exists and is invoked when the
exporter is replaced at runtime (`set_exporter`), but neither
processor calls it on `terminate/3`. Add the call so the exporter
performs its synchronous cleanup — for OTLP gRPC, that's
`grpcbox_channel:stop/1` — while the `grpcbox` application and its
`gproc` ETS table are still alive, eliminating the race.

Refs: open-telemetry#868
dl-alexandre requested a review from a team as a code owner on May 15, 2026, 06:08

linux-foundation-easycla Bot commented May 15, 2026

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: dl-alexandre / name: dl-alexandre (4c6ce0e)

dl-alexandre force-pushed the fix/processor-terminate-exporter-shutdown branch from 3f375de to 4c6ce0e on May 15, 2026, 07:21