ETW receiver with control path and session #2930

Merged
drewrelmas merged 19 commits into open-telemetry:main from swashtek:Add/etw_data_path
May 20, 2026

Conversation

@swashtek
Contributor

@swashtek swashtek commented May 11, 2026

Change Summary

ETW receiver initial PR

  • Control path changes
  • ETW session that consumes events using one-collect (pinned to a git commit)

What issue does this PR close?

  • Closes #NNN

How are these changes tested?

Validation using cargo check, cargo run, etc.
Local validation with ETW console yaml

cargo run --features etw-receiver -- -c configs/etw-console.yaml

Are there any user-facing changes?

N/A

@swashtek swashtek requested a review from a team as a code owner May 11, 2026 20:25
@github-actions github-actions Bot added the rust Pull requests that update Rust code label May 11, 2026
Comment thread rust/otap-dataflow/Cargo.toml Outdated
@swashtek swashtek changed the title ETW receiver with control path and session [WIP] ETW receiver with control path and session May 11, 2026
Comment thread rust/otap-dataflow/configs/etw-console.yaml Outdated
Comment thread rust/otap-dataflow/configs/etw-console.yaml Outdated
Comment thread rust/otap-dataflow/crates/core-nodes/src/receivers/etw_receiver/session.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
@swashtek swashtek force-pushed the Add/etw_data_path branch from a67d25d to ce46830 Compare May 11, 2026 21:36
Comment thread rust/otap-dataflow/crates/core-nodes/src/receivers/mod.rs Outdated
@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.03%. Comparing base (fabbe70) to head (d50c73e).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2930      +/-   ##
==========================================
- Coverage   86.03%   86.03%   -0.01%     
==========================================
  Files         727      727              
  Lines      277228   277228              
==========================================
- Hits       238526   238508      -18     
- Misses      38178    38196      +18     
  Partials      524      524              
| Components | Coverage Δ |
|---|---|
| otap-dataflow | 87.17% <ø> (-0.01%) ⬇️ |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 89.57% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | 92.18% <ø> (ø) |

Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
Comment thread rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs Outdated
Contributor

@utpilla utpilla left a comment

Thanks for working on this @swashtek!

@swashtek swashtek changed the title [WIP] ETW receiver with control path and session ETW receiver with control path and session May 18, 2026
Copilot AI review requested due to automatic review settings May 18, 2026 23:50
@swashtek swashtek force-pushed the Add/etw_data_path branch from 0f21920 to dc6d279 Compare May 18, 2026 23:50
Contributor

Copilot AI left a comment

Pull request overview

Adds an initial Windows ETW receiver behind the etw-receiver feature, including session management via one_collect, receiver registration, metrics, and a sample ETW-to-console config.

Changes:

  • Registers a Windows-only ETW receiver module and feature flag.
  • Adds ETW receiver configuration, lifecycle/control handling, metrics, and singleton ETW session fan-out.
  • Adds a sample configs/etw-console.yaml.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| rust/otap-dataflow/crates/contrib-nodes/src/receivers/mod.rs | Gates and exposes the ETW receiver module on Windows. |
| rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs | Adds ETW session setup, provider resolution, event channel fan-out, and tests. |
| rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs | Adds receiver config, factory registration, control loop, and metrics. |
| rust/otap-dataflow/crates/contrib-nodes/Cargo.toml | Adds target-specific one_collect dependency and ETW feature. |
| rust/otap-dataflow/configs/etw-console.yaml | Adds a sample ETW receiver to console exporter pipeline. |
| rust/otap-dataflow/Cargo.toml | Adds workspace one_collect dependency and top-level ETW feature. |
Comments suppressed due to low confidence (4)

rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:189

  • This process-global pool is not keyed by session_name or provider configuration, so a second ETW receiver node/pipeline with a different config will silently reuse the first session and consume one of its receivers (or fail with pool exhaustion). The singleton state needs to either be per session/config or reject mismatched subsequent subscriptions explicitly.
static SESSION: Mutex<Option<Vec<mpsc::Receiver<EtwEventData>>>> = Mutex::new(None);

rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:279

  • The captured event metadata is hard-coded to zero for the event descriptor fields, so the receiver reports every ETW event as ID/opcode/version/level/keywords 0. That makes the emitted telemetry/logging inaccurate and prevents downstream consumers from distinguishing event types.
                        // TODO: populate event_id/opcode/level/keywords/version
                        // once WindowsEventExtension exposes EVENT_DESCRIPTOR.
                        event_id: 0,
                        opcode: 0,
                        version: 0,
                        level: 0,
                        keywords: 0,

rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:300

  • parse_until failures are discarded, so permission errors, session-name conflicts, or other ETW startup/runtime failures leave the receiver initialized but with no actionable error for the user. The session thread should log/report this result and propagate startup failures synchronously where possible.
            // `parse_until` blocks on `ProcessTrace`.  We never signal stop,
            // so the session runs until the process exits.
            let _result = session.parse_until(&session_name, || false);

rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs:309

  • The receiver counts and logs ETW events but never sends any OtapPdata to the effect handler, so an ETW-to-console pipeline receives no data despite this receiver being wired as a source. This needs to build and forward records (or the wiring/sample should not imply downstream output) before the receiver is functional.
                            // TODO: Convert event data to Arrow record batches
                            // and forward downstream via effect_handler.

Comment thread rust/otap-dataflow/crates/contrib-nodes/Cargo.toml
Comment thread rust/otap-dataflow/configs/etw-console.yaml Outdated
@swashtek swashtek force-pushed the Add/etw_data_path branch from dc6d279 to 08e33ef Compare May 18, 2026 23:56
@utpilla
Contributor

utpilla commented May 19, 2026

@open-telemetry/arrow-approvers The CI check for validate-configs is failing for etw-console.yaml because the ETW receiver is a Windows-only component and would not work on Linux runners. I have created issue #3041 to track the changes needed in the CI workflow to accommodate this. We could address them in a follow-up PR.

/// We use `Mutex<Option<Vec<…>>>` rather than `OnceLock` / `LazyLock` because:
/// - Initialization is fallible (GUID parsing, thread spawn).
/// - We need post-init mutation (`Vec::pop`).
static SESSION: Mutex<Option<Vec<mpsc::Receiver<EtwEventData>>>> = Mutex::new(None);
Member

Question: should this singleton track the config it was initialized with? Since SESSION is process-global and the pool is consumed with pop(), a later ETW receiver or restart in the same process may not behave as expected. If full scoped/resettable state is follow-up work, a clear error for mismatched/additional subscriptions would still make this safer.

Contributor

The issue with restarts and hot-reload is being tracked in #3033. I think some of what you're saying would also be covered in it.

a clear error for mismatched/additional subscriptions would still make this safer.

Could you explain in more detail what you're looking for here?

Member

The issue with restarts and hot-reload is being tracked in https://github.com/open-telemetry/otel-arrow/issues/3033. I think some of what you're saying would also be covered in it.

Thanks, #3033 covers the restart/shutdown lifecycle part.

Could you explain in more detail what you're looking for here?

What I meant here is only the case where the singleton is already running and another ETW receiver comes in with a different session name or provider list. Today it seems like that second receiver would just reuse the first session. I was wondering if we should detect that and return a clear error, instead of making it look like the second config was applied.
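
A minimal sketch of the check described above, not the PR's actual code. It assumes the singleton can remember the config it was started with; `SessionKey`, `SubscribeError`, and `subscribe` are hypothetical names, `EtwEventData` is a stand-in for the real payload type, and the real ETW session startup is elided.

```rust
use std::sync::mpsc::{self, Receiver};
use std::sync::Mutex;

/// Hypothetical key: the config the process-global session was started with.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SessionKey {
    pub session_name: String,
    pub providers: Vec<String>,
}

/// Stand-in for the receiver's event payload type.
pub struct EtwEventData;

/// Hypothetical error type for rejected subscriptions.
#[derive(Debug, PartialEq, Eq)]
pub enum SubscribeError {
    /// A session is already running with a different config.
    ConfigMismatch { active: SessionKey, requested: SessionKey },
    /// Every pre-created receiver has already been handed out.
    PoolExhausted,
}

struct SessionState {
    key: SessionKey,
    pool: Vec<Receiver<EtwEventData>>,
}

static SESSION: Mutex<Option<SessionState>> = Mutex::new(None);

/// Hands out one receiver from the pool, or fails loudly when the
/// requested config differs from the one the session was started with.
pub fn subscribe(
    key: SessionKey,
    num_cores: usize,
) -> Result<Receiver<EtwEventData>, SubscribeError> {
    let mut guard = SESSION.lock().unwrap_or_else(|e| e.into_inner());
    let state = guard.get_or_insert_with(|| {
        // Real code would start the ETW session here and keep the senders.
        // This sketch drops them, so the receivers are immediately closed.
        let pool = (0..num_cores)
            .map(|_| mpsc::sync_channel::<EtwEventData>(1024).1)
            .collect();
        SessionState { key: key.clone(), pool }
    });
    if state.key != key {
        return Err(SubscribeError::ConfigMismatch {
            active: state.key.clone(),
            requested: key,
        });
    }
    state.pool.pop().ok_or(SubscribeError::PoolExhausted)
}
```

With this shape, a second receiver that asks for a different session name or provider list gets `ConfigMismatch` instead of silently sharing the first session, and pool exhaustion is reported as its own error.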


// `parse_until` blocks on `ProcessTrace`. We never signal stop,
// so the session runs until the process exits.
let _result = session.parse_until(&session_name, || false);
Member

Can we avoid dropping this result? If ETW fails to start or exits because of permissions/session/provider errors, the receiver currently looks initialized and later just sees a closed channel. At minimum this should log the error, and ideally startup failures should be surfaced before the receiver is considered ready.

Contributor

@utpilla utpilla May 19, 2026

@lalitb A lot of this (the APIs being called) might be changed when TDH decoding support is added to one-collect.

@swashtek For now, you could add a TODO here to log the result so that we remember to do it if we continue to keep this code in future.
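
For reference, a rough sketch of one way to surface a startup failure before the receiver is considered ready. It assumes session startup can be separated from the blocking parse loop, which may not hold for the one-collect API used here; `spawn_session`, `start`, and `run` are hypothetical names, and errors are plain strings for brevity.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns the session thread and blocks until it reports its startup result.
/// `start` stands in for session setup; `run` stands in for the blocking
/// parse loop. Both are hypothetical placeholders, not one-collect calls.
pub fn spawn_session<S, R>(start: S, run: R) -> Result<thread::JoinHandle<()>, String>
where
    S: FnOnce() -> Result<(), String> + Send + 'static,
    R: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (ready_tx, ready_rx) = mpsc::channel::<Result<(), String>>();
    let handle = thread::Builder::new()
        .name("etw-session".to_string())
        .spawn(move || {
            let started = start();
            let failed = started.is_err();
            // Report the startup outcome to the caller exactly once.
            let _ = ready_tx.send(started);
            if failed {
                return;
            }
            // Log a runtime failure instead of discarding the result.
            if let Err(err) = run() {
                eprintln!("ETW session exited with error: {err}");
            }
        })
        .map_err(|e| format!("failed to spawn session thread: {e}"))?;

    match ready_rx.recv() {
        Ok(Ok(())) => Ok(handle),
        Ok(Err(err)) => Err(err),
        Err(_) => Err("session thread exited before reporting startup".to_string()),
    }
}
```

The caller only treats the receiver as initialized when `spawn_session` returns `Ok`, so a permissions or session-name error shows up at startup rather than later as a closed channel.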


// Best-effort send; if this core's channel is full,
// drop the event for that core only.
let _ = txs[i % txs.len()].try_send(data);
Member

This seems to drop the ETW event silently when the per-core channel is full. Since this is the main backpressure/loss path in the scaffold, can we record a drop counter or at least log it in a rate-limited way? Also the comment says “for that core only”, but the event is assigned to one core, so this drops it from the pipeline.
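
A minimal sketch of the drop counter suggested above, assuming the fan-out uses bounded `std::sync::mpsc` channels; the PR's actual channel type and metrics plumbing may differ, and `DropStats` and `dispatch` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};

/// Counts events lost because the target channel was full or closed.
#[derive(Default)]
pub struct DropStats {
    pub full: AtomicU64,
    pub disconnected: AtomicU64,
}

/// Round-robins event number `seq` onto one channel.
/// Returns true if the event was queued, false if it was dropped.
pub fn dispatch<T>(txs: &[SyncSender<T>], seq: usize, data: T, stats: &DropStats) -> bool {
    if txs.is_empty() {
        // No subscribers at all: treat the event as lost.
        stats.disconnected.fetch_add(1, Ordering::Relaxed);
        return false;
    }
    match txs[seq % txs.len()].try_send(data) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            // The event belongs to exactly one channel, so a full
            // channel means it is lost for the whole pipeline.
            stats.full.fetch_add(1, Ordering::Relaxed);
            false
        }
        Err(TrySendError::Disconnected(_)) => {
            stats.disconnected.fetch_add(1, Ordering::Relaxed);
            false
        }
    }
}
```

The counters could then be exported through the receiver's existing metrics, or read periodically to drive a rate-limited log line.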

# normal dependency graph and no longer needs to be pulled directly here. The
# pinned commit is from 2026-04-10.
one_collect = { git = "https://github.com/microsoft/one-collect.git", rev = "9292caacaddf9ff9e4fbdf77bc62b5ec25494c84", features = ["scripting"], optional = true }
one_collect = { workspace = true, optional = true }
Member

nit - small cleanup

  [target.'cfg(any(windows, target_os = "linux"))'.dependencies]
  one_collect = { workspace = true, optional = true }

zip = "=8.6.0"
byte-unit = { version = "5.2.0", features = ["serde"] }
cpu-time = "1.0.0"
one_collect = { git = "https://github.com/microsoft/one-collect.git", rev = "cfe3f78" }
Member

nit - Can we use the full commit SHA here? It would keep this consistent and easier to audit later.

Member

@lalitb lalitb left a comment

LGTM. Since this is an iterative PR, most of my comments can be handled in follow-ups.

@swashtek swashtek force-pushed the Add/etw_data_path branch from e922b73 to d50c73e Compare May 20, 2026 18:12
@drewrelmas
Contributor

Setting to merge on the hope that fixing #3041 is a fast follow-up to prevent confusing check failures for unrelated PRs.

@drewrelmas drewrelmas added this pull request to the merge queue May 20, 2026
Merged via the queue into open-telemetry:main with commit 5b1d9be May 20, 2026
86 of 87 checks passed
guancioul pushed a commit to guancioul/otel-arrow that referenced this pull request May 21, 2026
# Change Summary

ETW receiver initial PR

- Control path changes
- ETW session that consumes events using one-collect (pinned to a git commit)

## What issue does this PR close?
* Closes #NNN

## How are these changes tested?
Validation using cargo check, cargo run, etc.
Local validation with ETW console yaml

`cargo run --features etw-receiver -- -c configs/etw-console.yaml`

## Are there any user-facing changes?
N/A

---------

Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>

Labels

rust Pull requests that update Rust code

Projects

Status: Done

7 participants