ETW receiver with control path and session#2930
Conversation
a67d25d to
ce46830
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2930 +/- ##
==========================================
- Coverage 86.03% 86.03% -0.01%
==========================================
Files 727 727
Lines 277228 277228
==========================================
- Hits 238526 238508 -18
- Misses 38178 38196 +18
Partials 524 524
🚀 New features to boost your workflow:
|
0f21920 to
dc6d279
Compare
There was a problem hiding this comment.
Pull request overview
Adds an initial Windows ETW receiver behind the etw-receiver feature, including session management via one_collect, receiver registration, metrics, and a sample ETW-to-console config.
Changes:
- Registers a Windows-only ETW receiver module and feature flag.
- Adds ETW receiver configuration, lifecycle/control handling, metrics, and singleton ETW session fan-out.
- Adds a sample
configs/etw-console.yaml.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
rust/otap-dataflow/crates/contrib-nodes/src/receivers/mod.rs |
Gates and exposes the ETW receiver module on Windows. |
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs |
Adds ETW session setup, provider resolution, event channel fan-out, and tests. |
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs |
Adds receiver config, factory registration, control loop, and metrics. |
rust/otap-dataflow/crates/contrib-nodes/Cargo.toml |
Adds target-specific one_collect dependency and ETW feature. |
rust/otap-dataflow/configs/etw-console.yaml |
Adds a sample ETW receiver to console exporter pipeline. |
rust/otap-dataflow/Cargo.toml |
Adds workspace one_collect dependency and top-level ETW feature. |
Comments suppressed due to low confidence (4)
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:189
- This process-global pool is not keyed by
session_nameor provider configuration, so a second ETW receiver node/pipeline with a different config will silently reuse the first session and consume one of its receivers (or fail with pool exhaustion). The singleton state needs to either be per session/config or reject mismatched subsequent subscriptions explicitly.
static SESSION: Mutex<Option<Vec<mpsc::Receiver<EtwEventData>>>> = Mutex::new(None);
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:279
- The captured event metadata is hard-coded to zero for the event descriptor fields, so the receiver reports every ETW event as ID/opcode/version/level/keywords 0. That makes the emitted telemetry/logging inaccurate and prevents downstream consumers from distinguishing event types.
// TODO: populate event_id/opcode/level/keywords/version
// once WindowsEventExtension exposes EVENT_DESCRIPTOR.
event_id: 0,
opcode: 0,
version: 0,
level: 0,
keywords: 0,
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/session.rs:300
parse_untilfailures are discarded, so permission errors, session-name conflicts, or other ETW startup/runtime failures leave the receiver initialized but with no actionable error for the user. The session thread should log/report this result and propagate startup failures synchronously where possible.
// `parse_until` blocks on `ProcessTrace`. We never signal stop,
// so the session runs until the process exits.
let _result = session.parse_until(&session_name, || false);
rust/otap-dataflow/crates/contrib-nodes/src/receivers/etw_receiver/mod.rs:309
- The receiver counts and logs ETW events but never sends any
OtapPdatato the effect handler, so an ETW-to-console pipeline receives no data despite this receiver being wired as a source. This needs to build and forward records (or the wiring/sample should not imply downstream output) before the receiver is functional.
// TODO: Convert event data to Arrow record batches
// and forward downstream via effect_handler.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…arrow into Add/etw_data_path
dc6d279 to
08e33ef
Compare
|
@open-telemetry/arrow-approvers The CI Check for |
| /// We use `Mutex<Option<Vec<…>>>` rather than `OnceLock` / `LazyLock` because: | ||
| /// - Initialization is fallible (GUID parsing, thread spawn). | ||
| /// - We need post-init mutation (`Vec::pop`). | ||
| static SESSION: Mutex<Option<Vec<mpsc::Receiver<EtwEventData>>>> = Mutex::new(None); |
There was a problem hiding this comment.
Question: should this singleton track the config it was initialized with? Since SESSION is process-global and the pool is consumed with pop(), a later ETW receiver or restart in the same process may not behave as expected. If full scoped/resettable state is follow-up work, a clear error for mismatched/additional subscriptions would still make this safer.
There was a problem hiding this comment.
The issue with restarts and hot-reload is being tracked in #3033. I think some of what you're saying would also be covered in it.
a clear error for mismatched/additional subscriptions would still make this safer.
Could you explain in more detail what you're looking for here?
There was a problem hiding this comment.
The issue with restarts and hot-reload is being tracked in https://github.com/open-telemetry/otel arrow/issues/3033. I think some of what you're saying would also be covered in it.
Thanks, #3033 covers the restart/shutdown lifecycle part.
Could you explain in more detail what you're looking for here?
What I meant here is only the case where the singleton is already running and another ETW receiver comes in with a different session name or provider list. Today it seems like that second receiver would just reuse the first session. I was wondering if we should detect that and return a clear error, instead of making it look like the second config was applied.
|
|
||
| // `parse_until` blocks on `ProcessTrace`. We never signal stop, | ||
| // so the session runs until the process exits. | ||
| let _result = session.parse_until(&session_name, || false); |
There was a problem hiding this comment.
Can we avoid dropping this result? If ETW fails to start or exits because of permissions/session/provider errors, the receiver currently looks initialized and later just sees a closed channel. At minimum this should log the error, and ideally startup failures should be surfaced before the receiver is considered ready.
|
|
||
| // Best-effort send; if this core's channel is full, | ||
| // drop the event for that core only. | ||
| let _ = txs[i % txs.len()].try_send(data); |
There was a problem hiding this comment.
This seems to drops the ETW event silently when the per-core channel is full. Since this is the main backpressure/loss path in the scaffold, can we record a drop counter or at least log it in a rate-limited way? Also the comment says “for that core only”, but the event is assigned to one core, so this drops it from the pipeline.
| # normal dependency graph and no longer needs to be pulled directly here. The | ||
| # pinned commit is from 2026-04-10. | ||
| one_collect = { git = "https://github.com/microsoft/one-collect.git", rev = "9292caacaddf9ff9e4fbdf77bc62b5ec25494c84", features = ["scripting"], optional = true } | ||
| one_collect = { workspace = true, optional = true } |
There was a problem hiding this comment.
nit - small cleanup
[target.'cfg(any(windows, target_os = "linux"))'.dependencies]
one_collect = { workspace = true, optional = true }| zip = "=8.6.0" | ||
| byte-unit = { version = "5.2.0", features = ["serde"] } | ||
| cpu-time = "1.0.0" | ||
| one_collect = { git = "https://github.com/microsoft/one-collect.git", rev = "cfe3f78" } |
There was a problem hiding this comment.
nit - Can we use the full commit SHA here - would keep this consistent and easier to audit later.
lalitb
left a comment
There was a problem hiding this comment.
LGTM. Since this is an iterative PR, most of my comments can be handled in follow-ups.
e922b73 to
d50c73e
Compare
|
Setting to merge on the hope that fixing #3041 is a fast follow-up to prevent confusing check failures for unrelated PRs. |
5b1d9be
# Change Summary ETW receiver initial PR - Control path hanges - ETW session to consume using one-collect git commit ## What issue does this PR close? * Closes #NNN ## How are these changes tested? Validation using cargo check, run etc. Local validation with ETW console yaml `cargo run --features etw-receiver -- -c configs/etw-console.yaml` ## Are there any user-facing changes? N/A --------- Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com> Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
Change Summary
ETW receiver initial PR
What issue does this PR close?
How are these changes tested?
Validation using cargo check, run etc.
Local validation with ETW console yaml
cargo run --features etw-receiver -- -c configs/etw-console.yamlAre there any user-facing changes?
N/A