fix(dataloader): switch to forkserver context + add timeout + explicit teardown (closes #140, partial #145)#148
fix(dataloader): switch to forkserver context + add timeout + explicit teardown (closes #140, partial #145)#148mihow wants to merge 2 commits into
Conversation
…t teardown Closes #140, partial #145. ## What this fixes (plain language) The ML worker that pulls jobs off the Antenna queue has a memory leak in its PyTorch DataLoader. Each batch leaves behind a few megabytes of shared memory that PyTorch never reclaims. On big jobs the leak adds up fast: a routine 400-image job recently used up the entire host's RAM in under 20 minutes and got killed by the OS, taking ~half its images down with it. Two small DataLoader-config changes plus an explicit cleanup at end of job are enough to make worker memory stay roughly flat across batches instead of climbing until the kernel kills the process. ## The three changes 1. **`multiprocessing_context="forkserver"`** (new env var `AMI_ANTENNA_API_DATALOADER_MP_CONTEXT`, default `forkserver`). The Linux default is `fork`, which makes each DataLoader subprocess inherit the parent's entire memory state via copy-on-write. When the parent has large CUDA buffers and pinned memory loaded, every subprocess carries that state as "shared memory" and PyTorch sometimes can't clean it up on shutdown. `forkserver` re-execs subprocesses from a clean intermediate state, sidestepping the inheritance race. 2. **`timeout=300`** (new env var `AMI_ANTENNA_API_DATALOADER_TIMEOUT_S`, default 300s). When a DataLoader subprocess dies mid-shared-memory cleanup, the main thread currently blocks forever on `multiprocessing.Queue.get()`. A timeout converts that silent hang into a `RuntimeError` the supervisor can restart from. 3. **Explicit teardown in `_process_job`** — `del loader; gc.collect(); torch.cuda.empty_cache()` in the `finally:` block. Drops the DataLoader reference immediately instead of relying on the garbage collector to run `__del__` at some indeterminate later time. Also addresses the pipe-FD accumulation pattern observed in #145. All three are conservative — operators can revert each one individually via env vars without a code change. ## Out of scope - **`pin_memory` default not changed.** #138 and #122 disagree on the right default for HTTP-sourced workloads; that needs its own measured ablation, not a guess buried in this PR. - LRU model eviction (#144 Part A) and the worker drain-and-exit recycler (#147) are separate follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughAdds Settings and example env entries for DataLoader multiprocessing context and timeout, wires those into get_rest_dataloader(), explicitly tears down DataLoader/batch iterator on worker shutdown, and adds regression tests for defaults, overrides, validation, and backwards compatibility. ChangesDataLoader Subprocess Hygiene
Sequence Diagram(s)No sequence diagram generated for this change. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@trapdata/antenna/datasets.py`:
- Around line 453-458: The code currently validates mp_context against a
hardcoded tuple and then calls mp.get_context(mp_context); replace that with a
platform-aware check using multiprocessing.get_all_start_methods(): compute
allowed = set(mp.get_all_start_methods()), treat empty/None "" as allowed
default, and if mp_context is provided ensure mp_context is in allowed; if not,
raise ValueError that includes the actual supported methods (allowed) and the
invalid value; only call mp.get_context(mp_context) when mp_context is non-empty
so dataloader_mp_context is set safely.
In `@trapdata/antenna/tests/test_dataloader_hygiene.py`:
- Line 15: Remove the unused import of multiprocessing from the module: delete
the top-level import line "import multiprocessing" in the test file (the unused
symbol multiprocessing is triggering Flake8 F401), ensuring no other code
references multiprocessing remains; run the test/flake checks afterwards to
confirm the warning is resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0be389c7-0408-45bd-a2b8-ef0a1c174c1c
📒 Files selected for processing (5)
.env.exampletrapdata/antenna/datasets.pytrapdata/antenna/tests/test_dataloader_hygiene.pytrapdata/antenna/worker.pytrapdata/settings.py
…sed import CodeRabbit feedback on PR #148: - datasets.py: replace hardcoded ("fork","spawn","forkserver") tuple with multiprocessing.get_all_start_methods() so the validation matches what the host actually supports (e.g. macOS lacks "fork" by default in Python 3.13+). - test_dataloader_hygiene.py: drop unused `import multiprocessing` (Flake8 F401). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Claude says: Addressed both CodeRabbit findings in 0083ae8:
Local sanity check on arctia (Python 3.12.13):
CI failures — empirical pre-existence checkRan the 3 failing tests on both
The two Happy to file a follow-up issue for both classes of failure if useful — the Wikipedia one needs a fixture-URL change, the |
Plain-language summary
Our ML workers — the "ami workers" that pull jobs off the Antenna queue and run detection + classification on bug images — have a long-running memory leak in the PyTorch DataLoader. Each batch leaves behind a few megabytes of shared memory that PyTorch never reclaims. On big jobs the leak adds up fast: a routine 400-image production job recently used up the entire host's RAM in under 20 minutes and got OOM-killed, which dropped about half its images on the floor.
This PR ships three small, contained changes to how the DataLoader is configured plus an explicit cleanup at end of job. Together they should make worker memory stay roughly flat across batches instead of climbing until the kernel kills the process. All three knobs are env-configurable so an operator can opt any single host out if it surfaces a problem with a specific pipeline.
Motivation (2026-05-19 incident with numbers): #140 (comment)
What changes
1.
multiprocessing_context="forkserver"on the DataLoaderThe Linux default is
fork, which makes each DataLoader subprocess inherit the parent's entire memory state via copy-on-write. When the parent process already has large CUDA buffers and pinned memory loaded (a normal state for a running ML worker), every DataLoader subprocess carries that state along as "shared memory" — and PyTorch sometimes can't clean it up on subprocess shutdown.forkserverre-execs subprocesses from a clean intermediate state, sidestepping both the inheritance and the cleanup race.AMI_ANTENNA_API_DATALOADER_MP_CONTEXTforkserverfork,spawn,forkserver, or empty (let PyTorch pick its default)This is the fix listed in both #140 and #145 that nobody has tried in production yet.
2.
timeout=300on the DataLoaderWhen the shared-memory cleanup race in #140 hits, the main thread currently hangs forever in
multiprocessing.Queue.get()waiting for a subprocess that's already dead. A non-zerotimeoutconverts the silent hang into aRuntimeErrorthe worker can log, restart from, and move on. Zero risk to the happy path — workers normally finish a batch in seconds, not minutes.AMI_ANTENNA_API_DATALOADER_TIMEOUT_S3. Explicit teardown in
_process_jobIn the
finally:block of_process_job, after theresult_poster.shutdown()call:Right now the DataLoader gets dropped implicitly when the function returns. That can leave child processes' shared-memory segments and pipe FDs alive for an unpredictable amount of time. Explicit cleanup tells Python to release references immediately and asks PyTorch to flush its GPU buffer cache. Cheap insurance regardless of which
multiprocessing_contextends up active, and directly addresses the per-batch pipe-FD accumulation pattern observed in #145.Verification on antenna-dev-arctia (H100 24GB MIG, ~68 GiB host RAM)
Supervisor
ami-worker(2 procs,--num_workers 1,AMI_NUM_WORKERS=1) restarted withPYTHONPATHpointing at the patched worktree. Ran three back-to-backtest_ml_job_e2ejobs againstquebec_vermont_moths_2023:Zero image failures across ~360 processed images on the exact pipeline + parallel-2-workers configuration that lost 49% of images on production on 2026-05-19. No OOM kills in
dmesg. Both workers stayedRUNNINGthroughout.RSS / shmem-rss trajectory (sampled every 3s during the stress run):
For comparison, the 2026-05-19 production incident saw single-worker peaks of 17–35 GB shmem-rss on the same pipeline before the OS killed the process. The catastrophic per-job blow-up is gone; the residual growth that remains across multiple jobs is the territory of #144 Part B (out of scope here).
DataLoader subprocesses cleaned up cleanly after each job —
pgrep -c pt_data_workerreturned 0 between jobs.Test suite on the worktree: 21/21 passed (
pytest trapdata/antenna/tests/), including the new 8-testtest_dataloader_hygiene.pyregression suite that guards the configuration surface so a future refactor can't silently remove the fix.Pipeline-pickling smoke test
forkserverpickles the dataset class when spawning subprocesses. Verified manually thatget_rest_dataloader()withnum_workers=2+forkserverspawns cleanly usingRESTDataset(the shared dataset class for all pipelines). End-to-end run above usedquebec_vermont_moths_2023. Other pipelines (panama, uk_denmark, costa_rica_turing, kenya-uganda_turing, anguilla_turing, global, insect_orders, moth_binary) share the sameRESTDatasetso the pickling path is the same — to be exercised in the staged rollout.Out of scope (deliberate)
pin_memorydefault not touched. Clarify and document tuning parameters #138 and GPU utilization fixes #122 disagree on the right default for HTTP-sourced workloads. That needs its own measured ablation, not a guess buried in this PR.AMI_NUM_WORKERSenv-var reuse cleanup (Clarify and document tuning parameters #138 item 8) — doc-only, deferrable.Relationship to other open work
Orthogonal to PR #139 (env-knob plumbing on
feature/uv-migration). This PR is based onmainand does not modify the same files in conflicting ways.Staged rollout plan (post-merge)
dmesg,ps -o pid,rss,cmd -C ami,pgrep -c pt_data_workerfor 30+ minutes.RuntimeError: could not unlink ...events appear, roll to the rest of the fleet.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Tests