Stabilize vLLM TP rollout all-reduce #9372
Code Review
This pull request introduces a utility function, configure_vllm_allreduce_env, to manage the VLLM_ALLREDUCE_USE_SYMM_MEM environment variable, defaulting it to '0' for stability during vLLM tensor-parallel rollouts. This utility is integrated into both the Megatron and RLHF rollout trainers. The review feedback suggests improving the robustness of this new function by handling cases where the tensor_parallel_size parameter might be None to prevent potential runtime type errors.
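For readers without the diff at hand, the behaviour described in this review can be sketched roughly as follows. This is a minimal sketch and not the code merged in the PR: only the function name and the environment variable name come from the summary above. The signature, the optional `environ` parameter, the return value, and the choice to apply the default only when the tensor-parallel size is greater than 1 are all assumptions made for illustration.

```python
import os
from typing import MutableMapping, Optional

ENV_NAME = "VLLM_ALLREDUCE_USE_SYMM_MEM"


def configure_vllm_allreduce_env(
    tensor_parallel_size: Optional[int] = None,
    environ: Optional[MutableMapping[str, str]] = None,
) -> Optional[str]:
    """Default the symmetric-memory all-reduce switch to '0' for TP rollouts.

    Hypothetical signature: `environ` exists only so the sketch can be
    exercised without touching the real process environment.
    """
    env = os.environ if environ is None else environ
    # Treat a missing size as "no tensor parallelism" so the comparison
    # below never sees None (the case raised in the review feedback).
    tp = 1 if tensor_parallel_size is None else int(tensor_parallel_size)
    if tp > 1:
        # setdefault keeps an explicit user-provided value untouched.
        env.setdefault(ENV_NAME, "0")
    return env.get(ENV_NAME)
```

Under these assumptions, calling `configure_vllm_allreduce_env(2, env)` on an empty mapping sets the variable to `'0'`, while a value the user has already exported is left as it is.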
Maintenance validation update for #9372: I re-ran the focused validation in an isolated local environment instead of leaving the previous Environment/dependencies installed in
Validation commands run:
git diff --check origin/main...HEAD
python -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py tests/utils/test_vllm_env.py
python -m pytest -q tests/utils/test_vllm_env.py
Result: I did not run a full GPU/vLLM rollout job locally; this focused run covers the new
Maintenance validation update for #9372: No source edits were made in this run. I recreated the isolated validation environment because the ledger still had an older
Validation environment:
python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506 --json-out /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506/.bounty-validation/preflight.json
Result:
Validation commands run:
git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gitleaks detect --no-git --source . --redact --no-banner
gh pr checks 9372 --repo modelscope/ms-swift --watch=false
Results: The remaining inline Gemini note is an outdated thread; the current implementation and
Automated maintenance note for the failed
Result:
Validation run locally on the current head:
python3 /Users/ssr/.codex/bounty-radar/automation/validation_requirements.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --changed-files-from-git origin/main...HEAD --pr-title "Stabilize vLLM TP rollout all-reduce" --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/required-validation.json
python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --max-light-requirements 0 --python-only --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/preflight.json
git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gh pr checks 9372 --repo modelscope/ms-swift --watch=false
Results: