Skip to content

Stabilize vLLM TP rollout all-reduce#9372

Closed
yyswhsccc wants to merge 3 commits into
modelscope:mainfrom
yyswhsccc:bounty-radar/issue-8506-qwen35-grpo-vllm-hang
Closed

Stabilize vLLM TP rollout all-reduce#9372
yyswhsccc wants to merge 3 commits into
modelscope:mainfrom
yyswhsccc:bounty-radar/issue-8506-qwen35-grpo-vllm-hang

Conversation

@yyswhsccc
Copy link
Copy Markdown

@yyswhsccc yyswhsccc commented May 17, 2026

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a utility function, configure_vllm_allreduce_env, to manage the VLLM_ALLREDUCE_USE_SYMM_MEM environment variable, defaulting it to '0' for stability during vLLM tensor-parallel rollouts. This utility is integrated into both the Megatron and RLHF rollout trainers. The review feedback suggests improving the robustness of this new function by handling cases where the tensor_parallel_size parameter might be None to prevent potential runtime type errors.

Comment thread swift/utils/env.py Outdated
@yyswhsccc
Copy link
Copy Markdown
Author

Maintenance validation update for #9372:

I re-ran the focused validation in an isolated local environment instead of leaving the previous pytest not installed limitation in place.

Environment/dependencies installed in .venv-pr9372:

  • pytest
  • transformers
  • modelscope
  • torch macOS arm64 wheel, no CUDA stack
  • editable project install via python -m pip install -e .

Validation commands run:

git diff --check origin/main...HEAD
python -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py tests/utils/test_vllm_env.py
python -m pytest -q tests/utils/test_vllm_env.py

Result:

4 passed, 2 warnings in 5.16s

I did not run a full GPU/vLLM rollout job locally; this focused run covers the new configure_vllm_allreduce_env behavior and the changed call sites without requiring a multi-GPU vLLM environment.

@yyswhsccc
Copy link
Copy Markdown
Author

Maintenance validation update for #9372:

No source edits were made in this run. I recreated the isolated validation environment because the ledger still had an older pytest not installed validation gap.

Validation environment:

python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506 --json-out /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506/.bounty-validation/preflight.json

Result: ok: true; created .venv-bounty-validation with /usr/bin/python3, installed pytest, light/project dependencies, and editable ms_swift. Binary-only installs for binpacking, oss2, and transformers_stream_generator were unavailable, but they were not needed for the focused test below.

Validation commands run:

git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gitleaks detect --no-git --source . --redact --no-banner
gh pr checks 9372 --repo modelscope/ms-swift --watch=false

Results:

git diff --check: passed
py_compile: passed
pytest: 4 passed, 1 warning in 42.89s
gitleaks: no leaks found
GitHub checks: no checks reported on the PR branch

The remaining inline Gemini note is an outdated thread; the current implementation and test_configure_vllm_allreduce_env_skips_missing_tensor_parallel cover the tensor_parallel_size=None case.

@yyswhsccc
Copy link
Copy Markdown
Author

Automated maintenance note for the failed citest-npu / unittest job.

Result:

  • No source edits were made and nothing was pushed in this pass.
  • I read the failed GitHub Actions job log before taking action. The failed job is unittest in workflow citest-npu at run 26003394740, job 76451665350.
  • The failures are in unchanged tuner tests: tests/tuners/test_merged_linear.py::TestMergedLinear::test_swift_lora_forward (torch.allclose(..., atol=1e-4) false) and tests/tuners/test_swift_base.py::TestSwift::test_save_to_peft_ok (torch.isclose(...) false).
  • The current PR diff only changes the vLLM all-reduce environment helper/call sites and tests/utils/test_vllm_env.py, so I do not see a clear PR-scoped code change for the NPU-only tuner precision failures. Yongshan should make the final call.

Validation run locally on the current head f927d482:

python3 /Users/ssr/.codex/bounty-radar/automation/validation_requirements.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --changed-files-from-git origin/main...HEAD --pr-title "Stabilize vLLM TP rollout all-reduce" --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/required-validation.json
python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --max-light-requirements 0 --python-only --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/preflight.json
git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gh pr checks 9372 --repo modelscope/ms-swift --watch=false

Results:

validation checklist: passed; required command is full pytest, recorded below as locally skipped because it maps to the broad repo/GPU/NPU suite rather than the PR-scoped change.
python-only preflight: passed; used `/usr/bin/python3`, pytest already available, editable install passed, broad light requirements intentionally skipped to avoid optional ML stack churn.
git diff --check: passed.
py_compile: passed.
focused pytest: 4 passed, 1 warning in 89.60s.
GitHub checks: `lint` passed, regular `citest / unittest` passed, `citest-npu / unittest` failed as described above.

@yyswhsccc yyswhsccc closed this May 18, 2026
@yyswhsccc yyswhsccc deleted the bounty-radar/issue-8506-qwen35-grpo-vllm-hang branch May 18, 2026 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant