Stabilize vLLM TP rollout all-reduce by yyswhsccc · Pull Request #9372 · modelscope/ms-swift

yyswhsccc · 2026-05-17T18:59:01Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces a utility function, configure_vllm_allreduce_env, to manage the VLLM_ALLREDUCE_USE_SYMM_MEM environment variable, defaulting it to '0' for stability during vLLM tensor-parallel rollouts. This utility is integrated into both the Megatron and RLHF rollout trainers. The review feedback suggests improving the robustness of this new function by handling cases where the tensor_parallel_size parameter might be None to prevent potential runtime type errors.

yyswhsccc · 2026-05-17T20:06:40Z

Maintenance validation update for #9372:

I re-ran the focused validation in an isolated local environment instead of leaving the previous pytest not installed limitation in place.

Environment/dependencies installed in .venv-pr9372:

pytest
transformers
modelscope
torch macOS arm64 wheel, no CUDA stack
editable project install via python -m pip install -e .

Validation commands run:

git diff --check origin/main...HEAD
python -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py tests/utils/test_vllm_env.py
python -m pytest -q tests/utils/test_vllm_env.py

Result:

4 passed, 2 warnings in 5.16s

I did not run a full GPU/vLLM rollout job locally; this focused run covers the new configure_vllm_allreduce_env behavior and the changed call sites without requiring a multi-GPU vLLM environment.

yyswhsccc · 2026-05-17T22:04:10Z

Maintenance validation update for #9372:

No source edits were made in this run. I recreated the isolated validation environment because the ledger still had an older pytest not installed validation gap.

Validation environment:

python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506 --json-out /Users/ssr/.codex/bounty-radar/bounty-pr-workspace/worker-1/modelscope__ms-swift__issue-8506/.bounty-validation/preflight.json

Result: ok: true; created .venv-bounty-validation with /usr/bin/python3, installed pytest, light/project dependencies, and editable ms_swift. Binary-only installs for binpacking, oss2, and transformers_stream_generator were unavailable, but they were not needed for the focused test below.

Validation commands run:

git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gitleaks detect --no-git --source . --redact --no-banner
gh pr checks 9372 --repo modelscope/ms-swift --watch=false

Results:

git diff --check: passed
py_compile: passed
pytest: 4 passed, 1 warning in 42.89s
gitleaks: no leaks found
GitHub checks: no checks reported on the PR branch

The remaining inline Gemini note is an outdated thread; the current implementation and test_configure_vllm_allreduce_env_skips_missing_tensor_parallel cover the tensor_parallel_size=None case.

yyswhsccc · 2026-05-18T05:44:40Z

Automated maintenance note for the failed citest-npu / unittest job.

Result:

No source edits were made and nothing was pushed in this pass.
I read the failed GitHub Actions job log before taking action. The failed job is unittest in workflow citest-npu at run 26003394740, job 76451665350.
The failures are in unchanged tuner tests: tests/tuners/test_merged_linear.py::TestMergedLinear::test_swift_lora_forward (torch.allclose(..., atol=1e-4) false) and tests/tuners/test_swift_base.py::TestSwift::test_save_to_peft_ok (torch.isclose(...) false).
The current PR diff only changes the vLLM all-reduce environment helper/call sites and tests/utils/test_vllm_env.py, so I do not see a clear PR-scoped code change for the NPU-only tuner precision failures. Yongshan should make the final call.

Validation run locally on the current head f927d482:

python3 /Users/ssr/.codex/bounty-radar/automation/validation_requirements.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --changed-files-from-git origin/main...HEAD --pr-title "Stabilize vLLM TP rollout all-reduce" --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/required-validation.json
python3 /Users/ssr/.codex/bounty-radar/automation/worker_env_preflight.py --repo /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372 --max-light-requirements 0 --python-only --json-out /Users/ssr/.codex/bounty-radar/maintenance-workspace/modelscope__ms-swift__pr9372/.bounty-validation/preflight.json
git diff --check origin/main...HEAD
.venv-bounty-validation/bin/python -B -m py_compile swift/utils/env.py swift/utils/__init__.py swift/megatron/trainers/rollout_mixin.py swift/rlhf_trainers/rollout_mixin.py swift/pipelines/infer/infer.py tests/utils/test_vllm_env.py
.venv-bounty-validation/bin/python -B -m pytest -q tests/utils/test_vllm_env.py
gh pr checks 9372 --repo modelscope/ms-swift --watch=false

Results:

validation checklist: passed; required command is full pytest, recorded below as locally skipped because it maps to the broad repo/GPU/NPU suite rather than the PR-scoped change.
python-only preflight: passed; used `/usr/bin/python3`, pytest already available, editable install passed, broad light requirements intentionally skipped to avoid optional ML stack churn.
git diff --check: passed.
py_compile: passed.
focused pytest: 4 passed, 1 warning in 89.60s.
GitHub checks: `lint` passed, regular `citest / unittest` passed, `citest-npu / unittest` failed as described above.

Stabilize vLLM TP rollout all-reduce

14329c2

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

Comment thread swift/utils/env.py Outdated

Handle missing vLLM tensor parallel size

9c842af

Cover vLLM TP all-reduce default in infer path

f927d48

yyswhsccc closed this May 18, 2026

yyswhsccc deleted the bounty-radar/issue-8506-qwen35-grpo-vllm-hang branch May 18, 2026 17:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize vLLM TP rollout all-reduce#9372

Stabilize vLLM TP rollout all-reduce#9372
yyswhsccc wants to merge 3 commits into
modelscope:mainfrom
yyswhsccc:bounty-radar/issue-8506-qwen35-grpo-vllm-hang

yyswhsccc commented May 17, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

yyswhsccc commented May 17, 2026

Uh oh!

yyswhsccc commented May 17, 2026

Uh oh!

yyswhsccc commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yyswhsccc commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

yyswhsccc commented May 17, 2026

Uh oh!

yyswhsccc commented May 17, 2026

Uh oh!

yyswhsccc commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yyswhsccc commented May 17, 2026 •

edited

Loading