[Bug fix] Avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate by ys2025-AI · Pull Request #9335 · modelscope/ms-swift

ys2025-AI · 2026-05-13T14:06:17Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Problem
When running GRPO with --vllm_mode colocate on Ascend NPU (8 x A2), Accelerator.prepare() triggers _cast_module_to_fp32_for_npu_if_needed() to cast the model to fp32 before FSDP2 sharding. However, in colocate mode the model has already been preloaded onto NPU by vLLM. Casting module.to(torch.float32) on NPU temporarily duplicates the full model (bf16 + fp32), causing OOM on large models like Qwen3-30B-A3B.
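A rough back-of-the-envelope sketch of why the cast is expensive, taking the description's worst case (both copies resident at once) at face value. The 30e9 parameter count is only the nominal figure implied by the model name, and the PR does not say how the weights are split across the 8 devices, so these are whole-model totals rather than per-device numbers. The helper names are hypothetical.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}


def cast_peak_bytes(n_params: int, src: str = "bf16", dst: str = "fp32") -> int:
    """Worst-case bytes resident while the source and destination copies coexist."""
    return n_params * (BYTES_PER_PARAM[src] + BYTES_PER_PARAM[dst])


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB for readability."""
    return n_bytes / 2**30


# Assumption: nominal parameter count taken from the model name, not measured.
nominal_params = 30_000_000_000

print(f"bf16 weights alone:  {gib(nominal_params * BYTES_PER_PARAM['bf16']):.1f} GiB")
print(f"bf16 + fp32 at peak: {gib(cast_peak_bytes(nominal_params)):.1f} GiB")
```

Under these assumptions the cast triples the footprint of the bf16 weights while both copies exist, which is the overhead that moving the cast to CPU keeps off the NPU.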
Root Cause
The patch assumes the model resides on CPU/meta before prepare(). GRPO colocate breaks this assumption because vLLM initializes the model on NPU first.
Fix
When param.device.type == 'npu', move the model back to CPU, free NPU memory via empty_cache() + synchronize(), then perform the fp32 cast on CPU. FSDP2 will shard the fp32 parameters back to NPU during prepare().
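The control flow described in the fix can be sketched as follows. This is not the code merged into fsdp.py: the function name, its parameters, and the injected release_device_memory callback are hypothetical, chosen so the sketch runs without Ascend hardware. The module only needs the parameters(), cpu(), and to() methods that torch.nn.Module provides.

```python
from typing import Any, Callable, Optional


def cast_to_fp32_via_cpu(
    module: Any,
    fp32_dtype: Any,
    release_device_memory: Optional[Callable[[], None]] = None,
    accelerator_type: str = "npu",
) -> Any:
    """Cast `module` to fp32, offloading to CPU first if it lives on the accelerator.

    Hypothetical helper: `module` is anything with the torch.nn.Module-style
    methods parameters(), cpu() and to(dtype).
    """
    # Inspect the first parameter to decide where the weights currently live.
    first_param = next(iter(module.parameters()), None)

    if first_param is not None and first_param.device.type == accelerator_type:
        # Preloaded on the accelerator (e.g. by a colocated inference engine):
        # move to host memory so the fp32 copy is never built on the device.
        module = module.cpu()
        if release_device_memory is not None:
            # Let the caller synchronize and return cached blocks to the device.
            release_device_memory()

    # On the CPU and meta paths the branch above is skipped, so the standard
    # training flow pays no extra synchronization cost.
    return module.to(fp32_dtype)
```

On Ascend, a caller would presumably pass torch.float32 as fp32_dtype and a callback that invokes torch_npu.npu.synchronize() followed by torch_npu.npu.empty_cache(), the two calls shown in the PR diff. Injecting the callback rather than importing torch_npu directly is a choice made for this sketch only, so the branch logic can be exercised with stand-in objects.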
Compatibility
No impact on standard SFT / LoRA / Full fine-tuning where the model stays on CPU before prepare().
Only affects the NPU colocate code path.


gemini-code-assist

Code Review

This pull request modifies the NPU FSDP patching logic to move modules to the CPU and clear the NPU cache before casting to FP32, which prevents OOM errors in scenarios like GRPO with vLLM colocation where the model is preloaded on the NPU. Review feedback suggests making this CPU move and cache clearing conditional on whether the parameters are actually residing on the NPU to avoid unnecessary synchronization overhead and potential issues with meta-device initialization.

gemini-code-assist · 2026-05-13T14:07:33Z

+        import torch_npu
+        module = module.cpu()
+        torch_npu.npu.synchronize()
+        torch_npu.npu.empty_cache()


It is recommended to only move the module to CPU and clear the NPU cache if the parameters are currently residing on the NPU. This avoids unnecessary synchronization and cache clearing overhead for standard training paths (like SFT or LoRA) where the model is already on the CPU or meta device. It also ensures better compatibility with meta-device initialization, as calling .cpu() on meta-parameters might lead to unexpected behavior depending on the PyTorch version.

Suggested change

-        import torch_npu
-        module = module.cpu()
-        torch_npu.npu.synchronize()
-        torch_npu.npu.empty_cache()
+        if param.device.type == 'npu':
+            import torch_npu
+            module = module.cpu()
+            torch_npu.npu.synchronize()
+            torch_npu.npu.empty_cache()

ys2025-AI

Added the param.device.type == 'npu' check.

ys2025-AI

update

ys2025-AI added 3 commits May 13, 2026 21:45

avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate

dc5332b

Merge pull request #1 from ys2025-AI/fsdp2_npu

9bfc878

[Bug fix] avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate

Merge branch 'modelscope:main' into main

b466051

gemini-code-assist Bot reviewed May 13, 2026


ys2025-AI added 3 commits May 15, 2026 11:15

Update fsdp.py

b48494a

Merge branch 'modelscope:main' into main

1f6ec9c

Update fsdp.py

3e32487

ys2025-AI commented May 19, 2026


Update fsdp.py

0c4cbe2

ys2025-AI commented May 19, 2026


Jintao-Huang approved these changes May 19, 2026


hjh0119 approved these changes May 19, 2026


hjh0119 merged commit 7dd6b0e into modelscope:main May 19, 2026
1 of 3 checks passed

hjh0119 merged 7 commits into modelscope:main from ys2025-AI:main


3 participants
