
[Bug fix] Avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate#9335

Merged
hjh0119 merged 7 commits into
modelscope:mainfrom
ys2025-AI:main
May 19, 2026

Conversation


@ys2025-AI ys2025-AI commented May 13, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Problem
When running GRPO with --vllm_mode colocate on Ascend NPU (8 x A2), Accelerator.prepare() calls _cast_module_to_fp32_for_npu_if_needed() to cast the model to fp32 before FSDP2 sharding. In colocate mode, however, the model has already been preloaded onto the NPU by vLLM. Calling module.to(torch.float32) on the NPU temporarily keeps both the bf16 and the fp32 copy of the full model in device memory, which causes OOM on large models such as Qwen3-30B-A3B.
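As a rough illustration of the duplication described above, the sketch below estimates the size of one unsharded copy of the weights in each dtype. The 30e9 parameter count is only the nominal figure suggested by the model name, and weight_gb is a hypothetical helper; the estimate ignores vLLM's own allocations, activations, and any sharding across devices.

```python
# Back-of-the-envelope estimate of weight memory during an in-place dtype cast.
# Assumption: ~30e9 parameters, taken from the "30B" in the model name.
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}


def weight_gb(n_params: int, dtype: str) -> float:
    """Return the size of one full copy of the weights in decimal GB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 30_000_000_000  # nominal, not measured

bf16_gb = weight_gb(N_PARAMS, "bf16")
fp32_gb = weight_gb(N_PARAMS, "fp32")
# Upper bound while both copies are alive on the same device.
peak_gb = bf16_gb + fp32_gb

print(f"bf16: {bf16_gb:.0f} GB, fp32: {fp32_gb:.0f} GB, both: {peak_gb:.0f} GB")
```

Under these assumptions the fp32 copy alone is twice the size of the bf16 weights, so holding both at once triples the weight footprint for the duration of the cast.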
Root Cause
The existing NPU patch assumes the model resides on the CPU or the meta device before prepare(). GRPO colocate breaks this assumption because vLLM initializes the model on the NPU first.
Fix
When param.device.type == 'npu', move the model back to CPU, call synchronize() and empty_cache() to release the NPU memory, then perform the fp32 cast on CPU. FSDP2 then shards the fp32 parameters back onto the NPU during prepare().
Compatibility
No impact on standard SFT / LoRA / Full fine-tuning where the model stays on CPU before prepare().
Only affects the NPU colocate code path.
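The steps in the Fix section can be sketched roughly as follows. This is a minimal illustration, not the merged code: cast_to_fp32_off_accelerator and its accel_type parameter are hypothetical names, and only the torch_npu.npu.synchronize() and torch_npu.npu.empty_cache() calls are taken from the diff under review. torch_npu is imported lazily so the function still runs on machines without an Ascend install.

```python
import torch
from torch import nn


def cast_to_fp32_off_accelerator(module: nn.Module, accel_type: str = "npu") -> nn.Module:
    """Hypothetical helper: cast a module to fp32, offloading to CPU first
    if its weights currently live on the accelerator."""
    param = next(module.parameters(), None)
    if param is None:
        # Nothing to cast.
        return module

    if param.device.type == accel_type:
        # Move the low-precision weights off the device before casting,
        # so the fp32 copy is never materialized in device memory.
        module = module.cpu()
        # Only available on Ascend installs, hence the lazy import.
        import torch_npu
        torch_npu.npu.synchronize()
        torch_npu.npu.empty_cache()

    # For CPU or meta modules this is the only step that runs.
    return module.to(torch.float32)


# Usage on a CPU-resident bf16 module: the accelerator branch is skipped.
model = nn.Linear(4, 2).to(torch.bfloat16)
model = cast_to_fp32_off_accelerator(model)
```

The trade-off is an extra device-to-host copy on the colocate path in exchange for a lower peak in device memory; the standard paths only pay for one device-type comparison.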


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the NPU FSDP patching logic to move modules to the CPU and clear the NPU cache before casting to FP32, which prevents OOM errors in scenarios like GRPO with vLLM colocation where the model is preloaded on the NPU. Review feedback suggests making this CPU move and cache clearing conditional on whether the parameters are actually residing on the NPU to avoid unnecessary synchronization overhead and potential issues with meta-device initialization.

Comment thread swift/model/npu_patch/fsdp.py Outdated
Comment on lines +35 to +38
import torch_npu
module = module.cpu()
torch_npu.npu.synchronize()
torch_npu.npu.empty_cache()

medium

It is recommended to only move the module to CPU and clear the NPU cache if the parameters are currently residing on the NPU. This avoids unnecessary synchronization and cache clearing overhead for standard training paths (like SFT or LoRA) where the model is already on the CPU or meta device. It also ensures better compatibility with meta-device initialization, as calling .cpu() on meta-parameters might lead to unexpected behavior depending on the PyTorch version.

Suggested change
- import torch_npu
- module = module.cpu()
- torch_npu.npu.synchronize()
- torch_npu.npu.empty_cache()
+ if param.device.type == 'npu':
+     import torch_npu
+     module = module.cpu()
+     torch_npu.npu.synchronize()
+     torch_npu.npu.empty_cache()
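To see why the device-type guard leaves the standard paths alone, here is a small sketch that evaluates the same condition for a CPU module and a meta-device module. needs_cpu_offload is a hypothetical name used only for this illustration; it contains the check but none of the NPU calls, so it runs without torch_npu.

```python
from torch import nn


def needs_cpu_offload(module: nn.Module, accel_type: str = "npu") -> bool:
    """Hypothetical predicate mirroring the suggested guard: True only when
    the first parameter lives on the given accelerator type."""
    param = next(module.parameters(), None)
    return param is not None and param.device.type == accel_type


cpu_model = nn.Linear(4, 2)                  # typical SFT / LoRA case before prepare()
meta_model = nn.Linear(4, 2, device="meta")  # meta-device initialization

# Neither module is on an NPU, so the offload branch would be skipped
# and .cpu() is never called on meta parameters.
print(needs_cpu_offload(cpu_model), needs_cpu_offload(meta_model))
```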


@ys2025-AI ys2025-AI left a comment


Added the param.device.type == 'npu' check.


@ys2025-AI ys2025-AI left a comment


update

@hjh0119 hjh0119 merged commit 7dd6b0e into modelscope:main May 19, 2026
1 of 3 checks passed

3 participants