
[bugfix] Skip fps append in vllm mode for Qwen2.5-VL video (#9357) #9373

Open
yushuosun wants to merge 1 commit into modelscope:main from yushuosun:claude/trusting-ptolemy-i0ALe

Conversation

@yushuosun yushuosun commented May 17, 2026

Motivation

Fixes #9357.

Running swift infer --infer_backend vllm on Qwen/Qwen2.5-VL-3B-Instruct
with a video dataset crashes during prompt rendering under
transformers v5 / latest huggingface_hub:

File "huggingface_hub/dataclasses.py", line 144, in __strict_setattr__
    validator(value)
File "huggingface_hub/dataclasses.py", line 625, in validator
    type_validator(field.name, value, field.type)
File "huggingface_hub/dataclasses.py", line 482, in type_validator
    type_validator(name, value, args[0])
TypeError: ... fps must be a scalar, got list ...

tf backend works; only vllm is broken.

Root cause

swift/template/templates/qwen.py:347 — the Qwen2VLTemplate.replace_tag
path for version == 'v2_5' appends the per-video video_kwargs
(a dict that includes fps) into inputs.mm_processor_kwargs['fps']
unconditionally:

inputs.mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)

In vLLM mode this list of dicts is then forwarded to the new
huggingface_hub strict dataclass validator inside the HF processor,
which expects fps to be a scalar and rejects the list.
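
The failure mode can be illustrated in isolation. This is a minimal sketch, not the real huggingface_hub validator: validate_fps is a hypothetical stand-in that only mimics a scalar-only check (int, float, or None), collect_fps is a hypothetical helper that reproduces the unconditional append, and the contents of each video_kwargs dict are assumed for illustration.

```python
from typing import Any, Dict, List


def validate_fps(value: Any) -> None:
    """Hypothetical stand-in for a strict scalar-only field check."""
    if value is None:
        return
    # bool is a subclass of int; rejecting it here is an assumption of this sketch.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"fps must be a scalar, got {type(value).__name__}")


def collect_fps(per_video_kwargs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Mimic the unconditional append from the 'v2_5' branch."""
    mm_processor_kwargs: Dict[str, Any] = {}
    for video_kwargs in per_video_kwargs:
        mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)
    return mm_processor_kwargs


# Two videos produce a list of two dicts under the 'fps' key.
kwargs = collect_fps([{'fps': 2.0}, {'fps': 4.0}])

try:
    validate_fps(kwargs['fps'])
    rejected = False
    message = ''
except TypeError as exc:
    rejected = True
    message = str(exc)
```

A scalar such as 2.0 passes the stand-in check, while the accumulated list does not, which mirrors the shape of the reported error.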

The neighbouring branches already special-case vLLM:

elif self.version == 'v3':
    if self.mode != 'vllm':
        video, video_metadata = ...
elif self.version == 'omni':
    if self.mode != 'vllm':
        ...

The 'v2_5' branch was simply missing the same guard.

Modifications

swift/template/templates/qwen.py — wrap the 'v2_5' branch's
mm_processor_kwargs['fps'].append(...) in if self.mode != 'vllm':,
matching the v3 / omni pattern (+2 / -1):

 if self.version == 'v2_5':
-    inputs.mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)
+    if self.mode != 'vllm':
+        inputs.mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)
 elif self.version == 'v3':

Net diff: +2 / -1 lines in one file. The tf backend path is
unchanged.
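
The effect of the guard can be sketched with a stand-in object. FakeInputs and append_fps are hypothetical names used only for this illustration, and 'other' is a placeholder for any mode string that is not 'vllm'; the real template class carries far more state.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class FakeInputs:
    """Hypothetical stand-in for the template inputs object."""
    mm_processor_kwargs: Dict[str, Any] = field(default_factory=dict)


def append_fps(inputs: FakeInputs, mode: str, video_kwargs: Dict[str, Any]) -> None:
    """Guarded append, mirroring the patched 'v2_5' branch."""
    if mode != 'vllm':
        inputs.mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)


# In vllm mode nothing is written, so no 'fps' key reaches the processor.
vllm_inputs = FakeInputs()
append_fps(vllm_inputs, 'vllm', {'fps': 2.0})

# In any other mode the per-video kwargs are still collected as before.
other_inputs = FakeInputs()
append_fps(other_inputs, 'other', {'fps': 2.0})
```

With the guard in place, the vllm path leaves mm_processor_kwargs empty, while the other path keeps its previous contents.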

In vllm mode, Qwen2VLTemplate.replace_tag was passing the local
fps probe (a list) through mm_processor_kwargs to vllm's
Qwen2_5_VLProcessor, which under transformers v5 validates fps as
scalar (int|float|None) and rejects the list with
StrictDataclassFieldValidationError.

The v3 branch immediately below already guards 'video_metadata' with
'if self.mode != "vllm":' for the same reason. Apply the same guard
to the v2_5 fps append so vllm computes fps itself from the video
input.

The non-vllm _encode path is unaffected: it still receives fps in
mm_processor_kwargs to compute second_per_grid_ts.
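
For context on why fps matters on the non-vllm path, here is a rough sketch of how a per-video fps could map to second_per_grid_ts. It assumes the relationship temporal_patch_size / fps with a temporal patch size of 2; both the formula and the default are assumptions about the upstream processor, not something stated in this PR, and second_per_grid_ts below is a hypothetical helper.

```python
from typing import List

# Assumed default; the real value would come from the video processor config.
TEMPORAL_PATCH_SIZE = 2


def second_per_grid_ts(fps_per_video: List[float],
                       temporal_patch_size: int = TEMPORAL_PATCH_SIZE) -> List[float]:
    """Seconds covered by one temporal grid step, per video (assumed formula)."""
    if any(fps <= 0 for fps in fps_per_video):
        raise ValueError('fps must be positive')
    return [temporal_patch_size / fps for fps in fps_per_video]


slow = second_per_grid_ts([2.0])   # one grid step spans 1 second
fast = second_per_grid_ts([24.0])  # one grid step spans 2/24 of a second
```

Under these assumptions, a higher fps yields a smaller time step per grid position, which is the quantity the temporal position encoding would depend on.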

Fixes modelscope#9357

Co-authored-by: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 17, 2026 23:57

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the Qwen template in swift/template/templates/qwen.py to ensure that video frame rate metadata is only appended to mm_processor_kwargs when the mode is not 'vllm' for version 2.5. This change ensures consistency with the logic used for version 3. There are no review comments to address, and I have no further feedback to provide.

Copilot AI left a comment

Pull request overview

Fixes a vLLM-only crash when rendering prompts for Qwen2.5-VL with video inputs under newer transformers/huggingface_hub strict validation, by avoiding passing an incompatible fps structure via mm_processor_kwargs in vLLM mode.

Changes:

  • Skip appending per-video fps data into inputs.mm_processor_kwargs when self.mode == 'vllm' for the version == 'v2_5' branch.
  • Align v2_5 behavior with existing v3 handling that already guards vLLM from similar multimodal kwargs mutations.

@Tohrusky

I'm a bit concerned about whether video_metadata is passed correctly in vllm mode. Could we print the RoPE-related values (or something similar) to verify the propagation when FPS is set to 24?

Development

Successfully merging this pull request may close these issues.

[bug] Qwen2.5-VL vllm infer video failed on transformers v5

3 participants