Merged
6 changes: 5 additions & 1 deletion docs/source/Instruction/Command-line-parameters.md
Original file line number Diff line number Diff line change
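@@ -620,7 +620,11 @@ Reward model parameters are used in PPO and GRPO.
- reward_model_plugin: The reward model logic; defaults to the ORM logic. For details, see [Custom Reward Models](./GRPO/DeveloperGuide/reward_model.md#自定义奖励模型).
- dataset_shuffle: Whether to shuffle the dataset. Default is True.
- truncation_strategy: How to handle samples whose input length exceeds max_length. Supports `delete` and `left`, meaning the sample is deleted or truncated from the left, respectively. Default is left. With the delete strategy, deleted over-long samples and samples that fail encoding are replaced by resampling from the original dataset.
- loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo', 'sapo', 'real'], default is 'grpo'. For details, see this [doc](./GRPO/DeveloperGuide/loss_types.md)
- loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo', 'sapo', 'real', 'fipo'], default is 'grpo'. For details, see this [doc](./GRPO/DeveloperGuide/loss_types.md)
- fipo_decay_rate: Half-life parameter for the FIPO Future-KL discount. The actual discount is `2 ** (-1 / fipo_decay_rate)`. Default is 32.0.
- fipo_clip_range: Clipping range for the FIPO influence weight. Default is 0.2; set to None or 0 to disable clipping.
- fipo_clip_high_only: Whether to clip the FIPO influence weight only to `[1.0, 1.0 + fipo_clip_range]`. Default is True.
- fipo_safety_threshold: When the IS ratio of a negative-advantage token exceeds this threshold, the FIPO influence weight is limited to `[0.8, 1.0]`. Default is 4.0.
- log_completions: Whether to log the model-generated content during training; used together with `--report_to wandb/swanlab`. Default is False.
- Note: If `--report_to wandb/swanlab` is not set, a `completions.jsonl` is created in the checkpoint to store the generated content.
- use_vllm: Whether to use vLLM as the infer_backend for GRPO generation. Default is False.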
52 changes: 52 additions & 0 deletions docs/source/Instruction/GRPO/AdvancedResearch/FIPO.md
@@ -0,0 +1,52 @@
# FIPO: Future-KL Influenced Policy Optimization

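[FIPO](https://arxiv.org/abs/2603.19835) is a value-free RL method for long chain-of-thought reasoning. It keeps the overall GRPO/DAPO training framework but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, it uses a discounted, accumulated Future-KL signal to judge whether the trajectory that follows the current token is, as a whole, being reinforced or suppressed.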

## Core Idea

In GRPO/DAPO, the tokens of each response usually share the same sequence-level advantage:

$$
\hat{A}_{i,t} = \hat{A}_{i}
$$

This is stable and simple, but the credit assignment is coarse-grained. FIPO introduces the per-token log-prob shift between the current policy and the old policy:

$$
\Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) -
\log \pi_{\mathrm{old}}(y_t \mid x, y_{<t})
$$

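If $\Delta \log p_t > 0$, the current training step is raising the probability of that token; if it is below 0, the token is being suppressed. FIPO then accumulates this signal, with a discount, from the current position onward: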

$$
\mathrm{FutureKL}_t =
\sum_{k=t}^{T}\gamma^{k-t} M_k \Delta \log p_k
$$

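where $M_k$ is the completion mask and $\gamma = 2^{-1 / \text{decay\_rate}}$. The larger `decay_rate` is, the more strongly distant future tokens affect the current position; the smaller it is, the more local Future-KL becomes. Future-KL is then mapped to an influence weight: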

$$
f_t = \mathrm{clip}(\exp(\mathrm{FutureKL}_t), 1-\epsilon_f, 1+\epsilon_f)
$$

Finally, the original advantage is replaced by a future-aware advantage:

$$
\tilde{A}_{i,t} = \hat{A}_{i} \cdot f_{i,t}
$$
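
As a worked step (a derivation sketch that follows from the Future-KL definition above, not quoted from the paper), peeling the $k=t$ term off the sum gives a one-step backward recurrence, so all Future-KL values of a response can be filled in with a single right-to-left pass:

$$
\mathrm{FutureKL}_t = M_t \,\Delta \log p_t + \gamma \,\mathrm{FutureKL}_{t+1},
\qquad \mathrm{FutureKL}_{T+1} = 0
$$

The same definition of $\gamma$ explains the "half-life" wording: a token `decay_rate` positions ahead contributes with weight one half.

$$
\gamma^{\text{decay\_rate}} = \left(2^{-1/\text{decay\_rate}}\right)^{\text{decay\_rate}} = \tfrac{1}{2}
$$

With the default `decay_rate` of 32, $\gamma \approx 0.9786$, so a token 32 positions ahead is weighted by 1/2 and a token 64 positions ahead by 1/4.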

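## Parameters

| Parameter | Type | Default | Description |
|---------------------------|---------|--------|----------------------------------------------------------------------------------------------------------------|
| `--loss_type` | `str` | `grpo` | Set to `fipo` to enable the FIPO loss |
| `--delta` | `float` | `None` | When enabled, it is used both for filtering high-IS-ratio tokens in Future-KL and as the dual-clip upper bound of the main loss; it should be greater than `1 + epsilon_high`. To match the FIPO 32B training script, `10.0` is recommended |
| `--fipo_decay_rate` | `float` | `32.0` | Half-life parameter for the Future-KL discount; the actual discount is `2 ** (-1 / fipo_decay_rate)` |
| `--fipo_clip_range` | `float` | `0.2` | Influence weight clipping range; `0.2` clips to `[0.8, 1.2]`, or to `[1.0, 1.2]` while `fipo_clip_high_only` is `true` |
| `--fipo_clip_high_only` | `bool` | `true` | If `true`, the weight is clipped only to `[1.0, 1.0 + fipo_clip_range]`, which favors amplifying positive Future-KL |
| `--fipo_safety_threshold` | `float` | `4.0` | When the advantage is negative and the IS ratio exceeds this threshold, the FIPO weight is limited to `[0.8, 1.0]` to avoid over-penalization |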

## Training Example

[swift](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/fipo.sh)
1 change: 1 addition & 0 deletions docs/source/Instruction/GRPO/AdvancedResearch/index.rst
@@ -6,6 +6,7 @@ Advanced Research
entropy_mask.md
CISPO.md
DAPO.md
FIPO.md
deepeyes.md
GSPO.md
CHORD.md
14 changes: 14 additions & 0 deletions docs/source/Instruction/GRPO/DeveloperGuide/loss_types.md
@@ -108,6 +108,20 @@ $$\mathcal{L}_{\text{DAPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_

**Normalization dimension:** Global token dimension (total number of completion tokens across all processes)

## FIPO

`--loss_type fipo`

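FIPO introduces a Future-KL influence weight on top of the DAPO/GRPO clipped policy loss. For each token, the sequence-level advantage is multiplied by a weight derived from the discounted, accumulated KL shift from the current position through the following tokens: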

$$f_{i,t} = \text{clip}\left(\exp\left(\sum_{k=t}^{T_i} \gamma^{k-t} M_{i,k} \Delta \log p_{i,k}\right), 1-\epsilon_f, 1+\epsilon_f\right)$$

$$\mathcal{L}_{i,t}^{\text{FIPO}} = f_{i,t} \cdot \mathcal{L}_{i,t}$$

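By default the FIPO influence weight does not participate in gradient computation, and the loss uses the same global token normalization as DAPO.

**Normalization dimension:** Global token dimension (total number of completion tokens across all processes)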

## SAPO

`--loss_type sapo`
6 changes: 5 additions & 1 deletion docs/source_en/Instruction/Command-line-parameters.md
@@ -635,7 +635,11 @@ The meanings of the following parameters can be referenced [here](https://huggin
- reward_model_plugin: The logic for the reward model, which defaults to ORM logic. For more information, please refer to [Customized Reward Models](./GRPO/DeveloperGuide/reward_model.md#custom-reward-model).
- dataset_shuffle: Whether to shuffle the dataset randomly. Default is True.
- truncation_strategy: The method to handle inputs exceeding `max_length`. Supported values are `delete` and `left`, representing deletion and left-side truncation respectively. The default is `left`. With the delete strategy, over-long or encoding-failed samples are discarded, and new samples are resampled from the original dataset to maintain the intended batch size.
- loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo', 'sapo', 'real'], default is 'grpo'. For details, refer to this [doc](./GRPO/DeveloperGuide/loss_types.md)
- loss_type: The type of loss normalization. Options are ['grpo', 'bnpo', 'dr_grpo', 'dapo', 'cispo', 'sapo', 'real', 'fipo'], default is 'grpo'. For details, refer to this [doc](./GRPO/DeveloperGuide/loss_types.md)
- fipo_decay_rate: Half-life parameter for FIPO Future-KL. The actual discount is `2 ** (-1 / fipo_decay_rate)`. Default is 32.0.
- fipo_clip_range: Clipping range for the FIPO influence weight. Default is 0.2; set to None or 0 to disable clipping.
- fipo_clip_high_only: Whether to clip the FIPO influence weight to `[1.0, 1.0 + fipo_clip_range]` only. Default is True.
- fipo_safety_threshold: Caps the FIPO influence weight to `[0.8, 1.0]` for negative-advantage tokens whose IS ratio exceeds this threshold. Default is 4.0.
- log_completions: Whether to log the model-generated content during training, to be used in conjunction with `--report_to wandb/swanlab`, default is False.
- Note: If `--report_to wandb/swanlab` is not set, a `completions.jsonl` will be created in the checkpoint to store the generated content.
- use_vllm: Whether to use vLLM as the infer_backend for GRPO generation, default is False.
53 changes: 53 additions & 0 deletions docs/source_en/Instruction/GRPO/AdvancedResearch/FIPO.md
@@ -0,0 +1,53 @@
# FIPO: Future-KL Influenced Policy Optimization

[FIPO](https://arxiv.org/abs/2603.19835) is a value-free RL method for eliciting longer and deeper reasoning. It keeps the GRPO/DAPO training scaffold, but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, FIPO uses a discounted Future-KL signal to estimate whether the future trajectory after each token is being reinforced or suppressed.

## Core Idea

In GRPO/DAPO, tokens in the same response usually share the same sequence-level advantage:

$$
\hat{A}_{i,t} = \hat{A}_{i}
$$

This is simple and stable, but the credit assignment is coarse. FIPO starts from the signed log-probability shift between the current policy and the old policy:

$$
\Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) -
\log \pi_{\mathrm{old}}(y_t \mid x, y_{<t})
$$

A positive value means the current update is increasing the token's probability, while a negative value means the token is being suppressed. FIPO then accumulates this signal, with a discount, from the current token to the end of the response:

$$
\mathrm{FutureKL}_t =
\sum_{k=t}^{T}\gamma^{k-t} M_k \Delta \log p_k
$$

where $M_k$ is the completion mask and $\gamma = 2^{-1 / \text{decay\_rate}}$. A larger `decay_rate` gives farther future tokens more influence; a smaller value makes the signal more local. FIPO maps the Future-KL value into a bounded influence weight:

$$
f_t = \mathrm{clip}(\exp(\mathrm{FutureKL}_t), 1-\epsilon_f, 1+\epsilon_f)
$$

The original advantage is then replaced by a future-aware advantage:

$$
\tilde{A}_{i,t} = \hat{A}_{i} \cdot f_{i,t}
$$
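
The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ms-swift implementation: the function names `future_kl` and `influence_weight` are hypothetical, the clip is the symmetric `[1 - eps, 1 + eps]` form from the equation, and the suffix sum is computed with the backward recurrence `F_t = M_t * dlogp_t + gamma * F_{t+1}`, which follows from the definition.

```python
import numpy as np


def future_kl(delta_logp, mask, decay_rate=32.0):
    """Discounted suffix sum F_t = sum_{k>=t} gamma^(k-t) * M_k * delta_logp_k.

    Hypothetical helper; the last axis is the token axis.
    """
    gamma = 2.0 ** (-1.0 / decay_rate)
    masked = np.asarray(delta_logp, dtype=np.float64) * np.asarray(mask, dtype=np.float64)
    out = np.zeros_like(masked)
    running = 0.0
    # Walk right to left so each step reuses the already-discounted future sum.
    for t in range(masked.shape[-1] - 1, -1, -1):
        running = masked[..., t] + gamma * running
        out[..., t] = running
    return out


def influence_weight(fkl, clip_range=0.2):
    """Map Future-KL to a weight clipped to [1 - clip_range, 1 + clip_range]."""
    return np.clip(np.exp(fkl), 1.0 - clip_range, 1.0 + clip_range)


# Example: one response with three completion tokens.
delta_logp = np.array([0.10, -0.20, 0.05])  # log pi_theta - log pi_old per token
mask = np.array([1.0, 1.0, 1.0])
weights = influence_weight(future_kl(delta_logp, mask))
advantage = 1.5                              # sequence-level advantage
future_aware_advantage = advantage * weights
```

The backward loop costs one pass per sequence instead of a quadratic double sum, which matters for long reasoning traces.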

## Parameters


Review comment (Contributor, priority: medium): This extra blank line should be removed to maintain consistent formatting with the Chinese version of this document.

| Parameter | Type | Default | Description |
| ------------------------- | ------- | ------- | ----------- |
| `--loss_type` | `str` | `grpo` | Set to `fipo` to enable the FIPO loss |
| `--delta` | `float` | `None` | When enabled, it is used for both Future-KL high-IS-ratio token filtering and the main-loss dual-clip upper bound, and should be greater than `1 + epsilon_high`. Set it to `10.0` to match the official 32B script |
| `--fipo_decay_rate` | `float` | `32.0` | Half-life parameter for Future-KL; the actual discount is `2 ** (-1 / fipo_decay_rate)` |
| `--fipo_clip_range` | `float` | `0.2` | Influence weight clipping range; `0.2` clips to `[0.8, 1.2]`, or to `[1.0, 1.2]` while `fipo_clip_high_only` is `true` |
| `--fipo_clip_high_only` | `bool` | `true` | If `true`, clips the weight to `[1.0, 1.0 + fipo_clip_range]` only |
| `--fipo_safety_threshold` | `float` | `4.0` | Caps the FIPO weight to `[0.8, 1.0]` for negative-advantage tokens whose IS ratio exceeds this threshold |
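
How the three clipping-related flags interact can be sketched as follows. This is an assumption-laden illustration built only from the descriptions in the table: the function name `clip_influence_weight` is hypothetical, and the order of operations (range clip first, safety cap second) is a guess rather than something the table states.

```python
import numpy as np


def clip_influence_weight(raw_weight, advantage, is_ratio,
                          clip_range=0.2, clip_high_only=True,
                          safety_threshold=4.0):
    """Hypothetical sketch of the FIPO weight clipping described in the table."""
    w = np.asarray(raw_weight, dtype=np.float64).copy()
    if clip_range:  # None or 0 disables clipping
        low = 1.0 if clip_high_only else 1.0 - clip_range
        w = np.clip(w, low, 1.0 + clip_range)
    if safety_threshold is not None:
        # Assumed order: the safety cap is applied after the range clip.
        risky = (np.asarray(advantage) < 0) & (np.asarray(is_ratio) > safety_threshold)
        w = np.where(risky, np.clip(w, 0.8, 1.0), w)
    return w


raw = np.array([0.5, 1.5, 1.1])    # exp(FutureKL) before clipping
adv = np.array([1.0, 1.0, -1.0])   # per-token copy of the sequence advantage
ratio = np.array([1.0, 1.0, 5.0])  # importance-sampling ratio per token
default_w = clip_influence_weight(raw, adv, ratio)
symmetric_w = clip_influence_weight(raw, adv, ratio, clip_high_only=False)
```

With the defaults, the first token is lifted to the lower bound 1.0, while `clip_high_only=False` lets it fall to 0.8; the third token is capped at 1.0 in both cases because its advantage is negative and its IS ratio exceeds the threshold.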

## Training Example

[swift](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/fipo.sh)
1 change: 1 addition & 0 deletions docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst
@@ -6,6 +6,7 @@ Advanced Research
entropy_mask.md
CISPO.md
DAPO.md
FIPO.md
deepeyes.md
GSPO.md
CHORD.md
14 changes: 14 additions & 0 deletions docs/source_en/Instruction/GRPO/DeveloperGuide/loss_types.md
@@ -108,6 +108,20 @@ where:

**Normalization Dimension:** Global token dimension (total completion tokens across all processes)

## FIPO

`--loss_type fipo`

FIPO adds a Future-KL influence weight on top of the DAPO/GRPO clipped policy loss. Each token's sequence-level advantage is multiplied by a weight derived from Future-KL, the discounted, accumulated log-probability shift from the current token through the end of the response:

$$f_{i,t} = \text{clip}\left(\exp\left(\sum_{k=t}^{T_i} \gamma^{k-t} M_{i,k} \Delta \log p_{i,k}\right), 1-\epsilon_f, 1+\epsilon_f\right)$$

$$\mathcal{L}_{i,t}^{\text{FIPO}} = f_{i,t} \cdot \mathcal{L}_{i,t}$$

The FIPO influence weight is detached from the computation graph by default, and the loss uses the same global token normalization as DAPO.

**Normalization Dimension:** Global token dimension (total completion tokens across all processes)
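
A single-process sketch of the weighted loss and its token-level normalization is shown below. It makes several assumptions: `fipo_loss` is a hypothetical name, the per-token loss is the standard clipped surrogate with `epsilon` and `epsilon_high`, the dual-clip bound `delta` is omitted, and the weight is passed in as a plain array (NumPy has no autograd, so "detached" here just means it is treated as a constant). In distributed training the numerator and the token count would each be summed across processes before dividing.

```python
import numpy as np


def fipo_loss(ratio, advantage, weight, mask, epsilon=0.2, epsilon_high=0.28):
    """Hypothetical sketch: f_{i,t} * L_{i,t}, normalized by total completion tokens.

    ratio, weight, mask: arrays of shape (num_sequences, num_tokens)
    advantage: array of shape (num_sequences,), one value per sequence
    """
    ratio, weight, mask = (np.asarray(a, dtype=np.float64) for a in (ratio, weight, mask))
    adv = np.asarray(advantage, dtype=np.float64)[:, None]  # broadcast over tokens
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon_high)
    per_token = -np.minimum(ratio * adv, clipped * adv)     # clipped policy loss
    weighted = weight * per_token                           # weight acts as a constant
    return (weighted * mask).sum() / max(mask.sum(), 1.0)


loss = fipo_loss(
    ratio=[[1.0, 1.0]],
    advantage=[1.0],
    weight=[[1.2, 1.0]],
    mask=[[1.0, 1.0]],
)
```

Because the weight is positive, multiplying the per-token loss by it gives the same value as multiplying the advantage by it inside the clipped surrogate, which is why the two formulations in the docs agree.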

## SAPO

`--loss_type sapo`
46 changes: 46 additions & 0 deletions examples/train/grpo/internal/fipo.sh
@@ -0,0 +1,46 @@
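# Start the rollout server on GPU 2. The training command below connects to it
# through --vllm_server_host / --vllm_server_port, so this is typically run in a
# separate terminal (or in the background) and left running.
CUDA_VISIBLE_DEVICES=2 \
swift rollout \
--model Qwen/Qwen2.5-1.5B-Instruct

# Train on 2 GPUs with sequence parallelism (--sequence_parallel_size 2)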
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-1.5B-Instruct \
--dataset 'AI-MO/NuminaMath-TIR' \
--reward_funcs accuracy \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 127.0.0.1 \
--vllm_server_port 8000 \
--tuner_type full \
--torch_dtype bfloat16 \
--load_from_cache_file true \
--max_completion_length 4096 \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 2 \
--save_total_limit 3 \
--save_steps 500 \
--logging_steps 1 \
--warmup_ratio 0.05 \
--dataloader_num_workers 8 \
--num_generations 8 \
--temperature 1.0 \
--system """You are a helpful math assistant. Solve the problem step by step and put your final answer within \\boxed{}.""" \
--log_completions true \
--num_iterations 3 \
--padding_free true \
--sequence_parallel_size 2 \
--attn_impl flash_attn \
--beta 0 \
--dynamic_sample true \
--loss_type fipo \
--delta 10.0 \
--epsilon_high 0.28 \
--fipo_decay_rate 32 \
--fipo_clip_range 0.2 \
--fipo_clip_high_only false \
--fipo_safety_threshold 3.0
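
The flag values in this script can be sanity-checked against the constraint stated in the parameter docs (`delta` should be greater than `1 + epsilon_high`). The helper below is a hypothetical sketch for illustration, not part of ms-swift; it also returns the per-token discount implied by `--fipo_decay_rate`.

```python
def check_fipo_flags(delta, epsilon_high, decay_rate):
    """Hypothetical helper: validate delta and return the Future-KL discount."""
    if delta is not None and delta <= 1.0 + epsilon_high:
        raise ValueError(
            f"delta={delta} should be greater than 1 + epsilon_high={1.0 + epsilon_high}"
        )
    return 2.0 ** (-1.0 / decay_rate)


# Values taken from the script above.
gamma = check_fipo_flags(delta=10.0, epsilon_high=0.28, decay_rate=32)
print(f"gamma = {gamma:.4f}, weight 32 tokens ahead = {gamma ** 32:.2f}")
# prints: gamma = 0.9786, weight 32 tokens ahead = 0.50
```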
6 changes: 6 additions & 0 deletions swift/megatron/arguments/megatron_args.py
@@ -79,6 +79,12 @@ class RLHFMegatronArgumentsMixin:
    # REAL https://arxiv.org/abs/2602.05630
    real_tau: float = 0.5

    # FIPO https://arxiv.org/abs/2603.19835
    fipo_decay_rate: float = 32.0
    fipo_clip_range: Optional[float] = 0.2
    fipo_clip_high_only: bool = True
    fipo_safety_threshold: Optional[float] = 4.0

    epsilon: float = 0.2
    epsilon_high: Optional[float] = None
    delta: Optional[float] = None