Support FIPO #9328
Merged
Changes from 6 commits
8 commits
e523ef8
fipo support
0941244
megatron fipo support
56b86c3
metrics update
2a5f80d
update docs
2a37644
Merge branch 'main' into fipo
977c0f0
update fipo
506c943
update docs
033ddf5
Refine Future-KL calculation logic
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,53 @@ | ||
| # FIPO: Future-KL Influenced Policy Optimization | ||
|
|
||
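| [FIPO](https://arxiv.org/abs/2603.19835) is a value-free RL method for long-chain reasoning. It keeps the overall GRPO/DAPO training framework but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to all tokens, it uses a discounted, accumulated Future-KL signal to judge whether the trajectory that follows the current token is, as a whole, being reinforced or suppressed. | ||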
|
|
||
| ## Core Idea | ||
|
|
||
| In GRPO/DAPO, the tokens of each response usually share the same sequence-level advantage: | ||
|
|
||
| $$ | ||
| \hat{A}_{i,t} = \hat{A}_{i} | ||
| $$ | ||
|
|
||
| This approach is stable and simple, but its credit assignment is coarse-grained. FIPO introduces the per-token log-prob shift between the current policy and the old policy: | ||
|
|
||
| $$ | ||
| \Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) - | ||
| \log \pi_{\mathrm{old}}(y_t \mid x, y_{<t}) | ||
| $$ | ||
|
|
||
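| If $\Delta \log p_t > 0$, the current training step is raising the probability of that token; if it is less than 0, the token is being pushed down. FIPO then accumulates this signal, with discounting, from the current position onward: | ||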
|
|
||
| $$ | ||
| \mathrm{FutureKL}_t = | ||
| \sum_{k=t}^{T}\gamma^{k-t} M_k \Delta \log p_k | ||
| $$ | ||
|
|
||
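| where $M_k$ is the completion mask and $\gamma = 2^{-1 / \text{decay\_rate}}$. The larger `decay_rate` is, the more strongly distant future tokens influence the current position; the smaller it is, the more local Future-KL becomes. Future-KL is then mapped to an influence weight: | ||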
|
|
||
| $$ | ||
| f_t = \mathrm{clip}(\exp(\mathrm{FutureKL}_t), 1-\epsilon_f, 1+\epsilon_f) | ||
| $$ | ||
|
|
||
| Finally, the original advantage is replaced by a future-aware advantage: | ||
|
|
||
| $$ | ||
| \tilde{A}_{i,t} = \hat{A}_{i} \cdot f_{i,t} | ||
| $$ | ||
|
|
||
| ## Parameters | ||
|
|
||
| | Parameter | Type | Default | Description | | ||
| |---------------------------|---------|--------|----------------------------------------------------------------------------------------------------------------| | ||
| | `--loss_type` | `str` | `grpo` | Set to `fipo` to enable the FIPO loss | | ||
| | `--delta` | `float` | `None` | When enabled, it is used both for filtering high-IS-ratio tokens in Future-KL and as the dual-clip upper bound of the main loss; it should be greater than `1 + epsilon_high`. To align with the FIPO 32B training script, `10.0` is recommended | | ||
| | `--fipo_decay_rate` | `float` | `32.0` | Half-life parameter for the Future-KL discount; the actual discount is `2 ** (-1 / fipo_decay_rate)` | | ||
| | `--fipo_clip_range` | `float` | `0.2` | Clipping range for the influence weight; `0.2` means clipping to `[0.8, 1.2]` by default | | ||
| | `--fipo_clip_high_only` | `bool` | `true` | If `true`, the weight is clipped only to `[1.0, 1.0 + fipo_clip_range]`, which favors amplifying positive Future-KL | | ||
| | `--fipo_detach_weight` | `bool` | `true` | Whether to stop gradients through the influence weight | | ||
| | `--fipo_safety_threshold` | `float` | `4.0` | When the advantage is negative and the IS ratio exceeds this threshold, the FIPO weight is limited to `[0.8, 1.0]` to avoid over-penalization | | ||
|
|
||
|
|
||
| ## Training Example | ||
|
|
||
| [swift](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/fipo.sh) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,6 +6,7 @@ Advanced Research | |
| entropy_mask.md | ||
| CISPO.md | ||
| DAPO.md | ||
| FIPO.md | ||
| deepeyes.md | ||
| GSPO.md | ||
| CHORD.md | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # FIPO: Future-KL Influenced Policy Optimization | ||
|
|
||
| [FIPO](https://arxiv.org/abs/2603.19835) is a value-free RL method for eliciting longer and deeper reasoning. It keeps the GRPO/DAPO training scaffold, but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, FIPO uses a discounted Future-KL signal to estimate whether the future trajectory after each token is being reinforced or suppressed. | ||
|
|
||
| ## Core Idea | ||
|
|
||
| In GRPO/DAPO, tokens in the same response usually share the same sequence-level advantage: | ||
|
|
||
| $$ | ||
| \hat{A}_{i,t} = \hat{A}_{i} | ||
| $$ | ||
|
|
||
| This is simple and stable, but the credit assignment is coarse. FIPO starts from the signed log-probability shift between the current policy and the old policy: | ||
|
|
||
| $$ | ||
| \Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) - | ||
| \log \pi_{\mathrm{old}}(y_t \mid x, y_{<t}) | ||
| $$ | ||
|
|
||
| A positive value means the token probability is being increased by the current update, while a negative value means it is being suppressed. FIPO then accumulates this signal from the current token to the end of the response: | ||
|
|
||
| $$ | ||
| \mathrm{FutureKL}_t = | ||
| \sum_{k=t}^{T}\gamma^{k-t} M_k \Delta \log p_k | ||
| $$ | ||
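Because $\mathrm{FutureKL}_t = M_t \Delta \log p_t + \gamma\,\mathrm{FutureKL}_{t+1}$, the discounted sum can be computed in one backward pass over the response. The following is a minimal pure-Python sketch of that recursion, not the implementation in this PR; the function name `future_kl` and its list-based inputs are assumptions made for illustration.

```python
def future_kl(delta_logp, mask, decay_rate=32.0):
    """Discounted suffix sum of masked log-prob shifts (illustrative sketch).

    delta_logp[k] is log pi_theta - log pi_old for token k,
    mask[k] is 1 for completion tokens and 0 otherwise.
    """
    gamma = 2.0 ** (-1.0 / decay_rate)
    out = [0.0] * len(delta_logp)
    running = 0.0
    # Walk from the last token to the first so each position
    # reuses the already-discounted tail.
    for t in range(len(delta_logp) - 1, -1, -1):
        running = mask[t] * delta_logp[t] + gamma * running
        out[t] = running
    return out


# decay_rate=1.0 gives gamma=0.5; the third token is masked out.
example = future_kl([1.0, 1.0, 1.0], [1, 1, 0], decay_rate=1.0)
```

With `gamma = 0.5`, the last position contributes nothing because it is masked, the second position keeps only its own shift, and the first adds half of the second on top of its own.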
|
|
||
| where $M_k$ is the completion mask and $\gamma = 2^{-1 / \text{decay\_rate}}$. A larger `decay_rate` gives farther future tokens more influence; a smaller value makes the signal more local. FIPO maps the Future-KL value into a bounded influence weight: | ||
|
|
||
| $$ | ||
| f_t = \mathrm{clip}(\exp(\mathrm{FutureKL}_t), 1-\epsilon_f, 1+\epsilon_f) | ||
| $$ | ||
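As a small illustration of the clipping step, the sketch below maps a single Future-KL value to an influence weight. The function name `influence_weight` is hypothetical, and the `clip_high_only` branch is an assumption based on the `--fipo_clip_high_only` description in the parameter table rather than on the PR's code.

```python
import math


def influence_weight(future_kl_t, clip_range=0.2, clip_high_only=False):
    """Clip exp(FutureKL_t) into a bounded weight (illustrative sketch)."""
    # Assumed from the parameter table: high-only clipping raises the floor to 1.0.
    low = 1.0 if clip_high_only else 1.0 - clip_range
    high = 1.0 + clip_range
    return min(max(math.exp(future_kl_t), low), high)


neutral = influence_weight(0.0)                              # exp(0) stays at 1.0
boosted = influence_weight(5.0)                              # clipped at the upper bound
damped = influence_weight(-5.0)                              # clipped at the lower bound
floored = influence_weight(-5.0, clip_high_only=True)        # floor raised to 1.0
```

The bounds keep any single token from scaling its advantage by more than `clip_range` in either direction, however large the accumulated Future-KL becomes.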
|
|
||
| The original advantage is then replaced by a future-aware advantage: | ||
|
|
||
| $$ | ||
| \tilde{A}_{i,t} = \hat{A}_{i} \cdot f_{i,t} | ||
| $$ | ||
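To make the effect of the reweighting concrete, here is a tiny sketch that applies per-token influence weights to one sequence-level advantage. The helper name `future_aware_advantages` and the sample weights are made up for illustration.

```python
def future_aware_advantages(seq_advantage, weights):
    """Scale one sequence-level advantage by per-token influence weights."""
    return [seq_advantage * f for f in weights]


weights = [1.2, 1.0, 0.8]  # hypothetical influence weights for three tokens
positive = future_aware_advantages(1.0, weights)
negative = future_aware_advantages(-1.0, weights)
```

The weight changes only the magnitude of the advantage, never its sign: a weight above 1 enlarges the update for a positive advantage and enlarges the penalty for a negative one.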
|
|
||
| ## Parameters | ||
|
|
||
|
|
||
|
|
||
| | Parameter | Type | Default | Description | | ||
| | ------------------------- | ------- | ------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | `--loss_type` | `str` | `grpo` | Set to `fipo` to enable the FIPO loss | | ||
| | `--delta` | `float` | `None` | When enabled, it is used for both Future-KL high-IS-ratio token filtering and the main-loss dual-clip upper bound, and should be greater than `1 + epsilon_high`. Set it to `10.0` to match the official 32B script | | ||
| | `--fipo_decay_rate` | `float` | `32.0` | Half-life parameter for Future-KL; the actual discount is `2 ** (-1 / fipo_decay_rate)` | | ||
| | `--fipo_clip_range` | `float` | `0.2` | Influence weight clipping range; `0.2` clips to `[0.8, 1.2]` (or `[1.0, 1.2]` when `fipo_clip_high_only` is `true`) | | ||
| | `--fipo_clip_high_only` | `bool` | `true` | If `true`, clips the weight to `[1.0, 1.0 + fipo_clip_range]` | | ||
| | `--fipo_detach_weight` | `bool` | `true` | Whether to stop gradients through the influence weight | | ||
| | `--fipo_safety_threshold` | `float` | `4.0` | Caps the FIPO weight to `[0.8, 1.0]` for negative-advantage tokens whose IS ratio exceeds this threshold | | ||
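
The "half-life" reading of `--fipo_decay_rate` can be checked with a few lines of arithmetic. This sketch only evaluates the discount formula from the table; the helper name `discount` is an assumption for illustration.

```python
def discount(decay_rate):
    """Per-token discount gamma = 2 ** (-1 / decay_rate)."""
    return 2.0 ** (-1.0 / decay_rate)


gamma = discount(32.0)              # roughly 0.9786 per token
weight_after_32 = gamma ** 32       # contribution of a token 32 positions ahead
weight_after_64 = gamma ** 64       # two half-lives ahead
```

With the default of `32.0`, a token 32 positions in the future contributes about half as much as the current token, and one 64 positions ahead about a quarter.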
|
|
||
| ## Training Example | ||
|
|
||
| [swift](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/fipo.sh) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,6 +6,7 @@ Advanced Research | |
| entropy_mask.md | ||
| CISPO.md | ||
| DAPO.md | ||
| FIPO.md | ||
| deepeyes.md | ||
| GSPO.md | ||
| CHORD.md | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
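| # Start the vLLM rollout server on its own GPU; the training command below connects to it via --vllm_server_host/--vllm_server_port | ||
| CUDA_VISIBLE_DEVICES=2 \ | ||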
| swift rollout \ | ||
| --model Qwen/Qwen2.5-1.5B-Instruct | ||
|
|
||
| # Train on 2 GPUs with sequence parallelism | ||
| NPROC_PER_NODE=2 \ | ||
| CUDA_VISIBLE_DEVICES=0,1 \ | ||
| swift rlhf \ | ||
| --rlhf_type grpo \ | ||
| --model Qwen/Qwen2.5-1.5B-Instruct \ | ||
| --dataset 'AI-MO/NuminaMath-TIR' \ | ||
| --reward_funcs accuracy \ | ||
| --use_vllm true \ | ||
| --vllm_mode server \ | ||
| --vllm_server_host 127.0.0.1 \ | ||
| --vllm_server_port 8000 \ | ||
| --tuner_type full \ | ||
| --torch_dtype bfloat16 \ | ||
| --load_from_cache_file true \ | ||
| --max_completion_length 4096 \ | ||
| --num_train_epochs 1 \ | ||
| --per_device_train_batch_size 8 \ | ||
| --learning_rate 1e-6 \ | ||
| --gradient_accumulation_steps 2 \ | ||
| --save_total_limit 3 \ | ||
| --save_steps 500 \ | ||
| --logging_steps 1 \ | ||
| --warmup_ratio 0.05 \ | ||
| --dataloader_num_workers 8 \ | ||
| --num_generations 8 \ | ||
| --temperature 1.0 \ | ||
| --system """You are a helpful math assistant. Solve the problem step by step and put your final answer within \\boxed{}.""" \ | ||
| --log_completions true \ | ||
| --num_iterations 3 \ | ||
| --padding_free true \ | ||
| --sequence_parallel_size 2 \ | ||
| --attn_impl flash_attn \ | ||
| --beta 0 \ | ||
| --dynamic_sample true \ | ||
| --loss_type fipo \ | ||
| --delta 10.0 \ | ||
| --epsilon_high 0.28 \ | ||
| --fipo_decay_rate 32 \ | ||
| --fipo_clip_range 0.2 \ | ||
| --fipo_clip_high_only true \ | ||
| --fipo_detach_weight true \ | ||
| --fipo_safety_threshold 10.0 | ||