Fugu-MT Paper Translation (Abstract): IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

Paper abstract: IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

arxiv url: http://arxiv.org/abs/2604.20933v1
Date: Wed, 22 Apr 2026 11:52:21 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.103591
Title: IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
Authors: Wenjie Liao, Like Wu, Liangjie Zhao, Shihui Xu, Shigeru Fujimura
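Abstract summary: Self-play fine-tuning can improve large language models beyond supervised fine-tuning without additional human annotations. IRIS (Interpolative Rényi Iterative Self-play) is a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. In experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks, IRIS improves upon baselines, reaching an average score of 44.57%.
Reference score (independently computed attention metric): 1.4474373238664187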
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-play fine-tuning enables large language models to improve beyond supervised fine-tuning without additional human annotations by contrasting annotated responses with self-generated ones. Many existing methods rely on a fixed divergence regime. SPIN is closely related to a KL-based regime, SPACE to a Jensen-Shannon-style objective via noise contrastive estimation, and SPIF to $χ^2$-regularized self-play. Since these divergences exhibit different strengths depending on the distributional gap between model and target, no single choice appears to provide favorable learning dynamics across training stages. We propose IRIS (Interpolative Rényi Iterative Self-play), a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. IRIS decomposes into two independent tilted risk terms over annotated and synthetic data, with exponential importance weights controlled by the order parameter $α$. We show that several self-play objectives can be interpreted as limiting or representative regimes at particular values of $α$, providing a unified theoretical perspective on these methods. An adaptive order schedule further adjusts $α$ to the distributional gap, shifting from sharper importance weighting early in training to smoother refinement near convergence. Theoretically, we establish the fixed-point property of IRIS and analyze how $α$ controls gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks show that IRIS improves upon baselines, reaching 44.57% average score with gains across iterations. In our setting, IRIS with only 26k annotated samples surpasses standard supervised fine-tuning trained on the full 200k dataset.
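Note: The abstract mentions tilted risk terms with exponential importance weights but does not give the objective itself. The following is a minimal sketch of a generic exponentially tilted risk, (1/t) log mean(exp(t * loss)), and its normalized weights, not the IRIS objective. The function names and the tilt parameter t are hypothetical; how t relates to the Rényi order $α$ is defined in the paper and is not assumed here.

```python
import math


def tilted_risk(losses, t):
    """Generic exponentially tilted risk: (1/t) * log(mean(exp(t * loss))).

    `t` is a hypothetical tilt parameter, not the paper's alpha.
    At t == 0 the tilted risk reduces to the plain average loss.
    """
    if t == 0:
        return sum(losses) / len(losses)
    # Subtract the maximum exponent for numerical stability (log-sum-exp trick).
    m = max(t * x for x in losses)
    mean_exp = sum(math.exp(t * x - m) for x in losses) / len(losses)
    return (m + math.log(mean_exp)) / t


def tilt_weights(losses, t):
    """Normalized exponential importance weights, i.e. softmax(t * loss).

    These are the per-sample weights that appear in the gradient of
    tilted_risk with respect to the individual losses.
    """
    m = max(t * x for x in losses)
    unnorm = [math.exp(t * x - m) for x in losses]
    z = sum(unnorm)
    return [u / z for u in unnorm]


if __name__ == "__main__":
    sample_losses = [1.0, 2.0, 3.0]
    for t in (0.0, 1.0, 10.0):
        print(t, tilted_risk(sample_losses, t), tilt_weights(sample_losses, t))
```

In this sketch, a tilt near zero gives near-uniform weights and a risk close to the mean loss, while a larger tilt concentrates weight on the highest-loss samples. This is one way to picture the shift between sharper and smoother importance weighting that the abstract describes.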
