Fugu-MT Paper Translation (Abstract): Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

Paper abstract: Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

arxiv url: http://arxiv.org/abs/2605.00610v1
Date: Fri, 01 May 2026 12:20:44 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.946648
Title: Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
Title (reference translation): Decomposition before Integration: Test-time Synthesis of SFT and RLVR Task Vectors
Authors: Chaohao Yuan, Chenghao Xiao, Yu Rong, Hong Cheng, Long-Kai Huang,
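Abstract summary: We analyze SFT and RLVR through the lens of task vectors. We propose Decoupled Test-time Synthesis (DoTS), which allows SFT and RLVR checkpoints to be trained independently.
Reference score (attention score computed independently by this site): 26.233592394784868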
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: SFT and RLVR represent two fundamental yet distinct paradigms for LLM post-training, each excelling in a different dimension. SFT expands knowledge breadth while RLVR enhances reasoning depth. Yet integrating these complementary strengths remains a formidable challenge. Sequential training can cause catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures: a 30* magnitude disparity, 45* sign interference, and heterogeneous module-wise update distributions. These findings show that SFT and RLVR are difficult to integrate directly, but they also suggest that the two paradigms modify partly complementary components of the model. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework that allows SFT and RLVR checkpoints to be trained independently and synthesizes their capabilities only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT-RLVR integration methods across multiple mathematical reasoning benchmarks, while incurring only about 3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. Code is available at https://github.com/chaohaoyuan/DoTS.
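Abstract (reference translation): SFT and RLVR are two fundamental paradigms for LLM post-training, each excelling in a different dimension. SFT broadens the range of knowledge, while RLVR deepens reasoning. Integrating these complementary strengths, however, remains a very difficult challenge. Sequential training leads to catastrophic forgetting, and joint optimization often suffers from severe gradient conflicts. We analyze SFT and RLVR through the lens of task vectors and reveal three structural properties behind these failures. These results suggest that SFT and RLVR are difficult to integrate directly. Motivated by these observations, we propose Decoupled Test-time Synthesis (DoTS), a post-hoc framework in which SFT and RLVR checkpoints are trained independently and their capabilities are synthesized only at inference time via task vector arithmetic, without updating model parameters. To reduce interference, DoTS applies selective sparsification with norm-preserving rescaling. It then uses Bayesian optimization on a small set of unlabeled queries to search for combination coefficients on the Pareto frontier of consistency and perplexity. Empirically, DoTS matches or exceeds the performance of training-based SFT-RLVR integration methods across multiple mathematical reasoning benchmarks at only about 3% of the computational cost. When applied to stronger post-trained checkpoints, DoTS surpasses SOTA models and generalizes to out-of-domain benchmarks without re-tuning. The code is available at https://github.com/chaohaoyuan/DoTS.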
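
Illustrative sketch (not taken from the paper or its repository): the abstract describes forming task vectors from independently trained checkpoints, sparsifying them with norm-preserving rescaling, and combining them at inference time with scalar coefficients. The NumPy code below shows one plausible reading of those steps on toy parameter dictionaries. Every function name, the magnitude-based top-k selection rule, the `keep_ratio` parameter, and the coefficients `alpha` and `beta` are assumptions made for illustration; the paper's actual selection rule may differ, and its Bayesian-optimization coefficient search is not covered here.

```python
import numpy as np


def task_vector(finetuned, base):
    """Task vector: per-parameter difference between a fine-tuned checkpoint and its base."""
    return {name: finetuned[name] - base[name] for name in base}


def _flatten(tv):
    """Concatenate all parameter deltas into one 1-D array."""
    return np.concatenate([v.ravel() for v in tv.values()])


def magnitude_ratio(tv_a, tv_b):
    """Ratio of the L2 norms of two task vectors (one way to quantify magnitude disparity)."""
    return float(np.linalg.norm(_flatten(tv_a)) / np.linalg.norm(_flatten(tv_b)))


def sign_conflict_fraction(tv_a, tv_b):
    """Fraction of coordinates updated by both vectors whose update signs disagree."""
    a, b = _flatten(tv_a), _flatten(tv_b)
    both = (a != 0) & (b != 0)
    if not both.any():
        return 0.0
    return float(np.mean(np.sign(a[both]) != np.sign(b[both])))


def sparsify_keep_norm(delta, keep_ratio):
    """Assumed rule: keep the largest-magnitude entries, zero the rest,
    then rescale so the L2 norm matches the original delta."""
    mags = np.abs(delta).ravel()
    k = max(1, int(round(keep_ratio * mags.size)))
    # k-th largest magnitude; ties at the threshold may keep slightly more than k entries.
    threshold = np.partition(mags, mags.size - k)[mags.size - k]
    sparse = np.where(np.abs(delta) >= threshold, delta, 0.0)
    kept_norm = np.linalg.norm(sparse)
    if kept_norm == 0.0:
        return sparse
    return sparse * (np.linalg.norm(delta) / kept_norm)


def synthesize(base, tv_sft, tv_rl, alpha, beta, keep_ratio=0.2):
    """Merged weights = base + alpha * sparse(SFT delta) + beta * sparse(RLVR delta).
    No training step is involved; alpha and beta are hand-chosen placeholders here."""
    return {
        name: w
        + alpha * sparsify_keep_norm(tv_sft[name], keep_ratio)
        + beta * sparsify_keep_norm(tv_rl[name], keep_ratio)
        for name, w in base.items()
    }
```

In a setting like the one the abstract describes, `alpha` and `beta` would be chosen by a search procedure over unlabeled queries rather than fixed by hand; this sketch only covers the arithmetic that such a search would evaluate.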

Related papers