Fugu-MT paper translation (abstract): SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Paper summary: SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

arxiv url: http://arxiv.org/abs/2604.08477v1
Date: Thu, 09 Apr 2026 17:16:07 GMT
Status: translation complete
Last updated in system: 2026-04-10 18:34:06.044162
Title: SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Title (reference translation): SUPERNOVA: Eliciting General Reasoning in LLMs through Reinforcement Learning on Natural Instructions
Authors: Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel,
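Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Extending RLVR to general reasoning is constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. This paper proposes SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning.
Reference score (independently computed attention score): 17.62959060143886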
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
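Abstract (reference translation): Reinforcement Learning with Verifiable Rewards (RLVR) has greatly improved the reasoning of large language models (LLMs) in formal domains such as mathematics and code. Despite this progress, LLMs still struggle with general reasoning tasks that require capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally limited by the lack of high-quality, verifiable training data covering diverse reasoning skills. To address this problem, the authors propose SUPERNOVA, a data curation framework for RLVR aimed at strengthening general reasoning. Their key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. They run more than 100 controlled RL experiments to analyze how data design choices affect downstream reasoning performance, examining three factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. The analysis shows that source task selection is non-trivial and has a large effect on downstream reasoning performance. In addition, selecting tasks by their performance on individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks such as BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. The findings offer practical guidance for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.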
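
The contrast the abstract draws between selecting source tasks by overall average performance and selecting them per target task can be illustrated with a minimal sketch. This is not the authors' implementation: the score table, the task names (`source_a`, `causal`, and so on), and both helper functions are hypothetical and exist only to show how the two strategies can pick different source tasks.

```python
from statistics import mean

# Hypothetical scores, NOT results from the paper:
# scores[source_task][target_task] = downstream score after training on source_task.
scores = {
    "source_a": {"causal": 0.60, "temporal": 0.20},
    "source_b": {"causal": 0.20, "temporal": 0.62},
    "source_c": {"causal": 0.45, "temporal": 0.45},
}


def select_by_average(scores, k):
    """Pick the k source tasks with the highest mean score over all targets."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]


def select_per_target(scores, k):
    """For each target task, pick the k source tasks that score best on it."""
    targets = sorted({t for row in scores.values() for t in row})
    return {
        t: sorted(scores, key=lambda s: scores[s][t], reverse=True)[:k]
        for t in targets
    }


if __name__ == "__main__":
    print("average-based:", select_by_average(scores, k=1))
    print("per-target:   ", select_per_target(scores, k=1))
```

In this toy table the average-based strategy chooses `source_c`, which is never the best source for any single target, while the per-target strategy chooses `source_a` for `causal` and `source_b` for `temporal`.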


Related papers