Fugu-MT Paper Translation (Abstract): Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Paper overview: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

arxiv url: http://arxiv.org/abs/2512.07461v1
Date: Mon, 08 Dec 2025 11:39:43 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.864103
Title: Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Title (reference translation): Native Parallel Reasoner: Reasoning in Parallel via Self-Distilled Reinforcement Learning
Authors: Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng
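Abstract summary: We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations.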
Reference score (independently computed attention score): 68.9332598692234
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
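Abstract (reference translation): We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. Through three key innovations, NPR transforms the model from sequential emulation to native parallel cognition: 1) a self-distilled progressive training paradigm that moves from "cold-start" format discovery to strict topological constraints without external supervision; 2) a Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, letting the model learn adaptive decomposition through trial and error; and 3) a robust NPR Engine that refactors SGLang's memory management and flow control to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.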

Related papers