Fugu-MT Paper Translation (Summary): Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

Paper summary: Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

arxiv url: http://arxiv.org/abs/2601.03948v2
Date: Thu, 08 Jan 2026 02:48:58 GMT
Status: Translation complete
Last updated in system: 2026-01-09 13:05:36.790631
Title: Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification
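Authors: Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Cheng Hua, Zuo Bai
Abstract summary: Trade-R1 is a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. The authors construct a triangular consistency metric that assesses pairwise alignment between retrieved evidence, reasoning chains, and decisions. Experiments on asset selection across different countries demonstrate that the paradigm reduces reward hacking.
Reference score (independently computed attention metric): 35.41216970580546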
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision-making is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric that assesses pairwise alignment between retrieved evidence, reasoning chains, and decisions, and serves as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on asset selection across different countries' markets demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
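The abstract names a triangular consistency metric and two reward integration strategies but gives no formulas. The sketch below is one hypothetical reading, not the authors' method. It assumes that the three pairwise alignment scores lie in [0, 1] and are averaged, that FSR adds a fixed-weight consistency term to the market return, and that DSR scales the return by the consistency score. All names, the `weight` parameter, and the aggregation rule are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PairwiseScores:
    """Hypothetical alignment scores in [0, 1], one per edge of the triangle."""
    evidence_reasoning: float
    reasoning_decision: float
    evidence_decision: float


def triangular_consistency(scores: PairwiseScores) -> float:
    # Assumption: aggregate the three pairwise scores by their mean.
    total = (
        scores.evidence_reasoning
        + scores.reasoning_decision
        + scores.evidence_decision
    )
    return total / 3.0


def fixed_effect_reward(market_return: float, consistency: float,
                        weight: float = 0.5) -> float:
    # Assumption: FSR adds a consistency term whose weight is fixed,
    # independent of how large the market return is.
    return market_return + weight * consistency


def dynamic_effect_reward(market_return: float, consistency: float) -> float:
    # Assumption: DSR couples the two signals by scaling the return
    # with the consistency score, so poorly grounded reasoning earns
    # little credit for a return that may be due to noise.
    return market_return * consistency


if __name__ == "__main__":
    scores = PairwiseScores(
        evidence_reasoning=0.9,
        reasoning_decision=0.8,
        evidence_decision=0.7,
    )
    c = triangular_consistency(scores)
    print("consistency:", c)
    print("FSR (assumed form):", fixed_effect_reward(0.02, c))
    print("DSR (assumed form):", dynamic_effect_reward(0.02, c))
```

Under these assumptions the additive form keeps the alignment signal stable regardless of return size, while the multiplicative form ties the reward magnitude to how well the reasoning is grounded. The paper itself should be consulted for the actual definitions.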

Related papers