Fugu-MT paper translation (abstract): Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis

Paper summary: Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis

arxiv url: http://arxiv.org/abs/2508.13382v1
Date: Mon, 18 Aug 2025 21:58:18 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.733207
Title: Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
Title (reference translation): Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
Authors: Ayoub Ben Chaliah, Hela Dellagi
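Abstract summary: This paper presents Datarus-R1-14B, a language model fine-tuned from Qwen 2.5-14B-Instruct. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories that include reasoning steps, code execution, error traces, self-corrections, and final conclusions.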
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by <think> and <answer> tags. On demanding postgraduate-level problems, Datarus exhibits an "AHA-moment" pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks Datarus surpasses similar-size models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.
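The abstract describes a cosine curriculum that shifts emphasis from the structural reward to the HRM's semantic reward over training, but it does not give the schedule. The following is a minimal sketch of one way such a blend could look, assuming a half-cosine weight that starts at 1.0 and decays to 0.0. The function names, the linear mixing rule, and the `total_steps` parameter are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_weight(step: int, total_steps: int) -> float:
    """Hypothetical weight on the structural reward.

    Assumed schedule: 1.0 at step 0, decaying along a half cosine
    to 0.0 at total_steps. The paper's actual schedule is not stated
    in the abstract.
    """
    # Clamp progress to [0, 1] so steps outside the range stay well-defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def blended_reward(structural: float, semantic: float,
                   step: int, total_steps: int) -> float:
    """Mix a tag-based structural score with an HRM-style semantic score.

    Early in training the structural term dominates; late in training
    the semantic term dominates (assumed linear mixing).
    """
    w = cosine_weight(step, total_steps)
    return w * structural + (1.0 - w) * semantic
```

Under these assumptions, a response with a perfect structural score and a zero semantic score earns full reward at the first step and none at the last, which is the intended shift from format fidelity to content quality.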
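Abstract (reference translation): This paper introduces Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories that include reasoning steps, code execution, error traces, self-corrections, and final conclusions. The training pipeline combines (i) a trajectory-centric synthetic data generator that produced 144,000 tagged notebook episodes, (ii) a dual-reward framework that blends a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) scoring both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by <think> and <answer> tags. On demanding postgraduate-level problems, Datarus exhibits an "AHA-moment" pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks, Datarus surpasses similar-size models and reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.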
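The reflection-mode format and the tag-based structural signal mentioned in the abstract can be illustrated with a small parser. This is a sketch only: the abstract names the <think> and <answer> tags but not the exact grammar, so the closing tags (`</think>`, `</answer>`), the requirement that each tag appears exactly once, and the binary 0/1 score are all assumptions, as are the function names.

```python
import re
from typing import Optional, Tuple

# Assumed layout: one <think>...</think> block followed by one
# <answer>...</answer> block, with only whitespace around them.
_REFLECTION = re.compile(
    r"\A\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*\Z",
    re.DOTALL,
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def parse_reflection(output: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer) if the output matches the assumed layout."""
    # Reject outputs where any tag is missing or repeated.
    if any(output.count(tag) != 1 for tag in _TAGS):
        return None
    match = _REFLECTION.match(output)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def structural_reward(output: str) -> float:
    """Hypothetical binary structural signal: 1.0 if well-formed, else 0.0."""
    return 1.0 if parse_reflection(output) is not None else 0.0
```

For example, `parse_reflection("<think>2 + 2</think>\n<answer>4</answer>")` yields the reasoning and answer as a pair of strings, while an untagged response parses to `None` and scores 0.0. A score like this checks only the format; judging the content is the role the abstract assigns to the HRM.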

Related papers