Fugu-MT Paper Translation (Abstract): DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

Paper overview: DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

arxiv url: http://arxiv.org/abs/2508.08589v1
Date: Tue, 12 Aug 2025 03:06:55 GMT
Status: Translation complete
Last updated in system: 2025-08-13 21:07:34.281769
Title: DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Authors: Wenwen Yu, Zhibo Yang, Yuliang Liu, Xiang Bai
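Abstract summary: The paper proposes DocThinker, a rule-based reinforcement learning framework for dynamic inference-time reasoning. The method mitigates catastrophic forgetting and improves both adaptability and transparency. The findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.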
Reference score (independently computed attention score): 66.07724324530844
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at https://github.com/wenwenyu/DocThinker.
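The abstract names four output parts (reasoning process, rephrased question, RoI, final answer) and multi-objective rule-based rewards, but gives no formulas. The following is a minimal sketch of what such a reward could look like, not the authors' implementation. The tag names, the exact-match answer check, the IoU-based RoI score, and the equal weights are all assumptions made for illustration; the KL-constrained policy update is not shown.

```python
import re
from typing import Optional, Sequence

# Hypothetical tag names: the abstract lists these four parts but not their markup.
TAGS = ("think", "rephrase", "roi", "answer")


def extract(tag: str, text: str) -> Optional[str]:
    """Return the content of the first <tag>...</tag> pair, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def format_reward(text: str) -> float:
    """1.0 if every expected section is present, else 0.0 (assumed rule)."""
    return 1.0 if all(extract(t, text) is not None for t in TAGS) else 0.0


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def roi_reward(text: str, gold_box: Sequence[float]) -> float:
    """IoU between the predicted and reference region (assumed rule)."""
    raw = extract("roi", text)
    if raw is None:
        return 0.0
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw)
    if len(numbers) != 4:
        return 0.0
    return iou([float(n) for n in numbers], gold_box)


def answer_reward(text: str, gold_answer: str) -> float:
    """Case-insensitive exact match on the final answer (assumed rule)."""
    pred = extract("answer", text)
    if pred is None:
        return 0.0
    return 1.0 if pred.lower() == gold_answer.strip().lower() else 0.0


def total_reward(text: str, gold_answer: str, gold_box: Sequence[float],
                 weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three rule-based terms; the weights are placeholders."""
    parts = (format_reward(text),
             answer_reward(text, gold_answer),
             roi_reward(text, gold_box))
    return sum(w * p for w, p in zip(weights, parts))


# Made-up model output used only to exercise the functions above.
sample = ("<think>The total is in the last table row.</think>"
          "<rephrase>What is the invoice total?</rephrase>"
          "<roi>[10, 20, 110, 70]</roi>"
          "<answer>$42.00</answer>")
print(total_reward(sample, "$42.00", (10, 20, 110, 70)))
```

Because each term is computed by a fixed rule rather than a learned reward model, a low score can be traced to a specific cause: a missing section, a wrong answer, or a misplaced region.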

Related papers