Fugu-MT Paper Translation (Abstract): URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Paper overview: URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

arxiv url: http://arxiv.org/abs/2511.10552v1
Date: Fri, 14 Nov 2025 01:57:33 GMT
Status: Translation complete
System update date: 2025-11-14 22:53:22.923838
Title: URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding
Authors: Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin
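Abstract summary: This paper presents URaG, a framework that unifies retrieval and generation within a single MLLM. URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.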
Reference score (independently computed attention score): 55.45331924836242
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
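The evidence-selection idea in the abstract (score pages using early-layer representations, keep the most relevant ones, and pass only those to the deeper layers) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the retrieval module, so the max-similarity cosine scoring, the `top_k` parameter, and the function names `page_score`, `select_pages`, and `prune_tokens` are all assumptions made for illustration. Token vectors are plain Python lists standing in for early-layer hidden states.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def page_score(query_vecs, page_vecs):
    """Assumed scoring rule: for each query token, take its best-matching
    page token, then average over the query tokens."""
    best = [max(cosine(q, p) for p in page_vecs) for q in query_vecs]
    return sum(best) / len(best)


def select_pages(query_vecs, pages, top_k):
    """Return the indices of the top_k highest-scoring pages (in document
    order) together with the score of every page."""
    scores = [page_score(query_vecs, page) for page in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k]), scores


def prune_tokens(pages, kept):
    """Keep only the tokens of the selected pages; in this sketch these are
    the tokens the deeper layers would go on to process."""
    return [tok for i in kept for tok in pages[i]]


# Toy example: one query token and four "pages" of 2-D token vectors.
query = [[1.0, 0.0]]
pages = [
    [[0.0, 1.0], [0.0, 1.0]],   # unrelated to the query
    [[1.0, 0.0], [0.9, 0.1]],   # strongly related
    [[-1.0, 0.0]],              # opposite direction
    [[0.7, 0.7]],               # partially related
]
kept, scores = select_pages(query, pages, top_k=2)
print(kept)                          # [1, 3]
print(len(prune_tokens(pages, kept)))  # 3
```

Because self-attention cost grows quadratically with sequence length, dropping whole pages after the early layers shrinks the sequence seen by the remaining layers. How much is saved depends on how many pages are kept and at which layer selection happens, neither of which this sketch models.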

