Fugu-MT Paper Translation (Abstract): OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

Paper overview: OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

arxiv url: http://arxiv.org/abs/2604.13073v1
Date: Fri, 20 Mar 2026 17:25:00 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.659725
Title: OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
Authors: Qianqi Yan, Yichen Guo, Ching-Chen Kuo, Shan Jiang, Hang Yin, Yang Zhao, Xin Eric Wang
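Abstract summary: We introduce OmniTrace, a lightweight, model-agnostic framework that formalizes attribution as a generation-time tracing problem. We show that generation-aware span-level attribution yields more stable and interpretable explanations than self-attribution. These results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
Reference score (independently computed attention level): 31.589945976149973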
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
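The aggregation step described in the abstract (token-level signals pooled into span-level, confidence-weighted source selections) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_span_attribution`, the parameters `top_k` and `min_share`, and the per-token normalization are all assumptions made for illustration, and the temporal-coherence component mentioned in the abstract is omitted.

```python
from typing import List, Sequence, Tuple


def aggregate_span_attribution(
    token_scores: Sequence[Sequence[float]],
    token_confidence: Sequence[float],
    spans: Sequence[Tuple[int, int]],
    source_ids: Sequence[str],
    top_k: int = 2,
    min_share: float = 0.2,
) -> List[List[str]]:
    """Hypothetical helper: pick supporting sources for each generated span.

    token_scores[t][s] is any non-negative token-level signal (for example an
    attention weight) linking generated token t to input source s.
    token_confidence[t] weights token t (for example its output probability).
    spans are half-open [start, end) ranges over generated token indices.
    """
    results: List[List[str]] = []
    for start, end in spans:
        totals = [0.0] * len(source_ids)
        for t in range(start, end):
            row = token_scores[t]
            row_sum = sum(row)
            if row_sum <= 0:
                continue  # token carries no usable signal
            for s, score in enumerate(row):
                # Normalize per token, then weight by the token's confidence.
                totals[s] += token_confidence[t] * score / row_sum
        mass = sum(totals)
        if mass <= 0:
            results.append([])
            continue
        shares = [v / mass for v in totals]
        ranked = sorted(range(len(source_ids)), key=lambda s: shares[s], reverse=True)
        # Keep at most top_k sources, and only those with a meaningful share.
        results.append([source_ids[s] for s in ranked[:top_k] if shares[s] >= min_share])
    return results


# Toy example with made-up numbers: four generated tokens, three input sources.
sources = ["text_0", "image_0", "audio_0"]
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.1, 0.6],
]
confidence = [0.9, 0.8, 0.95, 0.5]
print(aggregate_span_attribution(scores, confidence, [(0, 2), (2, 4)], sources))
```

In this toy input the first span is dominated by the image source and the second by the audio source, so each span keeps a single supporting source once the `min_share` cutoff is applied.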