Fugu-MT Paper Translation (Abstract): Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Paper overview: Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

arxiv url: http://arxiv.org/abs/2605.29402v1
Date: Thu, 28 May 2026 05:53:34 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.772203
Title: Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge
Title (reference translation): A Solution for the HD-EPIC VQA Challenge
Authors: Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li
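Abstract summary: This paper proposes a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. The results show that explicitly structuring, retrieving, and integrating semantic and visual evidence is important for video understanding with MLLMs.
Reference score (independently computed attention score): 9.253622130813044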
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.
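The query-conditioned retrieval and integration step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Evidence` record, the toy three-dimensional embeddings, the cosine-similarity ranking, and the helper names `retrieve` and `build_context` are all assumptions made for illustration. It only shows the general idea of selecting the most relevant entries from a semantic store and an object-centric visual store, then merging them into one context for an MLLM.

```python
import math
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence record; field names are assumptions of this sketch."""
    kind: str           # "semantic" (procedural step) or "visual" (object-centric)
    time_s: float       # timestamp in the video, in seconds
    text: str           # step description or object label
    embedding: list     # toy embedding; a real system would use a learned encoder
    bbox: tuple = None  # (x1, y1, x2, y2), only set for visual evidence


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, store, k):
    """Return the k entries of one store most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(query_emb, e.embedding), reverse=True)
    return ranked[:k]


def build_context(query_emb, semantic_store, visual_store, k=2):
    """Select evidence from both stores and merge it in temporal order."""
    picked = retrieve(query_emb, semantic_store, k) + retrieve(query_emb, visual_store, k)
    picked.sort(key=lambda e: e.time_s)
    lines = []
    for e in picked:
        box = f" bbox={e.bbox}" if e.bbox is not None else ""
        lines.append(f"[{e.time_s:.1f}s] {e.kind}: {e.text}{box}")
    return "\n".join(lines)


# Toy data, invented for this example only.
semantic_store = [
    Evidence("semantic", 12.0, "chop onions", [1.0, 0.0, 0.0]),
    Evidence("semantic", 95.0, "boil pasta", [0.0, 1.0, 0.0]),
    Evidence("semantic", 240.0, "wash dishes", [0.0, 0.0, 1.0]),
]
visual_store = [
    Evidence("visual", 14.5, "knife", [0.9, 0.1, 0.0], (120, 80, 220, 160)),
    Evidence("visual", 97.0, "pot", [0.1, 0.9, 0.0], (300, 200, 460, 380)),
]

query = [0.0, 1.0, 0.0]  # stands in for an embedded question about the pasta step
context = build_context(query, semantic_store, visual_store, k=1)
print(context)
```

In this sketch the merged context would be passed to the MLLM together with the question, so the model sees only the retrieved steps and objects rather than the full video.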


Related papers