Fugu-MT Paper Translation (Abstract): Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

Paper overview: Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning

arxiv url: http://arxiv.org/abs/2506.07811v1
Date: Mon, 09 Jun 2025 14:38:14 GMT
Status: Translation complete
System update date: 2025-06-10 16:33:11.000336
Title: Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
Title (reference translation): Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
Authors: Tieyuan Chen, Huabin Liu, Yi Wang, Chaofan Gan, Mingxi Lyu, Gui Zou, Weiyao Lin
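Abstract summary: Video Question Answering (VideoQA) aims to answer natural language questions based on a given video. Prior work has focused mainly on identifying the duration of relevant segments, referred to as explicit visual evidence. The paper introduces I-VQA, which focuses on answering questions in scenarios where explicit visual evidence is inaccessible.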
Reference score (independently computed attention score): 16.219354963015675
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM performs SOTA on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in anonymous repo: https://github.com/tychen-SJTU/Implicit-VideoQA.
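Abstract (reference translation): Video Question Answering (VideoQA) aims to answer natural language questions based on a given video. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, which leads to significant performance degradation. To fill this gap, we introduce a new task and dataset, Implicit Video Question Answering (I-VQA). Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a new reasoning framework, IRM (Implicit Reasoning Model), which incorporates dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. IRM outperforms GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. In addition, IRM achieves SOTA results on similar implicit advertisement understanding and on future prediction in traffic-VQA. Datasets and code are available for double-blind review in an anonymous repository.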

Related papers