Fugu-MT Paper Translation (Summary): TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Paper summary: TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

arxiv url: http://arxiv.org/abs/2506.09445v1
Date: Wed, 11 Jun 2025 06:52:31 GMT
Status: Translation complete
System update date: 2025-06-13 06:35:02.649827
Title: TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
Authors: Ayush Gupta, Anirban Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha
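Abstract summary: This paper addresses temporally grounded video question answering (video QA). Given a video and a question, it generates an open-ended answer grounded with start and end times. TOGA is instruction-tuned to jointly generate the answer and the temporal grounding.
Reference score (independently computed attention measure): 47.12557166147296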
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
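The consistency check on pseudo labels mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the names `PseudoLabel`, `token_overlap`, `filter_consistent`, the `answer_for_segment` callable standing in for the vision-language model, the Jaccard word-overlap score, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class PseudoLabel:
    """A candidate temporal grounding for a question/answer pair (hypothetical type)."""
    question: str
    answer: str
    segment: Segment


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase word sets; a crude agreement score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_consistent(
    labels: List[PseudoLabel],
    answer_for_segment: Callable[[str, Segment], str],
    threshold: float = 0.5,
) -> List[PseudoLabel]:
    """Keep labels whose answer agrees with the answer regenerated for the same segment.

    `answer_for_segment` is an assumed stand-in for a model that answers a
    question while referring only to the given temporal segment.
    """
    kept = []
    for label in labels:
        regenerated = answer_for_segment(label.question, label.segment)
        if token_overlap(label.answer, regenerated) >= threshold:
            kept.append(label)
    return kept


def stub_model(question: str, segment: Segment) -> str:
    """Toy model: the first ten seconds show a dog, the rest shows a man."""
    start, _end = segment
    return "a dog catches a ball" if start < 10.0 else "a man waves"


labels = [
    PseudoLabel("what happens in the yard?", "a dog catches a ball", (2.0, 6.5)),
    PseudoLabel("what happens in the yard?", "a dog catches a ball", (12.0, 15.0)),
]
kept = filter_consistent(labels, stub_model)
```

With the stub model, only the first pseudo label survives, because the answer regenerated for its segment matches the original answer. A real system would replace the word-overlap score with whatever agreement measure the model and task call for.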


Related papers