Fugu-MT Paper Translation (Abstract): POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Paper summary: POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

arxiv url: http://arxiv.org/abs/2510.01009v1
Date: Wed, 01 Oct 2025 15:15:36 GMT
Status: Translation complete
System update date: 2025-10-03 14:32:17.214817
Title: POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Title (reference translation): POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi
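Abstract summary: Recent VQA tasks have been given context windows of 1500+ frames. The paper introduces POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image.
Reference score (independently computed attention score): 3.4998703934432682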
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since Flamingo was introduced by DeepMind. Recent advances in long-context and long-video question answering have allowed VQA tasks to use a context window of 1500+ frames. However, this amounts to only about 50 seconds of video footage if no significant information is to be lost. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling, and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target that includes reasoning and a final answer. We apply Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA, which consists of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On ReasonVQA, this method dramatically improves performance over pooled baselines: F1 improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also increases significantly. Cross-evaluation of SFT + DPO across pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.
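Abstract (reference translation): Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has attracted significant research attention since Flamingo was introduced by DeepMind. Recently, VQA tasks have been given context windows of 1500+ frames. However, this amounts to only about 50 seconds of video if no significant information is to be lost. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and aligns LVLMs with lightweight supervision. Concretely, we use Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling, and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target that includes reasoning and a final answer. We apply Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on the new ReasonVQA dataset. F1 improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also improves significantly. Cross-evaluation of SFT + DPO across pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.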
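
The abstract describes compressing each second of video into a single temporally pooled image through weighted averaging. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes frames are flat lists of pixel intensities, that ramp and exponential weights emphasize later frames, and that the `decay` parameter and all function names are hypothetical.

```python
import math


def ramp_weights(n):
    """Linearly increasing weights, normalized to sum to 1 (assumed form)."""
    raw = [i + 1 for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def exponential_weights(n, decay=0.5):
    """Weights that grow exponentially toward the last frame.

    `decay` is a hypothetical parameter; the abstract does not specify one.
    """
    raw = [math.exp(-decay * (n - 1 - i)) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def pool_frames(frames, weights):
    """Per-pixel weighted average of equally sized frames."""
    if len(frames) != len(weights):
        raise ValueError("need exactly one weight per frame")
    n_pixels = len(frames[0])
    pooled = [0.0] * n_pixels
    for frame, w in zip(frames, weights):
        if len(frame) != n_pixels:
            raise ValueError("all frames must have the same size")
        for p, value in enumerate(frame):
            pooled[p] += w * value
    return pooled


def pool_video(frames, fps, weight_fn):
    """Reduce a frame sequence to one pooled frame per (possibly partial) second."""
    pooled = []
    for start in range(0, len(frames), fps):
        chunk = frames[start:start + fps]
        pooled.append(pool_frames(chunk, weight_fn(len(chunk))))
    return pooled


if __name__ == "__main__":
    # Six tiny two-pixel "frames" at 3 fps become two pooled frames.
    video = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0],
             [5.0, 5.0], [5.0, 5.0], [5.0, 5.0]]
    for image in pool_video(video, fps=3, weight_fn=ramp_weights):
        print(image)
```

With this kind of pooling, a model receives one image per second of footage instead of every raw frame, which is how a fixed frame budget can span a much longer clip.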

Related papers