Fugu-MT paper translation (abstract): Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA

Paper overview: Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA

arxiv url: http://arxiv.org/abs/2509.17743v1
Date: Mon, 22 Sep 2025 13:06:17 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.401
Title: Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
Title (reference translation): Adaptive fast-and-slow visual program reasoning for long-form video QA
Authors: Chenglin Li, Feng Han, FengTao, Ruilin Li, Qianglong Chen, Jingqi Tong, Yin Zhang, Jiaqi Wang
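Abstract summary: This paper introduces the FS-VisPR framework, an adaptive visual program reasoning approach. It balances fast reasoning for simple queries with slow reasoning for difficult ones. Experiments show that FS-VisPR achieves both efficiency and reliability in visual programs.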
Reference score (independently computed attention measure): 36.10720855157895
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models' ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, and matches the performance of Qwen2.5VL-72B on VideoMME.
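Abstract (reference translation): Large language models (LLMs) have shown promise in generating program workflows for visual tasks. Earlier approaches, however, often depend on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these problems, we introduce the FS-VisPR framework, an adaptive visual program reasoning method that balances fast reasoning for simple queries against slow reasoning for difficult ones. First, we design efficient visual modules (for example, key clip retrieval and subtitle retrieval) that support long-form video tasks. We then use a strong LLM to build a diverse fast-slow reasoning dataset and use it to align an open-source language model's ability to generate visual program workflows; the resulting model is FS-LLM. Next, we design a fast-slow reasoning framework around FS-LLM: simple queries are solved directly by a VideoLLM, while difficult queries invoke visual program reasoning, a split motivated by human-like reasoning processes. Within this process, a low-confidence fast answer triggers a second-stage slow reasoning process, and a fallback to fast reasoning is activated if program execution fails. We further improve visual programs through parameter search during both training and inference. Adjusting the parameters of the visual modules inside a program produces multiple variants: during training, the programs that yield correct answers are selected, and during inference, the program whose result has the highest confidence is applied. Experiments show that FS-VisPR improves both the efficiency and the reliability of visual program workflows. It reaches 50.4% accuracy on LVBench, surpassing GPT-4o, and matches the performance of Qwen2.5VL-72B on VideoMME.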
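
The control flow described in the abstract (fast answer first, slow program reasoning on low confidence, selection of the highest-confidence program variant, fallback to the fast answer on execution failure) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `Answer`, `fast_slow_answer`, `fast_model`, and `program_variants`, the confidence scale, and the threshold value `0.7` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Answer:
    text: str
    confidence: float  # hypothetical score, assumed to lie in [0, 1]


def fast_slow_answer(
    question: str,
    fast_model: Callable[[str], Answer],
    program_variants: Iterable[Callable[[str], Answer]],
    threshold: float = 0.7,  # assumed value; the abstract gives no number
) -> Answer:
    """Try fast reasoning first; run program variants only if confidence is low."""
    fast = fast_model(question)
    if fast.confidence >= threshold:
        return fast  # confident fast answer: no program is executed

    # Slow path: each variant stands in for one visual program with a
    # different module-parameter setting (the inference-time parameter search).
    best: Optional[Answer] = None
    for run_program in program_variants:
        try:
            candidate = run_program(question)
        except Exception:
            continue  # this variant failed to execute; try the next one
        if best is None or candidate.confidence > best.confidence:
            best = candidate

    # Fallback: if every variant failed, keep the fast answer.
    return best if best is not None else fast
```

In this sketch the callables would wrap a VideoLLM and generated visual programs; here they are left abstract so that the routing logic can be read on its own.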

Related papers