Fugu-MT Paper Translation (Summary): MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

Paper summary: MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

arxiv url: http://arxiv.org/abs/2605.21917v1
Date: Thu, 21 May 2026 02:44:27 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.063861
Title: MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
Title (reference translation): MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
Authors: Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali,
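Abstract summary: We present MAVEN, a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces. MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data.
Reference score (independently computed attention level): 3.6322796145178167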
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.
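Abstract (reference translation): Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations that capture not only what happened, but also when, where, why, and with what consequence, at a scale that manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces. MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos. Qualitative results on warehouse surveillance and public safety videos further show that the agentic workflow readily adapts the pipeline to new domains.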

Related papers