Fugu-MT Paper Translation (Abstract): Building a Precise Video Language with Human-AI Oversight

Paper abstract: Building a Precise Video Language with Human-AI Oversight

arxiv url: http://arxiv.org/abs/2604.21718v2
Date: Sun, 26 Apr 2026 23:28:29 GMT
Status: Translation complete
System update date: 2026-04-28 17:12:06.933782
Title: Building a Precise Video Language with Human-AI Oversight
Title (reference translation): Building a Precise Video Language through Human-AI Oversight
Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan
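Abstract summary: CHAI (Critique-based Human-AI Oversight) is a framework in which trained experts critique and revise model-generated pre-captions into improved post-captions. The ablations show that critique quality in precision, recall, and constructiveness, ensured by the oversight framework, directly governs downstream performance.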
Reference score (independently computed attention score): 63.57293532815199
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
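Abstract (reference translation): Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded in hundreds of visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework in which trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models. In addition, these critiques and the preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on the project page.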
