Fugu-MT 論文翻訳(概要): VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

論文の概要: VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

arxiv url: http://arxiv.org/abs/2507.13353v1
Date: Thu, 17 Jul 2025 17:59:59 GMT
ステータス: 翻訳完了
システム内更新日: 2025-07-18 20:10:24.6224
Title: VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
Title（参考訳）: VideoITG: テンポラルグラウンドによるマルチモーダルビデオ理解
Authors: Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu,
Abstract要約: Instructed Temporal Grounding for Videos (VideoITG) を提案する。 VideoITGの中核は、人間のアノテーションプロセスを明示的に模倣する自動アノテーションフレームワークであるVidThinkerパイプラインである。我々は,複数のマルチモーダルビデオ理解ベンチマークにおいて,ビデオITGが一貫した性能向上を実現していることを示す。
参考スコア（独自算出の注目度）: 45.83476222676765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, substantially adopt unsupervised learning paradigms, whereas they struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of visual language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potentials for video understanding.
Abstract（参考訳）: 近年,ビデオ大言語モデル (Video Large Language Models, Video-LLMs) の性能向上に寄与することが報告されている。フレーム間の冗長性の低減、画像テキスト関連性評価に別々のモデルの採用、イベントローカライゼーションに時間的ビデオグラウンドを利用するといった現在の手法は、教師なし学習パラダイムを実質的に採用しているが、長いビデオ理解において複雑なシナリオに対処するのに苦労している。 Instructed Temporal Grounding for Videos (VideoITG) を提案する。 VideoITGの中核は、人間のアノテーションプロセスを明示的に模倣する自動アノテーションフレームワークであるVidThinkerパイプラインである。まず、命令に条件付けされたクリップレベルの詳細なキャプションを生成し、次に、命令誘導推論により関連するビデオセグメントを検索し、最後に、最も情報に富む視覚的証拠を特定するために、きめ細かいフレーム選択を行う。 VidThinkerを活用することで、40Kビデオと500K指示の時間的接地アノテーションを含むVideoITG-40Kデータセットを構築した。次に,視覚言語アライメントと Video-LLM の推論機能を利用して,効果的フレーム選択を識別的に行うプラグイン・アンド・プレイ・ビデオITG モデルを設計する。 Video-LLMsと組み合わせることで、VideoITGは複数のマルチモーダルビデオ理解ベンチマークで一貫したパフォーマンス向上を実現し、ビデオ理解の優位性と大きな可能性を示している。

関連論文リスト

Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
本稿では,自然言語クエリに基づいて映像中の時間的モーメントを正確に局所化する,ユニバーサルビデオ時間的グラウンドの計算モデルを提案する。生成型マルチモーダル大言語モデル(MLLM)の強力な視覚言語理解機能を活用した,堅牢で普遍的なビデオグラウンドモデルUniTimeを提案する。我々のモデルは、複雑な言語クエリを解釈しながら、多様なビュー、ジャンル、長さの動画を効果的に処理する。
論文参考訳（メタデータ） (2025-06-23T17:53:18Z)
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [25.770675590118547]
VideoRAGは、非常に長いコンテキストのビデオの処理と理解に特化して設計された最初の検索拡張生成フレームワークである。我々の中心となる革新は、(i)グラフベースのテキスト知識をシームレスに統合し、(ii)視覚的特徴を効率的に保存するマルチモーダルコンテキストエンコーディングである。
論文参考訳（メタデータ） (2025-02-03T17:30:19Z)
VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAGは、クエリによる関連性に基づいて、動的にビデオを取得するフレームワークである。 VideoRAGは近年のLVLM(Large Video Language Models)を利用している。我々は,ビデオRAGの有効性を実験的に検証し,関連するベースラインよりも優れていることを示す。
論文参考訳（メタデータ） (2025-01-10T11:17:15Z)
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
空間的詳細と時間的コヒーレンスを保持するビデオQAペアを特徴とする,新しいデータセットであるVideoEspressoを紹介する。 GPT-4o を用いた QA ペア生成にあたり, 冗長性を抑えるためにセマンティック・アウェア法を用いて構成パイプラインを構築した。フレームセレクタと2段階の命令微調整推論LVLMを備えたハイブリッドLVLM協調フレームワークを提案する。
論文参考訳（メタデータ） (2024-11-22T08:33:36Z)
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding [15.959757105308238]
ビデオLMMは、視覚入力を処理するために、画像エンコーダまたはビデオエンコーダに依存しており、それぞれに独自の制限がある。本稿では,映像エンコーダと映像エンコーダの相補的利点(大域的時間文脈モデリング)を組み合わせたビデオGPT+を紹介する。本稿では,VCGBench,MVBench,Zero-shotなど,複数のビデオベンチマークのパフォーマンス向上を示す。
論文参考訳（メタデータ） (2024-06-13T17:59:59Z)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2は、ビデオファウンデーションモデル(FM)の新たなファミリーで、ビデオ認識、ビデオ音声タスク、ビデオ中心タスクの最先端の結果を達成する。私たちのコアデザインは、マスク付きビデオモデリング、クロスコントラスト学習、予測トークンを統合し、最大6Bビデオサイズまでスケールアップするプログレッシブトレーニングアプローチです。
論文参考訳（メタデータ） (2024-03-22T17:57:42Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
ダイナミックスビデオのモデリングのため、ビデオ事前トレーニングは難しい。本稿では,ビデオ事前学習におけるこのような制限を,効率的なビデオ分解によって解決する。筆者らのフレームワークは,13のマルチモーダルベンチマークにおいて,画像と映像のコンテントの理解と生成が可能であることを実証した。
論文参考訳（メタデータ） (2024-02-05T16:30:49Z)
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
textbfVidCoMは、Large Language Models (LLM)を活用して、軽量なビジュアルツールを使用して動画を推論する高速適応フレームワークである。 InsOVERアルゴリズムは、言語命令の分解とビデオイベントの間の効率的なハンガリー語マッチングに基づいて、対応するビデオイベントを特定する。
論文参考訳（メタデータ） (2023-10-16T17:05:56Z)
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVidは大規模なビデオ中心のマルチモーダルデータセットで、強力で転送可能なビデオテキスト表現の学習を可能にする。 InternVidデータセットは700万本以上のビデオが760万時間近く持続し、合計4.1Bワードの詳細な記述を伴う234万本のビデオクリップが生成される。
論文参考訳（メタデータ） (2023-07-13T17:58:32Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
我々は、少数のビデオ推論のシーケンシャルな理解として、フレーミングビデオ推論を動機付けている。 VIPは、ビデオチェーンオブ思考を通してモデルの推論能力を調べるために設計された、推論時の課題データセットである。我々は、VIP上でGPT-4、GPT-3、VICUNAをベンチマークし、複雑なビデオ推論タスクのパフォーマンスギャップを実証し、今後の作業を促進する。
論文参考訳（メタデータ） (2023-05-23T10:26:42Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。