Fugu-MT Paper Translation (Abstract): VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Paper overview: VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

arxiv url: http://arxiv.org/abs/2601.07290v1
Date: Mon, 12 Jan 2026 07:51:37 GMT
Status: Translation complete
Last updated in system: 2026-01-13 19:08:01.272008
Title: VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
Authors: Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu
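Abstract summary: VideoLoom is a Video Large Language Model (Video LLM). The paper introduces LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. It also introduces LoomBench, a new benchmark consisting of temporal, spatial, and compositional video-question pairs.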
Reference score (independently computed attention metric): 46.97966072048103
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

Related papers