Related papers: Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

URL: http://arxiv.org/abs/2303.16341v3
Date: Sat, 7 Sep 2024 00:09:15 GMT
Title: Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Authors: Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan,
Abstract summary: We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
Score: 112.3913646778859
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is of importance to downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to be capable of capturing region-object correspondences and recognizing scene changes in a video clip, reflecting spatial and temporal granularity, respectively. To strengthen model's understanding into such fine-grained details, we propose a simple yet effective video-language modeling framework, S-ViLM, by exploiting the intrinsic structures of these two modalities. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features, simultaneously. Comprehensive evaluations demonstrate that S-ViLM performs favorably against existing approaches in learning more expressive representations. Specifically, S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition, and temporal action localization.

Related papers

Video Understanding: Through A Temporal Lens [5.153774021264937]
This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding.<n>The work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using "recurrent adapters" to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers for efficient long-form video modeling; and (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments.
arXiv Detail & Related papers (2026-01-31T12:01:09Z)
Temporal Grounding as a Learning Signal for Referring Video Object Segmentation [29.646697516547558]
Referring Video Object (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries.<n>Existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training.<n>We introduce MeViS-M, a dataset built upon the challenging MeViS benchmark, where we manually annotate temporal spans when each object is referred to by the expression.
arXiv Detail & Related papers (2025-08-16T07:34:43Z)
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding [19.50051728766238]
We propose an innovative video representation method called Dynamic-Image (DynImg)<n>Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects.<n>During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions.
arXiv Detail & Related papers (2025-07-21T12:50:49Z)
Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z)
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z)
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, to generate text-to-video (T2V) videos. We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training. The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID. Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
PcmNet: Position-Sensitive Context Modeling Network for Temporal Action Localization [11.685362686431446]
We propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization. We achieve state-of-the-art performance on both two challenging datasets, THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2021-03-09T07:34:01Z)
BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos. Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces. BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos. Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.