Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
- URL: http://arxiv.org/abs/2406.19255v1
- Date: Thu, 27 Jun 2024 15:23:36 GMT
- Title: Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
- Authors: Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, Shuicheng Yan
- Abstract summary: Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
- Score: 130.15775113897553
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging the two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of the TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment, respectively, enhancing video-language grounding in both spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs, and significantly refreshes the current state-of-the-art end-task performance in both the fine-tuning and zero-shot settings.
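The abstract does not specify the exact form of the two alignment objectives, but the following is a minimal, hypothetical PyTorch sketch of how an object-centered spatial alignment and a predicate-centered temporal alignment could be realized as contrastive losses over TSG and DSG node features. All function names, tensor shapes, and the assumed one-to-one node correspondences are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the two alignment losses described in the abstract.
# Node features are assumed to come from the TSG / DSG encoders; the identity
# correspondence between nodes is an assumption made purely for illustration.
import torch
import torch.nn.functional as F


def object_centered_spatial_alignment(tsg_obj_feats, dsg_obj_feats, temperature=0.07):
    """Contrastively align textual object nodes with video object nodes.

    tsg_obj_feats: (N_text_objects, D) features from the textual scene graph.
    dsg_obj_feats: (N_video_objects, D) features pooled over time from the
        dynamic (video) scene graph.
    """
    t = F.normalize(tsg_obj_feats, dim=-1)
    v = F.normalize(dsg_obj_feats, dim=-1)
    sim = t @ v.T / temperature                      # (N_text, N_video)
    # Assume the i-th textual object corresponds to the i-th video object;
    # in practice a matching step would supply these correspondences.
    targets = torch.arange(min(len(t), len(v)))
    return F.cross_entropy(sim[: len(targets)], targets)


def predicate_centered_temporal_alignment(tsg_pred_feats, dsg_pred_tracks, temperature=0.07):
    """Align textual predicate nodes with temporal tracks of video predicates.

    dsg_pred_tracks: (N_video_preds, T, D) per-frame predicate features,
    mean-pooled over time here purely for illustration.
    """
    p = F.normalize(tsg_pred_feats, dim=-1)
    q = F.normalize(dsg_pred_tracks.mean(dim=1), dim=-1)
    sim = p @ q.T / temperature
    targets = torch.arange(min(len(p), len(q)))
    return F.cross_entropy(sim[: len(targets)], targets)


# Example usage with random features (D = 256, T = 12 frames):
text_objs, video_objs = torch.randn(8, 256), torch.randn(8, 256)
text_preds, video_preds = torch.randn(5, 256), torch.randn(5, 12, 256)
loss = (object_centered_spatial_alignment(text_objs, video_objs)
        + predicate_centered_temporal_alignment(text_preds, video_preds))
```

Under this reading, the spatial loss grounds "who/what" across modalities while the temporal loss grounds "what happens over time", which matches the object-centered versus predicate-centered split described in the abstract.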
Related papers
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z)
- Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs [79.64891686479213]
We show that it is possible to improve vision and language models (VLMs) by learning from scene graphs (SGs).
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions.
Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in ZS capabilities.
arXiv Detail & Related papers (2023-05-10T17:52:26Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)