Context-Guided Spatio-Temporal Video Grounding
- URL: http://arxiv.org/abs/2401.01578v1
- Date: Wed, 3 Jan 2024 07:05:49 GMT
- Title: Context-Guided Spatio-Temporal Video Grounding
- Authors: Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, Libo Zhang
- Abstract summary: We propose a framework, context-guided STVG (CG-STVG), which mines discriminative instance context for the target object in videos.
CG-STVG benefits from both the object information in the text query and guidance from the mined instance visual context for more accurate target localization.
In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-art results in m_tIoU and m_vIoU.
- Score: 22.839160907707885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The spatio-temporal video grounding (STVG) task aims at locating a
spatio-temporal tube for a specific instance given a text query. Despite
advancements, current methods are easily affected by distractors or heavy object
appearance variations in videos due to insufficient object information from the
text, leading to degraded performance. To address this, we propose a novel
framework, context-guided STVG (CG-STVG), which mines discriminative instance
context for the target object in videos and applies it as supplementary
guidance for target localization. The key to CG-STVG lies in two specially
designed modules: instance context generation (ICG), which focuses on
discovering visual context information (in both appearance and motion) of the
instance, and instance context refinement (ICR), which aims to improve the
instance context from ICG by eliminating irrelevant or even harmful
information. During grounding, ICG and ICR are deployed at each decoding stage
of a Transformer architecture for instance context learning. In particular,
the instance context learned at one decoding stage is fed to the next stage and
leveraged as guidance containing rich and discriminative object features to
enhance target-awareness in the decoding features, which in turn helps generate
better instance context and ultimately improves localization. Compared to
existing methods, CG-STVG benefits from both the object information in the text
query and the guidance from mined instance visual context for more accurate
target localization. In our experiments on three benchmarks, including
HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-art results in m_tIoU
and m_vIoU on all of them, demonstrating its efficacy. The code will be
released at https://github.com/HengLan/CGSTVG.
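
As a rough illustration of the decoding flow described in the abstract, below is a minimal PyTorch-style sketch of per-stage instance context learning: context mined and refined at one decoding stage is injected as guidance into the next. The module internals (attention-based ICG, gated ICR), class names, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class InstanceContextGeneration(nn.Module):
    """ICG (assumed form): mines appearance/motion context of the instance from video features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, decode_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # Query the video memory with the current decoding feature.
        ctx, _ = self.attn(decode_feat, video_feat, video_feat)
        return ctx


class InstanceContextRefinement(nn.Module):
    """ICR (assumed form): suppresses irrelevant or harmful parts of the mined context."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        return ctx * self.gate(ctx)  # soft gating stands in for the refinement step


class ContextGuidedDecoder(nn.Module):
    """Stacked decoding stages; context mined at stage k guides stage k+1 (per the abstract)."""

    def __init__(self, dim: int = 256, num_stages: int = 6):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True) for _ in range(num_stages)]
        )
        self.icg = nn.ModuleList([InstanceContextGeneration(dim) for _ in range(num_stages)])
        self.icr = nn.ModuleList([InstanceContextRefinement(dim) for _ in range(num_stages)])

    def forward(self, query: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        ctx = torch.zeros_like(query)  # no instance context before the first stage
        for layer, icg, icr in zip(self.stages, self.icg, self.icr):
            # Inject the refined context from the previous stage as supplementary guidance.
            decode_feat = layer(query + ctx, video_feat)
            # Mine and refine new instance context for the next stage.
            ctx = icr(icg(decode_feat, video_feat))
            query = decode_feat
        return query  # fed to temporal/spatial prediction heads (not shown)


# Example: 2 samples, 1 target query of dim 256, 100 video tokens.
out = ContextGuidedDecoder()(torch.randn(2, 1, 256), torch.randn(2, 100, 256))
```

The paper reports m_tIoU and m_vIoU. As a reference, the sketch below follows the definitions commonly used on VidSTG/HC-STVG style benchmarks (temporal IoU of the predicted and ground-truth segments, and per-frame box IoU accumulated over their temporal intersection and normalized by their temporal union); the exact evaluation protocol and the data formats assumed here may differ from the paper's.

```python
def t_iou(pred_seg, gt_seg):
    """Temporal IoU between predicted and ground-truth segments given as (start, end) frame indices."""
    inter = max(0.0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]))
    union = max(pred_seg[1], gt_seg[1]) - min(pred_seg[0], gt_seg[0])
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred_boxes, gt_boxes, pred_seg, gt_seg):
    """vIoU: per-frame box IoU summed over the temporal intersection, normalized by the temporal union.

    pred_boxes/gt_boxes are assumed to be dicts mapping frame index -> box.
    """
    t_start, t_end = max(pred_seg[0], gt_seg[0]), min(pred_seg[1], gt_seg[1])
    union_len = max(pred_seg[1], gt_seg[1]) - min(pred_seg[0], gt_seg[0])
    if union_len <= 0:
        return 0.0
    total = sum(
        box_iou(pred_boxes[t], gt_boxes[t])
        for t in range(t_start, t_end)
        if t in pred_boxes and t in gt_boxes
    )
    return total / union_len


# m_tIoU and m_vIoU are the means of t_iou and v_iou over all test samples.
```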
Related papers
- Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization [22.58434223222062]
We propose a new few-shot temporal action localization method based on Chain-of-Thought textual reasoning to improve localization performance.
Specifically, we design a novel few-shot learning framework that leverages textual semantic information to enhance the model's ability to capture action commonalities and variations.
We conduct extensive experiments on the publicly available ActivityNet1.3 and THUMOS14 datasets.
arXiv Detail & Related papers (2025-04-18T04:35:35Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment.
Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding [20.906378094998303]
Existing Transformer-based STVG approaches often leverage a set of object queries, which are simply initialized with zeros.
Despite their simplicity, these zero-initialized object queries lack target-specific cues and thus struggle to learn discriminative target information.
We introduce a novel Target-Aware Transformer for STVG (TA-STVG), which seeks to adaptively generate object queries via exploring target-specific cues from the given video-text pair.
arXiv Detail & Related papers (2025-02-16T15:38:33Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific description-based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z) - Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z) - See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is the task of localizing objects in a 3D scene and generating descriptive sentences for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal
Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Video Referring Expression Comprehension via Transformer with
Content-aware Query [60.89442448993627]
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression.
We argue that the current query design is suboptimal and suffers from two drawbacks.
We set up a fixed number of learnable bounding boxes across each frame, and the aligned region features are employed to provide fruitful clues.
arXiv Detail & Related papers (2022-10-06T14:45:41Z) - Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.