Described Spatial-Temporal Video Detection
- URL: http://arxiv.org/abs/2407.05610v1
- Date: Mon, 8 Jul 2024 04:54:39 GMT
- Title: Described Spatial-Temporal Video Detection
- Authors: Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann
- Abstract summary: Spatial-temporal video grounding (STVG) is formulated to detect only one pre-existing object in each frame.
In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD).
DVD-ST supports grounding from none to many objects onto the video in response to queries.
- Score: 33.69632963941608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting visual content based on language expressions has become an emerging topic in the community. However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to detect only one pre-existing object in each frame, ignoring the fact that language descriptions can involve none or multiple entities within a video. In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD) by overcoming the above limitation. To facilitate the exploration of DSTVD, we first introduce a new benchmark, namely DVD-ST. Notably, DVD-ST supports grounding from none to many objects onto the video in response to queries and encompasses a diverse range of over 150 entities, covering appearance, actions, locations, and interactions. The extensive breadth and diversity of the DVD-ST dataset make it an exemplary testbed for the investigation of DSTVD. In addition to the new benchmark, we further present two baseline methods for our proposed DSTVD task by extending two representative STVG models, i.e., TubeDETR and STCAT. These extended models capitalize on tubelet queries to localize and track referred objects across the video sequence. Besides, we adjust the training objectives of these models to optimize spatial and temporal localization accuracy and multi-class classification capabilities. Furthermore, we benchmark the baselines on the introduced DVD-ST dataset and conduct extensive experimental analysis to guide future investigation. Our code and benchmark will be publicly available.
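The abstract only sketches how the tubelet-query baselines work, but the idea maps naturally onto a DETR-style prediction head. The PyTorch sketch below is purely illustrative, since the paper's code is not yet released: the module name TubeletHead, the shapes, and parameters such as num_classes and the presence head are all assumptions, not the authors' actual design. Each tubelet query is decoded into class logits over the entity classes plus a "no object" class (which is what allows grounding zero-to-many objects), a per-frame box for spatial localization, and a per-frame presence logit for temporal localization.

```python
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    """Illustrative prediction head for a DSTVD-style baseline (assumed design).

    Assumes a cross-modal transformer decoder has already produced one feature
    vector per (frame, tubelet query). Each query is then decoded into:
      (i)   class logits over num_classes + 1, where the extra index is a
            "no object" class that lets the model ground zero-to-many objects,
      (ii)  a per-frame bounding box for spatial localization, and
      (iii) a per-frame presence logit for temporal localization.
    """

    def __init__(self, hidden_dim: int = 256, num_classes: int = 150):
        super().__init__()
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.box_head = nn.Linear(hidden_dim, 4)       # (cx, cy, w, h), normalized
        self.presence_head = nn.Linear(hidden_dim, 1)  # is this frame inside the tube?

    def forward(self, decoded: torch.Tensor):
        # decoded: (batch, frames, queries, hidden_dim) from the decoder
        class_logits = self.class_head(decoded.mean(dim=1))  # (B, Q, C+1): one label per tubelet
        boxes = self.box_head(decoded).sigmoid()             # (B, T, Q, 4)
        presence = self.presence_head(decoded).squeeze(-1)   # (B, T, Q)
        return class_logits, boxes, presence

head = TubeletHead()
feats = torch.randn(2, 8, 25, 256)  # 2 clips, 8 frames, 25 tubelet queries
cls_logits, boxes, presence = head(feats)
print(cls_logits.shape, boxes.shape, presence.shape)
# torch.Size([2, 25, 151]) torch.Size([2, 8, 25, 4]) torch.Size([2, 8, 25])
```

In training, outputs of this kind would typically be Hungarian-matched to ground-truth tubes and supervised with cross-entropy on classes, L1/GIoU losses on boxes, and binary cross-entropy on presence, which is consistent with the abstract's mention of objectives covering spatial localization, temporal localization, and multi-class classification.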
Related papers
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Siamese Tracking with Lingual Object Constraints [28.04334832366449]
This paper explores tracking visual objects subject to additional lingual constraints.
Unlike Li et al., we impose additional lingual constraints on tracking, which enables new applications of tracking.
Our method enables the selective compression of videos, based on the validity of the constraint.
arXiv Detail & Related papers (2020-11-23T20:55:08Z)
- A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z)
- Human-centric Spatio-Temporal Video Grounding With Visual Transformers [70.50326310780407]
We introduce a novel task: Human-centric Spatio-Temporal Video Grounding (HC-STVG).
HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given description.
We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT).
arXiv Detail & Related papers (2020-11-10T11:23:38Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
- Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to ineffective tube pre-generation and the lack of object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)