Related papers: Tracking Objects and Activities with Attention for Temporal Sentence Grounding

URL: http://arxiv.org/abs/2302.10813v1
Date: Tue, 21 Feb 2023 16:42:52 GMT
Title: Tracking Objects and Activities with Attention for Temporal Sentence Grounding
Authors: Zeyu Xiong, Daizong Liu, Pan Zhou, Jiahao Zhu
Abstract summary: Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, and (B) a Temporal Sentence Tracker to track multi-modal targets, model the targets' behavior, and predict the query-related segment.
Score: 51.416914256782505
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video. Most existing methods extract frame-grained features or object-grained features by 3D ConvNet or detection network under a conventional TSG framework, failing to capture the subtle differences between frames or to model the spatio-temporal behavior of core persons/objects. In this paper, we introduce a new perspective to address the TSG task by tracking pivotal objects and activities to learn more fine-grained spatio-temporal behaviors. Specifically, we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, filtering objects and activities, and (B) a Temporal Sentence Tracker to track multi-modal targets for modeling the targets' behavior and to predict the query-related segment. Extensive experiments and comparisons with state-of-the-art methods are conducted on challenging benchmarks: Charades-STA and TACoS. Our TSTNet achieves the leading performance with a considerable real-time speed.

Related papers

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding [20.906378094998303]
Existing Transformer-based STVG approaches often leverage a set of object queries, which are simply initialized with zeros. Despite their simplicity, these zero object queries lack target-specific cues and therefore struggle to learn discriminative target information. We introduce a novel Target-Aware Transformer for STVG (TA-STVG), which seeks to adaptively generate object queries by exploring target-specific cues from the given video-text pair.
arXiv Detail & Related papers (2025-02-16T15:38:33Z)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z)
ClickTrack: Towards Real-time Interactive Single Object Tracking [58.52366657445601]
We propose ClickTrack, a new paradigm for single object tracking that uses clicking interaction in real-time scenarios. To address ambiguity in certain special scenarios, we designed the Guided Click Refiner (GCR), which accepts point and optional textual information as inputs. Experiments on the LaSOT and GOT-10k benchmarks show that a tracker combined with GCR achieves stable performance in real-time interactive scenarios.
arXiv Detail & Related papers (2024-11-20T10:30:33Z)
STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT). We use historical embedding features to model the representation of ReID and detection features in a sequential order. Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z)
Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge. Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time. This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network. Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. We revamped the framework with the Progressive Context Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z)
End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
SpOT: Spatiotemporal Modeling for 3D Object Tracking [68.12017780034044]
3D multi-object tracking aims to consistently identify all mobile entities through time. Current 3D tracking methods rely on abstracted information and limited history. We develop a holistic representation of scenes that leverages both spatial and temporal information.
arXiv Detail & Related papers (2022-07-12T21:45:49Z)
STURE: Spatial-Temporal Mutual Representation Learning for Robust Data Association in Online Multi-Object Tracking [7.562844934117318]
The proposed approach is capable of extracting more distinguishing detection and sequence representations. It is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers.
arXiv Detail & Related papers (2022-01-18T08:52:40Z)
Multi-Object Tracking and Segmentation with a Space-Time Memory Network [12.043574473965318]
We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets. The proposed tracker, MeNToS, addresses particularly the long-term data association problem.
arXiv Detail & Related papers (2021-10-21T17:13:17Z)
DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information. We show that the proposed method achieves performance superior to that of state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.