Tracking Objects and Activities with Attention for Temporal Sentence
Grounding
- URL: http://arxiv.org/abs/2302.10813v1
- Date: Tue, 21 Feb 2023 16:42:52 GMT
- Title: Tracking Objects and Activities with Attention for Temporal Sentence
Grounding
- Authors: Zeyu Xiong, Daizong Liu, Pan Zhou, Jiahao Zhu
- Abstract summary: Temporal sentence grounding (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, and (B) a Temporal Sentence Tracker to track multi-modal targets' behavior and to predict the query-related segment.
- Score: 51.416914256782505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding (TSG) aims to localize the temporal segment which
is semantically aligned with a natural language query in an untrimmed
video. Most existing methods extract frame-grained features or object-grained
features by 3D ConvNet or detection network under a conventional TSG framework,
failing to capture the subtle differences between frames or to model the
spatio-temporal behavior of core persons/objects. In this paper, we introduce a
new perspective to address the TSG task by tracking pivotal objects and
activities to learn more fine-grained spatio-temporal behaviors. Specifically,
we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains
(A) a Cross-modal Targets Generator to generate multi-modal templates and
search space, filtering objects and activities, and (B) a Temporal Sentence
Tracker to track multi-modal targets for modeling the targets' behavior and to
predict the query-related segment. Extensive experiments and comparisons with
state-of-the-art methods are conducted on two challenging benchmarks,
Charades-STA and TACoS, where our TSTNet achieves leading performance at a
considerable real-time speed.
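To make the task's input and output concrete, here is a minimal sketch of what "localizing the temporal segment" means. It is not TSTNet's method: it assumes a hypothetical list of per-frame query-relevance scores is already available and simply picks the contiguous frame span that maximizes the summed margin over a hypothetical threshold (Kadane's maximum-subarray algorithm). The names `ground_segment`, `scores`, and `threshold` are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def ground_segment(scores: List[float], threshold: float = 0.5) -> Tuple[int, int]:
    """Return inclusive (start, end) frame indices of the contiguous span
    whose summed (score - threshold) is largest.

    `scores` are hypothetical per-frame query-relevance values; a real TSG
    model would have to produce them from the video and the sentence.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    best_sum = float("-inf")
    best_span = (0, 0)
    run_sum = 0.0
    run_start = 0

    for i, score in enumerate(scores):
        gain = score - threshold
        if run_sum <= 0:
            # The previous run does not help; start a new span at frame i.
            run_sum = gain
            run_start = i
        else:
            run_sum += gain
        if run_sum > best_sum:
            best_sum = run_sum
            best_span = (run_start, i)

    return best_span


# Illustrative scores only: frames 2..4 are the ones most relevant to the query.
frame_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1]
start, end = ground_segment(frame_scores)
print(start, end)
```

With a known frame rate, the returned indices would convert to start and end timestamps for the query-related segment.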
Related papers
- ClickTrack: Towards Real-time Interactive Single Object Tracking [58.52366657445601]
We propose ClickTrack, a new paradigm for single object tracking that uses clicking interaction for real-time scenarios.
To address ambiguity in certain special scenarios, we designed the Guided Click Refiner (GCR), which accepts point and optional textual information as inputs.
Experiments on the LaSOT and GOT-10k benchmarks show that a tracker combined with GCR achieves stable performance in real-time interactive scenarios.
arXiv Detail & Related papers (2024-11-20T10:30:33Z)
- STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT).
We use historical embedding features to model the representation of ReID and detection features in a sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template.
This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames.
We revamped the framework with the Progressive Context Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- SpOT: Spatiotemporal Modeling for 3D Object Tracking [68.12017780034044]
3D multi-object tracking aims to consistently identify all mobile entities through time.
Current 3D tracking methods rely on abstracted information and limited history.
We develop a holistic representation of scenes that leverages both spatial and temporal information.
arXiv Detail & Related papers (2022-07-12T21:45:49Z)
- STURE: Spatial-Temporal Mutual Representation Learning for Robust Data Association in Online Multi-Object Tracking [7.562844934117318]
The proposed approach is capable of extracting more distinguishing detection and sequence representations.
It is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers.
arXiv Detail & Related papers (2022-01-18T08:52:40Z)
- Multi-Object Tracking and Segmentation with a Space-Time Memory Network [12.043574473965318]
We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets.
The proposed tracker, MeNToS, addresses particularly the long-term data association problem.
arXiv Detail & Related papers (2021-10-21T17:13:17Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves performance superior to that of state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.