Tracking Objects and Activities with Attention for Temporal Sentence
Grounding
- URL: http://arxiv.org/abs/2302.10813v1
- Date: Tue, 21 Feb 2023 16:42:52 GMT
- Title: Tracking Objects and Activities with Attention for Temporal Sentence
Grounding
- Authors: Zeyu Xiong, Daizong Liu, Pan Zhou, Jiahao Zhu
- Abstract summary: Temporal sentence (TSG) aims to localize the temporal segment which is semantically aligned with a natural language query in an untrimmed segment.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal and search space, and (B) a Temporal Sentence Tracker to track multi-modal targets' behavior and to predict query-related segment.
- Score: 51.416914256782505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding (TSG) aims to localize the temporal segment which
is semantically aligned with a natural language query in an untrimmed
video.Most existing methods extract frame-grained features or object-grained
features by 3D ConvNet or detection network under a conventional TSG framework,
failing to capture the subtle differences between frames or to model the
spatio-temporal behavior of core persons/objects. In this paper, we introduce a
new perspective to address the TSG task by tracking pivotal objects and
activities to learn more fine-grained spatio-temporal behaviors. Specifically,
we propose a novel Temporal Sentence Tracking Network (TSTNet), which contains
(A) a Cross-modal Targets Generator to generate multi-modal templates and
search space, filtering objects and activities, and (B) a Temporal Sentence
Tracker to track multi-modal targets for modeling the targets' behavior and to
predict query-related segment. Extensive experiments and comparisons with
state-of-the-arts are conducted on challenging benchmarks: Charades-STA and
TACoS. And our TSTNet achieves the leading performance with a considerable
real-time speed.
Related papers
- ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking [0.5371337604556311]
Efficiently modeling-temporal relations of objects is a key challenge in visual object tracking (VOT)
Existing methods track by appearance-based similarity or long-term relation modeling, resulting in rich temporal contexts between consecutive frames being easily overlooked.
In this paper we present ACTrack, a new framework with additive pre-temporal tracking framework with large memory conditions. It preserves the quality and capabilities of the pre-trained backbone by freezing its parameters, and makes a trainable lightweight additive net to model temporal relations in tracking.
We design an additive siamese convolutional network to ensure the integrity of spatial features and temporal sequence
arXiv Detail & Related papers (2024-02-27T07:34:08Z) - Efficient Long-Short Temporal Attention Network for Unsupervised Video
Object Segmentation [23.645412918420906]
Unsupervised Video Object (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real-time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template.
This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames.
We revamped the framework with Progressive Context.
Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z) - End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z) - SpOT: Spatiotemporal Modeling for 3D Object Tracking [68.12017780034044]
3D multi-object tracking aims to consistently identify all mobile time.
Current 3D tracking methods rely on abstracted information and limited history.
We develop a holistic representation of scenes that leverage both spatial and temporal information.
arXiv Detail & Related papers (2022-07-12T21:45:49Z) - STURE: Spatial-Temporal Mutual Representation Learning for Robust Data
Association in Online Multi-Object Tracking [7.562844934117318]
The proposed approach is capable of extracting more distinguishing detection and sequence representations.
It is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers.
arXiv Detail & Related papers (2022-01-18T08:52:40Z) - Multi-Object Tracking and Segmentation with a Space-Time Memory Network [12.043574473965318]
We propose a method for multi-object tracking and segmentation based on a novel memory-based mechanism to associate tracklets.
The proposed tracker, MeNToS, addresses particularly the long-term data association problem.
arXiv Detail & Related papers (2021-10-21T17:13:17Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic temporal-temporal network (DSNet) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance than state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.