Track Targets by Dense Spatio-Temporal Position Encoding
- URL: http://arxiv.org/abs/2210.09455v1
- Date: Mon, 17 Oct 2022 22:04:39 GMT
- Title: Track Targets by Dense Spatio-Temporal Position Encoding
- Authors: Jinkun Cao, Hao Wu, Kris Kitani
- Abstract summary: We propose a novel paradigm to encode the position of targets for target tracking in videos using transformers.
The proposed position encoding provides location information to associate targets across frames beyond appearance matching.
Our encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information.
- Score: 27.06820571703848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a novel paradigm to encode the position of targets
for target tracking in videos using transformers. The proposed paradigm, Dense
Spatio-Temporal (DST) position encoding, encodes spatio-temporal position
information in a pixel-wise dense fashion. The encoding provides location
information that associates targets across frames, going beyond appearance
matching between objects in two bounding boxes. Compared to the
typical transformer positional encoding, our proposed encoding is applied to
the 2D CNN features instead of the projected feature vectors to avoid losing
positional information. Moreover, the designed DST encoding uniformly represents
both the location of an object in a single frame and the evolution of a
trajectory's location across frames. Integrated with the DST encoding, we build a
transformer-based multi-object tracking model. The model takes a video clip as
input and conducts target association within the clip. It can also perform
online inference by associating existing trajectories with objects from newly
arriving frames. Experiments on video multi-object tracking (MOT) and
multi-object tracking and segmentation (MOTS) datasets demonstrate the
effectiveness of the proposed DST position encoding.
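To make the core idea concrete, below is a minimal, illustrative sketch (in PyTorch) of what a dense spatio-temporal position encoding applied directly to 2D CNN feature maps could look like. The function name `dst_position_encoding`, the per-axis channel split, and the sinusoidal form are assumptions made for illustration; the paper's exact formulation is not reproduced here.

```python
import math
import torch
import torch.nn.functional as F


def dst_position_encoding(feat, frame_idx, num_frames, temperature=10000.0):
    """Illustrative dense spatio-temporal (DST) position encoding (sketch).

    Attaches sinusoidal position information pixel-wise to a 2D CNN feature
    map (rather than to flattened/projected token vectors), using the frame
    index within the clip as the temporal coordinate. The channel split and
    sinusoid form are assumptions, not the paper's formulation.

    feat:       (B, C, H, W) feature map of one frame.
    frame_idx:  index of this frame within the clip.
    num_frames: clip length, used to normalize the temporal coordinate.
    """
    B, C, H, W = feat.shape
    device = feat.device
    d = (C // 6) * 2  # channels devoted to each of the y, x, t axes (assumed split)

    # Dense normalized coordinates: one (y, x, t) triple per pixel.
    y = torch.linspace(0, 1, H, device=device).view(1, H, 1).expand(1, H, W)
    x = torch.linspace(0, 1, W, device=device).view(1, 1, W).expand(1, H, W)
    t = torch.full((1, H, W), frame_idx / max(num_frames - 1, 1), device=device)

    def sinusoid(coord, dim):
        # Standard sine/cosine encoding of a dense coordinate map over `dim` channels.
        freqs = temperature ** (2.0 * torch.arange(dim // 2, device=device) / dim)
        ang = coord.unsqueeze(1) * 2.0 * math.pi / freqs.view(1, -1, 1, 1)
        return torch.cat([ang.sin(), ang.cos()], dim=1)  # (1, dim, H, W)

    pe = torch.cat([sinusoid(y, d), sinusoid(x, d), sinusoid(t, d)], dim=1)
    pe = F.pad(pe, (0, 0, 0, 0, 0, C - pe.shape[1]))  # zero-pad any leftover channels
    return feat + pe  # broadcasts over the batch dimension
```

In a scheme like this, the per-frame encodings are added before the features are flattened into transformer tokens, so spatial and temporal coordinates survive the projection; how the DST encoding additionally represents the evolution of a whole trajectory across frames is specific to the paper and not modeled in this sketch.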
Related papers
- Transformer-based stereo-aware 3D object detection from binocular images [82.85433941479216]
We explore the model design of Transformers in binocular 3D object detection.
To achieve this goal, we present TS3D, a Stereo-aware 3D object detector.
Our proposed TS3D achieves a 41.29% Moderate Car detection average precision on the KITTI test set and takes 88 ms to detect objects from each binocular image pair.
arXiv Detail & Related papers (2023-04-24T08:29:45Z)
- Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ a shared backbone architecture to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z)
- Geometry Attention Transformer with Position-aware LSTMs for Image Captioning [8.944233327731245]
This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT could often outperform current state-of-the-art image captioning models.
arXiv Detail & Related papers (2021-10-01T11:57:50Z)
- Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation [90.74732705236336]
Language-queried video actor segmentation aims to predict the pixel-mask of the actor which performs the actions described by a natural language query in the target frames.
We propose a collaborative spatial-temporal encoder-decoder framework which contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
arXiv Detail & Related papers (2021-05-14T13:27:53Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine window or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
- TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z)
- Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving [121.44554957537613]
We propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data.
Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames.
We achieve state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.
arXiv Detail & Related papers (2020-11-27T09:35:39Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)