Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation
- URL: http://arxiv.org/abs/2106.11958v1
- Date: Tue, 22 Jun 2021 17:57:24 GMT
- Title: Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation
- Authors: Lei Ke, Xia Li, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang and
Fisher Yu
- Abstract summary: Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
- Score: 95.74244714914052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple object tracking and segmentation requires detecting, tracking, and
segmenting objects belonging to a set of given classes. Most approaches only
exploit the temporal dimension to address the association problem, while
relying on single frame predictions for the segmentation mask itself. We
propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich
spatio-temporal information for online multiple object tracking and
segmentation. PCAN first distills a space-time memory into a set of prototypes
and then employs cross-attention to retrieve rich information from the past
frames. To segment each object, PCAN adopts a prototypical appearance module to
learn a set of contrastive foreground and background prototypes, which are then
propagated over time. Extensive experiments demonstrate that PCAN outperforms
current video instance tracking and segmentation competition winners on both
Youtube-VIS and BDD100K datasets, and shows efficacy to both one-stage and
two-stage segmentation frameworks. Code will be available at
http://vis.xyz/pub/pcan.
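The abstract describes the mechanism only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of the general idea of distilling a space-time memory into a small set of prototypes and then reading from them with cross-attention. It is not the authors' implementation; the function names, the EM-style soft-clustering step, the prototype count, and the feature sizes are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): condense a flattened space-time
# memory of past-frame features into K prototypes via a few soft-clustering
# (EM-style) iterations, then let current-frame features read from those
# prototypes with cross-attention. All sizes and names are illustrative.
import torch


def distill_prototypes(memory: torch.Tensor, num_prototypes: int = 64,
                       iters: int = 3) -> torch.Tensor:
    """Condense a space-time memory (N, C) into prototypes (K, C)."""
    n, c = memory.shape
    # Initialise prototypes from randomly chosen memory vectors.
    protos = memory[torch.randperm(n)[:num_prototypes]].clone()
    for _ in range(iters):
        # E-step: soft-assign every memory vector to every prototype.
        assign = (memory @ protos.t() / c ** 0.5).softmax(dim=1)      # (N, K)
        # M-step: prototypes become assignment-weighted means of the memory.
        protos = (assign.t() @ memory) / (assign.sum(dim=0, keepdim=True).t() + 1e-6)
    return protos


def prototypical_cross_attention(query: torch.Tensor,
                                 protos: torch.Tensor) -> torch.Tensor:
    """Current-frame features (HW, C) attend to the K prototypes (K, C)."""
    c = query.shape[1]
    attn = (query @ protos.t() / c ** 0.5).softmax(dim=1)             # (HW, K)
    return attn @ protos                                              # (HW, C)


# Toy usage: a memory of 4 past frames of 32x32 features with 256 channels.
memory = torch.randn(4 * 32 * 32, 256)      # flattened space-time memory
current = torch.randn(32 * 32, 256)         # flattened current-frame features
protos = distill_prototypes(memory, num_prototypes=64)
readout = prototypical_cross_attention(current, protos)               # (1024, 256)
```

Attending to a few dozen prototypes rather than to every memory pixel keeps the readout cost roughly constant as more past frames are added. In PCAN itself, an analogous prototype readout is additionally applied per tracked instance with contrastive foreground and background prototypes that are propagated over time; the sketch above only shows a frame-level readout.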
Related papers
- Lidar Panoptic Segmentation and Tracking without Bells and Whistles [48.078270195629415]
We propose a detection-centric network for lidar segmentation and tracking.
One of the core components of our network is the object instance detection branch.
We evaluate our method on several 3D/4D LPS benchmarks and observe that our model establishes a new state-of-the-art among open-sourced models.
arXiv Detail & Related papers (2023-10-19T04:44:43Z) - Multi-grained Temporal Prototype Learning for Few-shot Video Object
Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video of the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z) - Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual
Tracking and Segmentation [37.85026590250023]
This paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation (MITS).
A novel pinpoint box predictor is proposed for accurate multi-object box prediction.
MITS achieves state-of-the-art performance on both Visual Object Tracking (VOT) and Video Object Segmentation (VOS) benchmarks.
arXiv Detail & Related papers (2023-08-25T09:37:51Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS 2017 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Revisiting Sequence-to-Sequence Video Object Segmentation with
Multi-Task Loss and Skip-Memory [4.343892430915579]
Video Object Segmentation (VOS) is an active research area in the visual domain.
Current approaches lose objects in longer sequences, especially when the object is small or briefly occluded.
We build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data.
arXiv Detail & Related papers (2020-04-25T15:38:09Z)