Associating Objects with Transformers for Video Object Segmentation
- URL: http://arxiv.org/abs/2106.02638v1
- Date: Fri, 4 Jun 2021 17:59:57 GMT
- Title: Associating Objects with Transformers for Video Object Segmentation
- Authors: Zongxin Yang, Yunchao Wei, Yi Yang
- Abstract summary: We propose an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly.
AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space.
We ranked 1st in the 3rd Large-scale Video Object Segmentation Challenge.
- Score: 74.51719591192787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates how to realize better and more efficient embedding
learning to tackle the semi-supervised video object segmentation under
challenging multi-object scenarios. The state-of-the-art methods learn to
decode features with a single positive object and thus have to match and
segment each target separately under multi-object scenarios, consuming multiple
times the computing resources. To solve this problem, we propose an Associating
Objects with Transformers (AOT) approach to match and decode multiple objects
uniformly. In detail, AOT employs an identification mechanism to associate
multiple targets into the same high-dimensional embedding space. Thus, we can
simultaneously process the matching and segmentation decoding of multiple
objects as efficiently as processing a single object. To sufficiently model
multi-object association, a Long Short-Term Transformer is designed for
constructing hierarchical matching and propagation. We conduct extensive
experiments on both multi-object and single-object benchmarks to examine AOT
variant networks with different complexities. Particularly, our AOT-L
outperforms all the state-of-the-art competitors on three popular benchmarks,
i.e., YouTube-VOS (83.7% J&F), DAVIS 2017 (83.0%), and DAVIS 2016 (91.0%),
while keeping better multi-object efficiency. Meanwhile, our AOT-T maintains
real-time multi-object speed on the above benchmarks. We ranked 1st in the 3rd
Large-scale Video Object Segmentation Challenge. The code will be publicly
available at https://github.com/z-x-yang/AOT.
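For intuition, the identification mechanism described above can be read as attaching a learnable identity vector to each object slot and projecting a multi-object mask into one shared embedding, so a single matching and decoding pass covers every target. Below is a minimal PyTorch-style sketch of that idea based only on the abstract; the class name IDEmbedding and the hyperparameters (max_objects, embed_dim) are illustrative assumptions, not the authors' implementation (see the linked repository for the real code).

```python
# Sketch of an identification-embedding mechanism in the spirit of AOT's
# abstract: every object ID gets a learnable vector, and a multi-object mask
# is folded into one shared embedding. Names/shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IDEmbedding(nn.Module):
    """Fold a multi-object one-hot mask into a single shared embedding."""

    def __init__(self, max_objects: int = 10, embed_dim: int = 256):
        super().__init__()
        # One learnable identity vector per object slot, plus background.
        self.id_bank = nn.Parameter(torch.randn(max_objects + 1, embed_dim))

    def forward(self, masks: torch.Tensor) -> torch.Tensor:
        # masks: (B, N, H, W) one-hot masks, N = max_objects + 1.
        # A weighted sum over the ID bank yields one (B, C, H, W) map that
        # encodes all objects at once, ready to be added to image features.
        return torch.einsum("bnhw,nc->bchw", masks, self.id_bank)


# Toy usage: three labels (background + 2 objects) on a 24x24 grid collapse
# into one embedding, so decoding cost does not grow with the object count.
labels = torch.randint(0, 3, (1, 24, 24))
masks = F.one_hot(labels, num_classes=11).permute(0, 3, 1, 2).float()
id_emb = IDEmbedding(max_objects=10, embed_dim=256)(masks)
print(id_emb.shape)  # torch.Size([1, 256, 24, 24])
```

Because all identities live in one embedding space, the per-frame matching and decoding cost stays roughly constant as objects are added, which is the efficiency argument the abstract makes.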
Related papers
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification [24.709695178222862]
We propose ISAR, a benchmark and baseline method for single- and few-shot object identification.
We provide a semi-synthetic dataset of video sequences with ground-truth semantic annotations.
Our benchmark aligns with the emerging research trend of unifying Multi-Object Tracking, Video Object Segmentation, and Re-identification.
arXiv Detail & Related papers (2023-11-05T18:51:33Z)
- CASAPose: Class-Adaptive and Semantic-Aware Multi-Object Pose Estimation [2.861848675707602]
We present a new single-stage architecture called CASAPose.
It determines 2D-3D correspondences for pose estimation of multiple different objects in RGB images in one pass.
It is fast and memory efficient, and achieves high accuracy for multiple objects.
arXiv Detail & Related papers (2022-10-11T10:20:01Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design different networks for similar tasks separately, making them difficult to apply to each other.
We introduce a unified framework, termed UFO (Unified Object Framework for Co-Object Segmentation), to tackle these issues.
arXiv Detail & Related papers (2022-03-09T13:35:19Z)