ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised
Video Object Segmentation
- URL: http://arxiv.org/abs/2307.02010v2
- Date: Mon, 10 Jul 2023 09:20:29 GMT
- Authors: Jiahao Li, Yuanyou Xu, Zongxin Yang, Yi Yang, Yueting Zhuang
- Abstract summary: We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16.
We also employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects.
- Score: 62.98078087018469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Associating Objects with Transformers (AOT) framework has exhibited
exceptional performance in a wide range of complex scenarios for video object
segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series
that incorporates transformers at multiple feature scales. Leveraging the
hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates
object masks from previous frames to the current frame using a feature scale
with a stride of 16. Additionally, we employ GPM in a more refined feature
scale with a stride of 8, leading to improved accuracy in detecting and
tracking small objects. Through the implementation of test-time augmentations
and model ensemble techniques, we achieve the top-ranking position in the
EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.
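The hierarchical, two-scale propagation described in the abstract can be sketched schematically. The snippet below is a toy illustration, not the AOT/MSDeAOT implementation: a sigmoid similarity gate stands in for the Gated Propagation Module's attention, and the feature maps are hypothetical placeholders.

```python
import numpy as np

def downsample(mask, stride):
    """Average-pool a binary mask to a coarser grid (toy stand-in for a feature stride)."""
    h, w = mask.shape
    return mask.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))

def upsample(mask, factor):
    """Nearest-neighbour upsampling back to a finer grid."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

def propagate(prev_mask, feat_prev, feat_cur):
    """Toy gated propagation: weight the previous mask by a feature-similarity gate.
    The real GPM uses attention between frame embeddings; a sigmoid stands in here."""
    gate = 1.0 / (1.0 + np.exp(-(feat_prev * feat_cur)))  # gate in (0, 1)
    return prev_mask * gate

def multiscale_propagate(prev_mask, feats_prev, feats_cur):
    """Propagate at stride 16, then refine at stride 8 (the MSDeAOT idea, schematically)."""
    coarse = propagate(downsample(prev_mask, 16), feats_prev[16], feats_cur[16])
    fine = propagate(downsample(prev_mask, 8), feats_prev[8], feats_cur[8])
    # Blend the upsampled coarse estimate with the finer stride-8 pass.
    return 0.5 * (upsample(coarse, 2) + fine)

# Toy 32x32 frame: the previous mask covers the top-left quadrant.
prev_mask = np.zeros((32, 32))
prev_mask[:16, :16] = 1.0
feats = {16: np.ones((2, 2)), 8: np.ones((4, 4))}
out = multiscale_propagate(prev_mask, feats, feats)  # stride-8 resolution: (4, 4)
```

The key design point mirrored here is that the coarse pass fixes object identity cheaply, while the finer pass recovers detail that a stride-16 grid cannot represent, which is where the gains on small objects come from.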
Related papers
- Contrastive Learning for Multi-Object Tracking with Transformers [79.61791059432558]
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
arXiv Detail & Related papers (2023-11-14T10:07:52Z)
- ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales with strides of 16 and 8.
As a testament to the effectiveness of our design, we achieved the 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
arXiv Detail & Related papers (2023-07-05T03:50:58Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment object instances in a given video referred by a language expression in all video frames.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
- Transformer Scale Gate for Semantic Segmentation [53.27673119360868]
Transformer Scale Gate (TSG) exploits cues in self and cross attentions in Vision Transformers for the scale selection.
Our experiments on the Pascal Context and ADE20K datasets demonstrate that our feature selection strategy achieves consistent gains.
arXiv Detail & Related papers (2022-05-14T13:11:39Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
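Test-time augmentation, which the MSDeAOT abstract credits for part of its final result, can be illustrated with a minimal horizontal-flip ensemble. The `segment` function below is a hypothetical stand-in for any mask predictor; the averaging scheme is the standard flip-TTA recipe, not the authors' exact pipeline.

```python
import numpy as np

def segment(frame):
    """Stand-in segmentation model: thresholds intensity. Any mask predictor fits here."""
    return (frame > 0.5).astype(float)

def tta_segment(frame):
    """Horizontal-flip test-time augmentation: predict on the original and the
    flipped frame, un-flip the second prediction, and average the two masks."""
    direct = segment(frame)
    flipped = segment(frame[:, ::-1])[:, ::-1]  # flip back to the original orientation
    return 0.5 * (direct + flipped)

frame = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
mask = tta_segment(frame)
```

Averaging soft masks from augmented views (and, analogously, from several model checkpoints) smooths out view-dependent errors before the final threshold, which is the usual rationale for TTA and ensembling in segmentation challenges.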
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.