UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
- URL: http://arxiv.org/abs/2312.15715v1
- Date: Mon, 25 Dec 2023 12:54:11 GMT
- Title: UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
- Authors: Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
- Abstract summary: We propose UniRef++ to unify the four reference-based object segmentation tasks with a single architecture.
With the unified designs, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run-time.
Our proposed UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network.
- Score: 92.52589788633856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reference-based object segmentation tasks, namely referring image
segmentation (RIS), few-shot image segmentation (FSS), referring video object
segmentation (RVOS), and video object segmentation (VOS), aim to segment a
specific object by utilizing either language or annotated masks as references.
Despite significant progress in each respective field, current methods are
task-specifically designed and developed in different directions, which hinders
the activation of multi-task capabilities for these tasks. In this work, we end
the current fragmented situation and propose UniRef++ to unify the four
reference-based object segmentation tasks with a single architecture. At the
heart of our approach is the proposed UniFusion module which performs
multiway-fusion for handling different tasks with respect to their specified
references. And a unified Transformer architecture is then adopted for
achieving instance-level segmentation. With the unified designs, UniRef++ can
be jointly trained on a broad range of benchmarks and can flexibly complete
multiple tasks at run-time by specifying the corresponding references. We
evaluate our unified models on various benchmarks. Extensive experimental
results indicate that our proposed UniRef++ achieves state-of-the-art
performance on RIS and RVOS, and performs competitively on FSS and VOS with a
parameter-shared network. Moreover, we showcase that the proposed UniFusion
module could be easily incorporated into the current advanced foundation model
SAM and obtain satisfactory results with parameter-efficient finetuning. Codes
and models are available at \url{https://github.com/FoundationVision/UniRef}.
Related papers
- Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes [11.575313825919205]
We introduce a novel task called Reference Audio-Visual Traditional (Ref-AVS)
Ref-AVS seeks to segment objects based on expressions containing multimodal cues.
We propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance.
arXiv Detail & Related papers (2024-07-15T17:54:45Z) - OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - BURST: A Benchmark for Unifying Object Recognition, Segmentation and
Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z) - Leveraging GAN Priors for Few-Shot Part Segmentation [43.35150430895919]
Few-shot part segmentation aims to separate different parts of an object given only a few samples.
We propose to learn task-specific features in a "pre-training"-"fine-tuning" paradigm.
arXiv Detail & Related papers (2022-07-27T10:17:07Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Instance-Specific Feature Propagation for Referring Segmentation [28.58551450280675]
Referring segmentation aims to generate a segmentation mask for the target instance indicated by a natural language expression.
We propose a novel framework that simultaneously detects the target-of-interest via feature propagation and generates a fine-grained segmentation mask.
arXiv Detail & Related papers (2022-04-26T07:08:14Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.