Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
- URL: http://arxiv.org/abs/2305.13310v2
- Date: Fri, 19 Jan 2024 13:03:04 GMT
- Title: Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
- Authors: Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen
- Abstract summary: We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
- Score: 63.88319217738223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Powered by large-scale pre-training, vision foundation models exhibit
significant potential in open-world image understanding. However, unlike large
language models that excel at directly tackling various language tasks, vision
foundation models require a task-specific model structure followed by
fine-tuning on specific tasks. In this work, we present Matcher, a novel
perception paradigm that utilizes off-the-shelf vision foundation models to
address various perception tasks. Matcher can segment anything by using an
in-context example without training. Additionally, we design three effective
components within the Matcher framework to collaborate with these foundation
models and unleash their full potential in diverse perception tasks. Matcher
demonstrates impressive generalization performance across various segmentation
tasks, all without training. For example, it achieves 52.7% mIoU on COCO-20$^i$
with one example, surpassing the state-of-the-art specialist model by 1.6%. In
addition, Matcher achieves 33.0% mIoU on the proposed LVIS-92$^i$ for one-shot
semantic segmentation, outperforming the state-of-the-art generalist model by
14.4%. Our visualization results further showcase the open-world generality and
flexibility of Matcher when applied to images in the wild. Our code can be
found at https://github.com/aim-uofa/Matcher.
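To make the in-context matching idea concrete, the snippet below is a minimal sketch of one-shot segmentation via feature matching, not the authors' implementation: it assumes patch-level features from an off-the-shelf encoder (a DINOv2-style backbone, stubbed with random tensors here) and a SAM-like promptable segmenter driven by point prompts; the helper names match_reference_to_target and segment_anything_with_points are hypothetical.

```python
# Minimal sketch of one-shot segmentation via feature matching (not the official
# Matcher implementation). Patch features from a vision foundation model and the
# promptable segmenter are both stubbed with placeholders.
import torch
import torch.nn.functional as F


def match_reference_to_target(ref_feats, ref_mask, tgt_feats, top_k=10):
    """Select point prompts in the target image whose patch features best match
    the masked region of the reference (in-context) image.

    ref_feats: (H, W, C) patch features of the reference image
    ref_mask:  (H, W) boolean mask of the in-context example
    tgt_feats: (H, W, C) patch features of the target image
    Returns (top_k, 2) patch coordinates (row, col) in the target grid.
    """
    H, W, C = tgt_feats.shape
    ref = F.normalize(ref_feats[ref_mask], dim=-1)        # (N_ref, C)
    tgt = F.normalize(tgt_feats.reshape(-1, C), dim=-1)    # (H*W, C)
    sim = ref @ tgt.T                                      # cosine similarity (N_ref, H*W)
    # Score each target patch by its best match to any masked reference patch.
    score_per_patch = sim.max(dim=0).values                # (H*W,)
    idx = score_per_patch.topk(top_k).indices
    return torch.stack([idx // W, idx % W], dim=-1)        # (top_k, 2)


def segment_anything_with_points(points):
    """Placeholder for a promptable segmenter (e.g. SAM) that turns point
    prompts into a mask. Here it simply marks the prompted patches."""
    mask = torch.zeros(32, 32, dtype=torch.bool)
    mask[points[:, 0], points[:, 1]] = True
    return mask


if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-ins for encoder outputs on a 32x32 patch grid with 256-dim features.
    ref_feats = torch.randn(32, 32, 256)
    tgt_feats = torch.randn(32, 32, 256)
    ref_mask = torch.zeros(32, 32, dtype=torch.bool)
    ref_mask[10:20, 10:20] = True                          # in-context example mask
    points = match_reference_to_target(ref_feats, ref_mask, tgt_feats, top_k=10)
    pred_mask = segment_anything_with_points(points)
    print("point prompts:", tuple(points.shape), "predicted patches:", int(pred_mask.sum()))
```

The sketch only illustrates the matching-to-prompt step: target patches are scored by their best similarity to any patch inside the reference mask, and the top-scoring locations become point prompts. The three components of the actual Matcher framework are described in the paper.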
Related papers
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single model tracker that can remain blind to any modality X during inference time.
Our training process is extremely simple, integrating multi-label classification loss with a routing function.
Our generalist and blind tracker can achieve competitive performance compared to well-established modal-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation [30.79358827005448]
Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images.
Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets.
We propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.
arXiv Detail & Related papers (2023-06-23T10:17:56Z)
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)