DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
- URL: http://arxiv.org/abs/2502.17157v2
- Date: Tue, 25 Feb 2025 04:40:43 GMT
- Title: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
- Authors: Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
- Abstract summary: We create a generalist perception model that can tackle multiple tasks within limits on computational resources and training data. Exhaustive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation.
- Score: 51.439283251703635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our primary goal is to create a good, generalist perception model that can tackle multiple tasks within limits on computational resources and training data. To achieve this, we build on text-to-image diffusion models pre-trained on billions of images. Exhaustive evaluations demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of its data (600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation lets us fully leverage pre-trained text-to-image models, so DICEPTION can be trained at a cost orders of magnitude lower than that of conventional models trained from scratch. Adapting the model to other tasks requires fine-tuning on as few as 50 images and as little as 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Hugging Face Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
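The color-encoding strategy mentioned above can be made concrete with a small sketch. The following is an illustrative Python rendering of "assign each instance a random color" for segmentation targets, assuming instance masks are available as boolean arrays; the function and variable names are ours, not the paper's:

```python
import numpy as np

def encode_instances_as_colors(masks, seed=None):
    """Render binary instance masks as a single RGB image by giving each
    instance a random color (illustrative sketch of the color-encoding
    idea described in the abstract, not official DICEPTION code).

    masks: list of (H, W) boolean arrays, one per instance.
    Returns an (H, W, 3) uint8 image usable as a generation target.
    """
    rng = np.random.default_rng(seed)
    h, w = masks[0].shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    for mask in masks:
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        canvas[mask] = color  # later instances overwrite earlier ones on overlap
    return canvas
```

Encoded this way, a segmentation label becomes an ordinary RGB image, so the same conditional image generator can be trained on it alongside depth maps, normals, or any other task rendered in pixels.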
Related papers
- Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC [77.8851460746251]
We propose a simple, efficient, and general approach to fine-tuning diffusion models. ONE-PIC enhances the generative ability inherited from the pretrained diffusion model without introducing additional modules. Our method streamlines the adaptation process and achieves excellent performance at lower cost.
arXiv Detail & Related papers (2024-12-07T11:19:32Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches.
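As a rough illustration of patch-level mutual attention (a sketch under assumed token shapes, not the authors' implementation), cross-attending support and query patch embeddings in both directions might look like:

```python
import torch
import torch.nn.functional as F

def mutual_patch_attention(support: torch.Tensor, query: torch.Tensor):
    """Attend support and query patch tokens to each other.
    support: (Ns, D) patch embeddings of a support image.
    query:   (Nq, D) patch embeddings of a query image.
    Returns mutually attended features with the same shapes.
    Illustrative only; the paper's exact formulation may differ."""
    d = support.shape[-1]
    # Query patches aggregate support information ...
    attn_q = F.softmax(query @ support.T / d ** 0.5, dim=-1)   # (Nq, Ns)
    query_out = attn_q @ support                               # (Nq, D)
    # ... and support patches aggregate query information.
    attn_s = F.softmax(support @ query.T / d ** 0.5, dim=-1)   # (Ns, Nq)
    support_out = attn_s @ query                               # (Ns, D)
    return support_out, query_out
```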
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work to solve multiple practical warping tasks with a single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the need of Vision Transformer networks for very large fully annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- MOFI: Learning Image Representations from Noisy Entity Annotated Images [47.6984817573981]
We present MOFI, a new vision foundation model designed to learn image representations from noisy entity annotated images.
We introduce a new approach to automatically assign entity labels to images from noisy image-text pairs.
Our approach involves employing a named entity recognition model to extract entities from the alt-text, and then using a CLIP model to select the correct entities as labels of the paired image.
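That two-stage labeling pipeline can be approximated with off-the-shelf tools. The sketch below uses spaCy for named entity recognition and a Hugging Face CLIP checkpoint for image-text matching; the model names and the threshold are our assumptions, not MOFI's actual configuration:

```python
import spacy
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

nlp = spacy.load("en_core_web_sm")  # generic English NER model (assumption)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def entity_labels(image: Image.Image, alt_text: str, threshold: float = 0.25):
    """Extract candidate entities from alt-text via NER, then keep the ones
    CLIP judges to match the image. Illustrative sketch only."""
    candidates = [ent.text for ent in nlp(alt_text).ents]
    if not candidates:
        return []
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Relative match scores among the candidate entities for this image.
        probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
    return [c for c, p in zip(candidates, probs.tolist()) if p >= threshold]
```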
arXiv Detail & Related papers (2023-06-13T17:51:18Z)
- Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [63.88319217738223]
We present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.
Matcher demonstrates impressive generalization performance across various segmentation tasks, all without training.
Our results further showcase the open-world generality and flexibility of Matcher when applied to images in the wild.
arXiv Detail & Related papers (2023-05-22T17:59:43Z)
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning [98.78475432114595]
In-context learning allows the model to rapidly adapt to various tasks with only a handful of prompts and examples.
It is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks.
We present Painter, a generalist model that redefines the output of core vision tasks as images and specifies task prompts also as images.
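Concretely, in-context inference of this kind can be pictured as stitching a prompt input/output pair together with the query and letting the model paint the missing quadrant. The sketch below only assembles such a grid; the shapes and layout are our assumptions for illustration, not Painter's released code:

```python
import numpy as np

def build_incontext_grid(prompt_in, prompt_out, query_in):
    """Stitch a 2x2 grid: the prompt input/output pair on top, the query
    input bottom-left, and a blank bottom-right quadrant for the model to
    paint. All arguments are (H, W, 3) uint8 arrays of equal size."""
    blank = np.zeros_like(query_in)                  # quadrant to be predicted
    top = np.concatenate([prompt_in, prompt_out], axis=1)
    bottom = np.concatenate([query_in, blank], axis=1)
    return np.concatenate([top, bottom], axis=0)     # (2H, 2W, 3)
```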
arXiv Detail & Related papers (2022-12-05T18:59:50Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
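The summary does not spell out the schedule, but curricula of this kind are often implemented as an input-resolution ramp that grows from cheap, low-resolution inputs to full size. The sketch below illustrates that generic idea only; it is not EfficientTrain's specific algorithm:

```python
def input_resolution(epoch: int, total_epochs: int,
                     min_res: int = 160, max_res: int = 224) -> int:
    """Linearly grow the training resolution over epochs, a generic
    easy-to-hard curriculum (illustrative; EfficientTrain differs)."""
    frac = epoch / max(total_epochs - 1, 1)
    res = min_res + frac * (max_res - min_res)
    return int(round(res / 32) * 32)  # snap to hardware-friendly multiples of 32
```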
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
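The token-level recipe in this summary (VQ tokens at both input and output) centers on a masking step. A minimal sketch of that corruption step, with names of our choosing, might be:

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_ratio: float, mask_id: int):
    """Randomly replace a fraction of VQ codebook indices with a [MASK] id,
    the corruption step of masked generative modeling over discrete tokens
    (illustrative; MAGE varies the masking ratio during training).

    tokens: (B, N) long tensor of codebook indices.
    Returns the corrupted tokens and the boolean mask of replaced positions.
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = tokens.masked_fill(mask, mask_id)
    return corrupted, mask
```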
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate model performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)