Self-Supervised Open-Ended Classification with Small Visual Language
Models
- URL: http://arxiv.org/abs/2310.00500v2
- Date: Wed, 6 Dec 2023 13:16:52 GMT
- Title: Self-Supervised Open-Ended Classification with Small Visual Language
Models
- Authors: Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek,
Marcel Worring, Yuki M. Asano
- Abstract summary: We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters, we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
- Score: 60.23212389067007
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present Self-Context Adaptation (SeCAt), a self-supervised approach that
unlocks few-shot abilities for open-ended classification with small visual
language models. Our approach imitates image captions in a self-supervised way
based on clustering a large pool of images followed by assigning semantically
unrelated names to the clusters. By doing so, we construct a training signal
consisting of interleaved sequences of image and pseudo-caption pairs and a
query image, which we denote as the 'self-context' sequence. Based on this
signal, the model is trained to produce the correct pseudo-caption. We
demonstrate the performance and flexibility of SeCAt on several multimodal
few-shot datasets spanning various granularities. By using models with
approximately 1B parameters, we outperform the few-shot abilities of much
larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for
research and applications in open-ended few-shot learning that otherwise
require access to large or proprietary models.
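To make the self-context construction concrete, the following is a minimal sketch of
the training-signal pipeline described in the abstract: cluster an unlabeled image
pool, assign semantically unrelated pseudo-names to the clusters, and assemble
interleaved image/pseudo-caption sequences that end with a query image. The clustering
method (k-means over frozen vision-encoder features), the nonsense pseudo-names such
as 'dax0', and all function names and hyperparameters are illustrative assumptions,
not the authors' implementation.

import random
import numpy as np
from sklearn.cluster import KMeans

def build_self_context_episodes(image_features, images, num_clusters=100,
                                shots=4, num_episodes=1000, seed=0):
    """Sketch of SeCAt-style self-context episodes (assumed interface)."""
    rng = random.Random(seed)

    # 1) Cluster a large pool of image features (e.g., from a frozen vision encoder).
    labels = KMeans(n_clusters=num_clusters, random_state=seed).fit_predict(image_features)

    # 2) Assign semantically unrelated pseudo-names to the clusters
    #    (hypothetical nonsense vocabulary; the real naming scheme may differ).
    cluster_to_name = {c: f"dax{c}" for c in range(num_clusters)}
    by_cluster = {c: np.where(labels == c)[0].tolist() for c in range(num_clusters)}
    usable = [c for c, idx in by_cluster.items() if len(idx) > shots]

    episodes = []
    for _ in range(num_episodes):
        # 3) Sample a few clusters, several support images per cluster, and a query.
        clusters = rng.sample(usable, k=2)
        context = []
        for c in clusters:
            for i in rng.sample(by_cluster[c], k=shots):
                context.append((images[i], cluster_to_name[c]))  # interleaved image/pseudo-caption pairs
        rng.shuffle(context)
        query_cluster = rng.choice(clusters)
        query_image = images[rng.choice(by_cluster[query_cluster])]

        # 4) The model sees the interleaved context plus the query image and is
        #    trained to generate the query's pseudo-caption.
        episodes.append({"context": context,
                         "query_image": query_image,
                         "target": cluster_to_name[query_cluster]})
    return episodes

Each episode could then be serialized into the model's interleaved image-text input
format and optimized with, for example, a standard caption-generation loss on the
target pseudo-caption.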
Related papers
- A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks.
Our approach enables versatile capabilities via different inference-time sampling schemes.
Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
- Boosting Few-Shot Detection with Large Language Models and Layout-to-Image Synthesis [1.1633929083694388]
We propose a framework for enhancing few-shot detection beyond state-of-the-art generative augmentation approaches.
We introduce a novel layout-aware CLIP score for sample ranking, enabling tight coupling between generated layouts and images.
With our approach, a YOLOX-S baseline is boosted by more than 140%, 50%, and 35% in mAP on the COCO 5-, 10-, and 30-shot settings, respectively.
arXiv Detail & Related papers (2024-10-09T12:57:45Z)
- Investigating Self-Supervised Methods for Label-Efficient Learning [27.029542823306866]
We study different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, for their low-shot capabilities.
We introduce a framework involving both masked image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks.
When testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
arXiv Detail & Related papers (2024-06-25T10:56:03Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different, relatively small open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Hierarchical Few-Shot Generative Models [18.216729811514718]
We study a latent-variable approach that extends the Neural Statistician to a fully hierarchical model with attention-based point-to-set-level aggregation.
Our results show that the hierarchical formulation better captures the intrinsic variability within the sets in the small data regime.
arXiv Detail & Related papers (2021-10-23T19:19:39Z)
- Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.