Does language help generalization in vision models?
- URL: http://arxiv.org/abs/2104.08313v1
- Date: Fri, 16 Apr 2021 18:54:14 GMT
- Title: Does language help generalization in vision models?
- Authors: Benjamin Devillers, Romain Bielawski, Bhavin Choksi and Rufin VanRullen
- Abstract summary: We show that a visual model trained on a very large supervised image dataset (ImageNet-21k) can be as efficient for generalization as its multimodal counterpart (CLIP).
When compared to other standard visual or language models, the latent representations of BiT-M were found to be just as "linguistic" as those of CLIP.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision models trained on multimodal datasets have recently proved very
efficient, both in terms of the wide availability of large image-caption
datasets, and in terms of the resulting model's ability to generalize to
multiple downstream tasks (e.g. zero-shot learning). One might assume that
these abilities are derived, at least in part, from a "semantic grounding" of
the visual feature space, learning meaningful structure by mirroring the space
of linguistic representations. Contrary to this intuition, we show that a
visual model (BiT-M) trained on a very large supervised image dataset
(ImageNet-21k) can be as efficient for generalization (few-shot learning,
unsupervised clustering) as its multimodal counterpart (CLIP). When compared to
other standard visual or language models, the latent representations of BiT-M
were found to be just as "linguistic" as those of CLIP. Overall, these findings
suggest that the main factor driving improvements of generalization in current
models is the size of the training dataset, not (solely) the multimodal
grounding property.
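To make the comparison concrete, the few-shot evaluation described in the abstract reduces to training an identical linear probe on frozen features from each backbone and comparing accuracies. The snippet below is a minimal sketch of that protocol, not the paper's code: the feature arrays are random placeholders standing in for embeddings extracted once from frozen CLIP and BiT-M encoders, and the k-shot split with a logistic-regression probe is the standard recipe assumed here.

```python
# Minimal sketch of a k-shot linear-probe comparison on frozen features.
# The placeholder arrays stand in for embeddings from frozen CLIP / BiT-M
# encoders; in practice you would compute them once over the evaluation set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, n_per_class, dim, k_shot = 10, 50, 512, 5

labels = np.repeat(np.arange(n_classes), n_per_class)
features = {
    "CLIP (placeholder)": rng.normal(size=(labels.size, dim)),
    "BiT-M (placeholder)": rng.normal(size=(labels.size, dim)),
}

def k_shot_accuracy(feats, labels, k):
    """Train a linear probe on k examples per class, test on the rest."""
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(feats[train_idx], labels[train_idx])
    return probe.score(feats[test_idx], labels[test_idx])

for name, feats in features.items():
    print(f"{name}: {k_shot_accuracy(feats, labels, k_shot):.3f}")
```

Because both probes share the same splits and classifier, any accuracy gap with real embeddings would reflect the pretraining data and objective rather than the evaluation setup, which is the comparison the abstract refers to.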
Related papers
- Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a more interpretable VLA framework by learning a unified lexical representation for both modalities without complex design.
We demonstrate that these two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
- EVLM: An Efficient Vision-Language Model for Visual Understanding [18.794601813330715]
This paper proposes an efficient multi-modal language model to minimize computational costs.
Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
arXiv Detail & Related papers (2024-07-19T10:09:51Z)
- Continually Learn to Map Visual Concepts to Large Language Models in Resource-constrained Environments [7.481446615819558]
Continual Visual Mapping (CVM) is an approach that continually grounds vision representations in a knowledge space extracted from a fixed language model.
CVM outperforms state-of-the-art continual learning methods on five benchmarks.
arXiv Detail & Related papers (2024-07-11T08:28:40Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- LLM2Loss: Leveraging Language Models for Explainable Model Diagnostics [5.33024001730262]
We propose an approach that can provide semantic insights into a model's patterns of failures and biases.
We show that an ensemble of such lightweight models can be used to generate insights on the performance of the black-box model.
arXiv Detail & Related papers (2023-05-04T23:54:37Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute cost and performance when scaling vision-language models. (A generic sketch of the top-k expert-gating idea behind sparse MoE appears after this list.)
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
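For the sparse mixture-of-experts entry above ("Scaling Vision-Language Models with Sparse Mixture of Experts"), the key mechanism is a learned router that activates only a few expert MLPs per token, so parameter count grows while per-token compute stays close to that of a dense layer. The PyTorch layer below is a generic top-k-gated MoE sketch written for illustration; the expert count, hidden size, and k=2 routing are assumptions, not details from that paper.

```python
# Generic top-k sparse mixture-of-experts layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=256, hidden=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)  # learned gating score per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (n_tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)           # (n_tokens, n_experts)
        weights, chosen = gates.topk(self.k, dim=-1)        # keep k experts per token
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:                      # expert unused this batch
                continue
            w = weights[token_idx, slot].unsqueeze(-1)      # routing weight per token
            out[token_idx] += w * expert(x[token_idx])      # only routed tokens run
        return out

tokens = torch.randn(16, 256)
print(SparseMoE()(tokens).shape)  # -> torch.Size([16, 256])
```

Only the experts selected for a token are run on it, which is why the effective compute per token tracks k rather than the total number of experts.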
This list is automatically generated from the titles and abstracts of the papers on this site.