One does not fit all! On the Complementarity of Vision Encoders for
Vision and Language Tasks
- URL: http://arxiv.org/abs/2210.06379v2
- Date: Thu, 8 Jun 2023 15:42:13 GMT
- Title: One does not fit all! On the Complementarity of Vision Encoders for
Vision and Language Tasks
- Authors: Gregor Geigle, Chen Cecilia Liu, Jonas Pfeiffer and Iryna Gurevych
- Abstract summary: Multimodal models are aimed at solving Vision and Language (V+L) tasks.
Current work assumes that a single pre-trained VE can serve as a general-purpose encoder.
In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary.
- Score: 59.49639580525051
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current multimodal models, aimed at solving Vision and Language (V+L) tasks,
predominantly repurpose Vision Encoders (VE) as feature extractors. While many
VEs -- of different architectures, trained on different data and objectives --
are publicly available, they are not designed for the downstream V+L tasks.
Nonetheless, most current work assumes that a \textit{single} pre-trained VE
can serve as a general-purpose encoder. In this work, we focus on analysis and
aim to understand whether the information stored within different VEs is
complementary, i.e. if providing the model with features from multiple VEs can
improve the performance on a target task, and how they are combined. We
exhaustively experiment with three popular VEs on six downstream V+L tasks and
analyze the attention and VE-dropout patterns. Our analyses suggest that
diverse VEs complement each other, resulting in improved downstream V+L task
performance, where the improvements are not due to simple ensemble effects
(i.e. the performance does not always improve when increasing the number of
encoders). We demonstrate that future VEs, which are not \textit{repurposed},
but explicitly \textit{designed} for V+L tasks, have the potential of improving
performance on the target V+L tasks.
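To make the idea of combining multiple VEs concrete, below is a minimal sketch in PyTorch; it is an illustration under assumed encoder choices and feature dimensions, not the authors' released code. Each encoder's features are projected to a shared width, occasionally dropped as a whole (a simple form of VE-dropout), and concatenated along the token axis so a multimodal transformer can attend over all visual tokens together with the text.

# Minimal sketch (assumption, not the paper's implementation): fusing features
# from multiple pre-trained vision encoders before a cross-modal transformer.
import torch
import torch.nn as nn


class MultiVEFusion(nn.Module):
    def __init__(self, ve_dims, d_model=768, ve_dropout=0.1):
        super().__init__()
        # One linear projection per vision encoder, mapping to a shared width.
        self.projections = nn.ModuleList(nn.Linear(d, d_model) for d in ve_dims)
        self.ve_dropout = ve_dropout  # probability of dropping an entire encoder

    def forward(self, ve_features):
        # ve_features: list of [batch, num_visual_tokens_i, dim_i] tensors,
        # one per vision encoder (e.g., a ViT, a grid CNN, an object detector).
        projected = []
        for proj, feats in zip(self.projections, ve_features):
            if self.training and torch.rand(()).item() < self.ve_dropout:
                continue  # VE-dropout: skip this encoder for the current batch
            projected.append(proj(feats))
        if not projected:  # make sure at least one encoder survives dropout
            projected.append(self.projections[0](ve_features[0]))
        # Concatenate along the token axis; a multimodal transformer can then
        # attend jointly over text tokens and all visual tokens.
        return torch.cat(projected, dim=1)


if __name__ == "__main__":
    fusion = MultiVEFusion(ve_dims=[768, 1024, 2048])
    feats = [torch.randn(2, 197, 768),   # e.g., ViT patch features
             torch.randn(2, 49, 1024),   # e.g., grid features from a CNN
             torch.randn(2, 36, 2048)]   # e.g., region features from a detector
    print(fusion(feats).shape)  # e.g., torch.Size([2, 282, 768]) if nothing is dropped

The encoder names, dimensions, and the concatenation-along-tokens strategy above are illustrative assumptions; the paper itself experiments with three popular VEs on six V+L tasks and analyzes attention and VE-dropout patterns to probe how the fused encoders are actually used.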
Related papers
- ViSTa Dataset: Do vision-language models understand sequential tasks? [6.039062076849557]
Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety.
We introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks.
ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments.
arXiv Detail & Related papers (2024-11-20T11:19:22Z)
- Intriguing Properties of Large Language and Vision Models [18.449076451976236]
Large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance.
Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks remains surprisingly low.
We investigate this question by evaluating the most common LLVM families (i.e., LLaVA) across 10 evaluation benchmarks.
arXiv Detail & Related papers (2024-10-07T05:07:01Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Unveiling Encoder-Free Vision-Language Models [62.52803514667452]
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks.
We bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs.
We launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently.
arXiv Detail & Related papers (2024-06-17T17:59:44Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding [33.33424214458285]
Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks.
However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge.
We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects.
arXiv Detail & Related papers (2023-11-30T03:20:37Z)
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks [118.49566068398642]
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets.
Unimodal encoders are pretrained with simpler annotations that are less cost-prohibitive, achieving scales of hundreds of millions to billions.
We propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
arXiv Detail & Related papers (2022-04-22T04:41:04Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- How Much Can CLIP Benefit Vision-and-Language Tasks? [121.46042421728016]
CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks.
We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
arXiv Detail & Related papers (2021-07-13T20:48:12Z)