Images in Language Space: Exploring the Suitability of Large Language
Models for Vision & Language Tasks
- URL: http://arxiv.org/abs/2305.13782v1
- Date: Tue, 23 May 2023 07:50:36 GMT
- Authors: Sherzod Hakimov, David Schlangen
- Abstract summary: Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
Multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
- Score: 17.97052348690598
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models have demonstrated robust performance on various
language tasks using zero-shot or few-shot learning paradigms. While being
actively researched, multimodal models that can additionally handle images as
input have yet to catch up in size and generality with language-only models. In
this work, we ask whether language-only models can be utilised for tasks that
require visual input -- but also, as we argue, often require a strong reasoning
component. Similar to some recent related work, we make visual information
accessible to the language model using separate verbalisation models.
Specifically, we investigate the performance of open-source, open-access
language models against GPT-3 on five vision-language tasks when given
textually-encoded visual information. Our results suggest that language models
are effective for solving vision-language tasks even with limited samples. This
approach also enhances the interpretability of a model's output by providing a
means of tracing the output back through the verbalised image content.
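The setup the abstract describes, verbalising an image into text and feeding that text to a language-only model in a few-shot prompt, can be sketched roughly as follows. The `verbalise_image` and `build_few_shot_prompt` helpers, the prompt format, and the example content are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: textually encoding visual information for an LLM.
# In the paper, separate verbalisation models (e.g. captioners) produce
# this text; here the caption and object tags are supplied directly.

def verbalise_image(caption: str, objects: list[str]) -> str:
    """Encode visual information as plain text for a language-only model."""
    return f"Image description: {caption} Objects present: {', '.join(objects)}."

def build_few_shot_prompt(examples, query_image_text, question):
    """Assemble a few-shot prompt from verbalised images and QA pairs."""
    parts = []
    for img_text, q, a in examples:
        parts.append(f"{img_text}\nQuestion: {q}\nAnswer: {a}")
    # The query image gets the same textual treatment as the demonstrations.
    parts.append(f"{query_image_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

# One in-context demonstration, then the query.
demo = (verbalise_image("A dog runs on a beach.", ["dog", "beach", "waves"]),
        "What animal is shown?", "a dog")
query = verbalise_image("A red bicycle leans against a wall.", ["bicycle", "wall"])
prompt = build_few_shot_prompt([demo], query, "What colour is the bicycle?")
print(prompt)
```

Because the image only ever enters the model as text, the verbalised description also serves as the interpretability trace the abstract mentions: any answer can be checked against the exact image content the model was shown.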
Related papers
- EVLM: An Efficient Vision-Language Model for Visual Understanding [18.794601813330715]
This paper proposes an efficient multi-modal language model to minimize computational costs.
Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
arXiv Detail & Related papers (2024-07-19T10:09:51Z)
- MIVC: Multiple Instance Visual Component for Visual-Language Models [46.869139462026]
We propose MIVC, a general multiple instance visual component to bridge the gap between various image inputs with off-the-shelf vision-language models.
We show that MIVC could be plugged into the visual-language models to improve the model performance consistently on visual question answering, classification and captioning tasks.
arXiv Detail & Related papers (2023-12-28T16:33:32Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- The Less the Merrier? Investigating Language Representation in Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z)
- VLIS: Unimodal Language Models Guide Multimodal Language Generation [23.094728230459125]
We introduce Visual-Language models as Importance Sampling weights (VLIS), which combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models without further training.
VLIS improves vision-language models on diverse tasks, including commonsense understanding and complex text generation.
arXiv Detail & Related papers (2023-10-15T07:58:52Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective training paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.