Improving the Efficiency of Visually Augmented Language Models
- URL: http://arxiv.org/abs/2409.11148v1
- Date: Tue, 17 Sep 2024 13:02:19 GMT
- Title: Improving the Efficiency of Visually Augmented Language Models
- Authors: Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
- Abstract summary: This paper shows that explicit images are not necessary to visually augment an LM.
Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system.
We show that BLIND-VALM performs on par with VALM for Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks.
- Score: 5.948051066733892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the impressive performance of autoregressive Language Models (LMs), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e. they do not know much about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM on Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, despite being significantly more efficient and simpler. We also show that when scaling up our model within the compute budget of VALM, by increasing either the model size or the pre-training corpus size, we outperform VALM on all evaluation tasks.
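As the abstract describes, BLIND-VALM replaces retrieved or generated images with visually-grounded text representations taken from CLIP's text encoder. Below is a minimal sketch of that core step only, assuming the HuggingFace transformers library and the public openai/clip-vit-base-patch32 checkpoint; the actual integration into the VALM backbone is not reproduced here.

```python
# Minimal sketch (not the authors' code): obtain visually-grounded text
# representations from CLIP's text encoder, which BLIND-VALM uses in place
# of retrieved image features. Checkpoint name and pooling choice are
# illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def grounded_text_representation(text: str) -> torch.Tensor:
    """Return a CLIP text-encoder embedding for `text` (no image required)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    # Pooled embedding of the text; in a VALM-style model this vector (or the
    # per-token hidden states) would feed the fusion layer instead of
    # retrieved-image features.
    return outputs.pooler_output.squeeze(0)

emb = grounded_text_representation("a red double-decker bus on a rainy street")
print(emb.shape)  # torch.Size([512]) for the base CLIP text encoder
```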
Related papers
- PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs.
Our model maintains excellent generation even when half of the visual tokens are reduced.
arXiv Detail & Related papers (2024-10-30T15:05:17Z)
- Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We use an auxiliary model such as CLIP to generate an attention heatmap for the input image conditioned on the text query.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
arXiv Detail & Related papers (2024-09-25T17:59:13Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach wherein visual prompts are memorized in the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training.
VLMAE employs visual generative learning, helping the model acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling (a rough sketch of such a layer follows this list).
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
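To make the mechanism referenced in the VaLM entry above concrete, here is a rough, hypothetical sketch of a visual knowledge fusion layer: LM token states cross-attend to a small set of grounding vectors (retrieved image embeddings in VALM, CLIP text-encoder embeddings in BLIND-VALM). The class name, dimensions, and single-attention-block design are illustrative assumptions, not the published architecture.

```python
# Hypothetical sketch of a "visual knowledge fusion layer" in the spirit of
# VaLM: token hidden states cross-attend to grounding vectors. Names and
# dimensions are illustrative, not the published design.
import torch
import torch.nn as nn

class VisualKnowledgeFusion(nn.Module):
    def __init__(self, d_model: int = 768, d_ground: int = 512, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_ground, d_model)   # map grounding vectors to LM width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq_len, d_model)  LM token states
        # grounding: (batch, k, d_ground)       k grounding vectors per context
        kv = self.proj(grounding)
        fused, _ = self.attn(query=hidden, key=kv, value=kv)
        return self.norm(hidden + fused)           # residual fusion

fusion = VisualKnowledgeFusion()
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 4, 512))
print(out.shape)  # torch.Size([2, 16, 768])
```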
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.