Visually-augmented pretrained language models for NLP tasks without
images
- URL: http://arxiv.org/abs/2212.07937v2
- Date: Fri, 26 May 2023 14:09:49 GMT
- Title: Visually-augmented pretrained language models for NLP tasks without
images
- Authors: Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Qinyu Zhang, and Ji-Rong Wen
- Abstract summary: Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
- Score: 77.74849855049523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pre-trained language models (PLMs) have shown impressive performance
by text-only self-supervised training, they have been found to lack visual semantics
or commonsense. Existing solutions often rely on explicit images for visual
knowledge augmentation (requiring time-consuming retrieval or generation), and
they also conduct the augmentation for the whole input text, without
considering whether it is actually needed in specific inputs or tasks. To
address these issues, we propose a novel Visually-Augmented
fine-tuning approach that can be generally applied to various PLMs or NLP
tasks, Without using any retrieved or generated Images,
namely VAWI. Experimental results show that our approach can
consistently improve the performance of BERT, RoBERTa, BART, and T5 at
different scales, and outperform several competitive baselines on ten tasks.
Our code and data are publicly available
at https://github.com/RUCAIBox/VAWI.
Related papers
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- Tackling VQA with Pretrained Foundation Models without Further Training [0.0]
Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks.
Given the capabilities of these LLMs, researchers have looked into how to adapt them for Visual Question Answering (VQA).
In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem.
arXiv Detail & Related papers (2023-09-27T08:35:24Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
- Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding [13.300199242824934]
We investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning.
We propose a suite of visual language understanding tasks for probing the visual reasoning abilities of text encoder models.
We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks.
arXiv Detail & Related papers (2023-03-21T17:30:40Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [32.49636188029509]
We produce models using only text training data on four representative tasks.
We find these models perform close to models trained on images.
We showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data.
arXiv Detail & Related papers (2022-11-17T18:52:19Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.