Imagination-Augmented Natural Language Understanding
- URL: http://arxiv.org/abs/2204.08535v1
- Date: Mon, 18 Apr 2022 19:39:36 GMT
- Title: Imagination-Augmented Natural Language Understanding
- Authors: Yujie Lu, Wanrong Zhu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
- Abstract summary: We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from the powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
- Score: 71.51687221130925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human brains integrate linguistic and perceptual information simultaneously
to understand natural language, and hold the critical ability to render
imaginations. Such abilities enable us to construct new abstract concepts or
concrete objects, and are essential for applying practical knowledge to solve
problems in low-resource scenarios. However, most existing methods for Natural
Language Understanding (NLU) are mainly focused on textual signals. They do not
simulate human visual imagination ability, which hinders models from inferring
and learning efficiently from limited data samples. Therefore, we introduce an
Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language
understanding tasks from a novel learning perspective -- imagination-augmented
cross-modal understanding. iACE enables visual imagination with external
knowledge transferred from the powerful generative and pre-trained
vision-and-language models. Extensive experiments on GLUE and SWAG show that
iACE achieves consistent improvement over visually-supervised pre-trained
models. More importantly, results in extreme and normal few-shot settings
validate the effectiveness of iACE in low-resource natural language
understanding circumstances.
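As a rough illustration of the imagination-augmented idea (not the authors' released iACE code), the sketch below pairs an off-the-shelf text-to-image generator with CLIP as the cross-modal encoder and feeds the concatenated text-and-image features to a plain linear classification head. The specific checkpoints and the fusion head are assumptions made for the example.

```python
# Minimal sketch of imagination-augmented classification (NOT the authors' iACE code).
# Assumptions: Stable Diffusion as the "imagination" generator, CLIP as the
# cross-modal encoder, and a plain linear probe as the fusion/classification head.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Text-to-image "imagination": render a picture for the input sentence.
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)

# 2) Cross-modal encoder: CLIP gives aligned text and image embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def imagination_features(sentence: str) -> torch.Tensor:
    """Return concatenated [text ; imagined-image] features for one sentence."""
    image = generator(sentence, num_inference_steps=25).images[0]
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.cat([text_emb, image_emb], dim=-1)  # shape: (1, 2 * projection_dim)

# 3) A small classification head over the fused features (e.g. for a GLUE-style task).
num_labels = 2
head = torch.nn.Linear(2 * clip.config.projection_dim, num_labels).to(device)
logits = head(imagination_features("A man is playing a guitar on stage."))
print(logits.shape)  # torch.Size([1, 2])
```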
Related papers
- Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics [8.749640179057469]
We use linear decoding over the learned representations of unimodal vision, unimodal language, and multimodal deep neural network (DNN) models to predict human beauty ratings of naturalistic images.
We show that unimodal vision models (e.g. SimCLR) account for the vast majority of explainable variance in these ratings. Language-aligned vision models (e.g. SLIP) yield small gains relative to unimodal vision.
Taken together, these results suggest that whatever words we may eventually find to describe our experience of beauty, the ineffable computations of feedforward perception may provide sufficient foundation for that experience.
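A minimal sketch of this linear-decoding protocol, assuming precomputed frozen-network activations (the random placeholders below stand in for e.g. SimCLR or SLIP embeddings of the rated images) and cross-validated ridge regression as the linear read-out:

```python
# Sketch of a linear-decoding analysis (not the authors' code): frozen network
# activations -> cross-validated ridge regression -> predicted beauty ratings.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_images, feat_dim = 900, 2048                     # one embedding per rated photograph
features = rng.normal(size=(n_images, feat_dim))   # placeholder activations
ratings = rng.uniform(1, 7, size=n_images)         # placeholder mean beauty ratings

decoder = RidgeCV(alphas=np.logspace(-2, 4, 13))   # linear read-out with tuned penalty
r2_scores = cross_val_score(decoder, features, ratings, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2_scores.mean():.3f}")  # ~0 on random features; real
                                                       # embeddings explain far more
```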
arXiv Detail & Related papers (2024-10-31T03:37:21Z) - Contextual Emotion Recognition using Large Vision Language Models [0.6749750044497732]
Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
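The paper's exact prompt is not reproduced here; the snippet below is a hypothetical prompt template in the same spirit, asking the model to draw a text-grid visualization of its state after every reasoning step:

```python
# Hypothetical Visualization-of-Thought-style prompt (an illustrative paraphrase,
# not the paper's exact wording): the model is asked to render its current
# spatial state as ASCII art after each reasoning step before continuing.
TASK = (
    "You start at the top-left cell of a 3x3 grid. "
    "Move right, then down, then down. Which cell are you in?"
)

VOT_PROMPT = f"""{TASK}

Solve this step by step. After EACH step, visualize the current state of the
grid as ASCII art (use '.' for empty cells and 'X' for your position), then
continue reasoning from that visualization. Finish with 'Answer: <cell>'."""

print(VOT_PROMPT)  # send this to any chat-style LLM to elicit interleaved
                   # reasoning steps and text-based visualizations
```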
arXiv Detail & Related papers (2024-04-04T17:45:08Z) - Analyzing the Roles of Language and Vision in Learning from Limited Data [31.895396236504993]
We study the contributions that language and vision make to learning about the world.
We find that a language model leveraging all components recovers a majority of a Vision-Language Model's performance.
arXiv Detail & Related papers (2024-02-15T22:19:41Z) - Tackling Vision Language Tasks Through Learning Inner Monologues [10.795616787372625]
We propose a novel approach, Inner Monologue Multi-Modal Optimization (IMMO), to solve complex vision language problems.
IMMO simulates inner monologue, a cognitive process in which an individual engages in silent verbal communication with themselves.
The results suggest IMMO can enhance reasoning and explanation abilities, contributing to the more effective fusion of vision and language models.
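A toy version of such a question-and-answer inner monologue is sketched below; `ask_lm` and `answer_vqa` are hypothetical stand-ins for real LLM and VQA model calls, so this illustrates the idea rather than IMMO's actual training procedure.

```python
# Illustrative inner-monologue loop (not IMMO's code): a language model repeatedly
# "asks itself" questions about the image, a VQA model answers, and the accumulated
# dialogue conditions the final prediction.
from typing import List

def ask_lm(task: str, dialogue: List[str]) -> str:
    """Hypothetical LLM call: propose the next clarifying question (or 'DONE')."""
    return "DONE" if len(dialogue) >= 4 else f"Question {len(dialogue) // 2 + 1} about the image?"

def answer_vqa(image_path: str, question: str) -> str:
    """Hypothetical VQA call: answer one question about the image."""
    return "a plausible visual answer"

def inner_monologue(task: str, image_path: str, max_turns: int = 5) -> List[str]:
    dialogue: List[str] = []
    for _ in range(max_turns):
        question = ask_lm(task, dialogue)
        if question == "DONE":                    # the LLM decides it has enough evidence
            break
        dialogue.append(f"Q: {question}")
        dialogue.append(f"A: {answer_vqa(image_path, question)}")
    return dialogue                               # fed back to the LLM for the final answer

print(inner_monologue("Is the person in the photo about to fall?", "photo.jpg"))
```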
arXiv Detail & Related papers (2023-08-19T10:10:49Z) - Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
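A toy latent world model in this spirit (not Dynalang itself) can be sketched as a recurrent network trained to predict the next timestep's fused text-and-image representation; all dimensions and the fusion scheme below are illustrative assumptions.

```python
# Toy multimodal world model: fuse text and image features at each timestep and
# train a recurrent core to predict the NEXT timestep's fused representation.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, text_dim=128, image_dim=128, latent_dim=256):
        super().__init__()
        self.fuse = nn.Linear(text_dim + image_dim, latent_dim)  # joint encoder
        self.core = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.predict_next = nn.Linear(latent_dim, latent_dim)    # forward-prediction head

    def forward(self, text_feats, image_feats):
        z = torch.tanh(self.fuse(torch.cat([text_feats, image_feats], dim=-1)))
        h, _ = self.core(z)                              # (batch, time, latent_dim)
        return self.predict_next(h[:, :-1]), z[:, 1:]    # predicted vs. actual next latents

model = TinyWorldModel()
text = torch.randn(4, 10, 128)     # a batch of 10-step text-feature sequences
image = torch.randn(4, 10, 128)    # matching image-feature sequences
pred, target = model(text, image)
loss = nn.functional.mse_loss(pred, target.detach())    # future-prediction objective
loss.backward()
```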
arXiv Detail & Related papers (2023-07-31T17:57:49Z) - Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination [57.49336064527538]
We develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities.
We leverage two complementary types of ''imaginations'': (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation.
Jointly exploiting the language inputs and the imagination, a pretrained vision-language model eventually composes a zero-shot solution to the original language tasks.
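A rough sketch of this two-channel zero-shot scoring (not the Z-LaVI release): answer candidates are scored by CLIP similarity against imagined images and mixed with language-model scores. The placeholder image, placeholder LM scores, and the mixing weight are assumptions made for the example.

```python
# Sketch of imagination-fueled zero-shot answer scoring. The imagined images are
# assumed to come from retrieval or text-to-image generation; here a blank image
# and fixed LM probabilities stand in so the example runs on its own.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visual_scores(imagined_images, candidates):
    """CLIP image-text similarity of each candidate, averaged over imaginations."""
    inputs = processor(text=candidates, images=imagined_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = clip(**inputs).logits_per_text       # (num_candidates, num_images)
    return sims.softmax(dim=0).mean(dim=1)          # one score per candidate

candidates = ["a guitar", "a violin", "a drum kit"]
imagined = [Image.new("RGB", (224, 224))]           # placeholder for retrieved/generated images
lm_scores = torch.tensor([0.5, 0.3, 0.2])           # placeholder language-model probabilities

alpha = 0.5                                         # mixing weight (a free hyperparameter)
final = alpha * lm_scores + (1 - alpha) * visual_scores(imagined, candidates)
print(candidates[final.argmax().item()])
```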
arXiv Detail & Related papers (2022-10-21T21:33:10Z) - Visualizing and Explaining Language Models [0.0]
Natural Language Processing has become, after Computer Vision, the second field of Artificial Intelligence to be transformed by deep learning.
This paper showcases the techniques used in some of the most popular Deep Learning for NLP visualizations, with a special focus on interpretability and explainability.
arXiv Detail & Related papers (2022-04-30T17:23:33Z) - WenLan 2.0: Make AI Imagine via a Multimodal Foundation Model [74.4875156387271]
We develop a novel foundation model pre-trained with huge multimodal (visual and textual) data.
We show that state-of-the-art results can be obtained on a wide range of downstream tasks.
arXiv Detail & Related papers (2021-10-27T12:25:21Z) - Vision and Language: from Visual Perception to Content Creation [100.36776435627962]
"vision to language" is probably one of the most popular topics in the past five years.
This paper reviews the recent advances along these two dimensions: "vision to language" and "language to vision"
arXiv Detail & Related papers (2019-12-26T14:07:20Z)