Beyond Language: Learning Commonsense from Images for Reasoning
- URL: http://arxiv.org/abs/2010.05001v1
- Date: Sat, 10 Oct 2020 13:47:13 GMT
- Title: Beyond Language: Learning Commonsense from Images for Reasoning
- Authors: Wanqing Cui, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng
- Abstract summary: This paper proposes a novel approach to learn commonsense from images, instead of limited raw texts or costly constructed knowledge bases.
Our motivation comes from the fact that an image is worth a thousand words, where richer scene information could be leveraged to help distill the commonsense knowledge.
- Score: 78.33934895163736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel approach to learn commonsense from images,
instead of limited raw texts or costly constructed knowledge bases, for the
commonsense reasoning problem in NLP. Our motivation comes from the fact that
an image is worth a thousand words, where richer scene information could be
leveraged to help distill the commonsense knowledge, which is often hidden in
languages. Our approach, namely Loire, consists of two stages. In the first
stage, a bi-modal sequence-to-sequence approach is utilized to conduct the
scene layout generation task, based on a text representation model ViBERT. In
this way, the required visual scene knowledge, such as spatial relations, will
be encoded in ViBERT by the supervised learning process with some bi-modal data
like COCO. Then ViBERT is concatenated with a pre-trained language model to
perform the downstream commonsense reasoning tasks. Experimental results on two
commonsense reasoning problems, i.e. commonsense question answering and pronoun
resolution, demonstrate that Loire outperforms traditional language-based
methods. We also give some case studies to show what knowledge is learned from
images and explain how the generated scene layout helps the commonsense
reasoning process.
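To make the two-stage design concrete, below is a minimal sketch of how the second stage could be wired up: a representation from a ViBERT-style scene-aware encoder (trained separately on the scene layout generation task) is concatenated with a pre-trained language model's [CLS] vector and passed to a scoring head for answer-candidate ranking. The module and parameter names here are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of Loire's second stage: features from a separately
# trained ViBERT-style visual-scene encoder are concatenated with a
# pre-trained language model's sentence representation, and a linear head
# scores each answer candidate. Names are illustrative, not the paper's code.
import torch
import torch.nn as nn
from transformers import BertModel


class LoireClassifier(nn.Module):
    def __init__(self, vibert_encoder, vibert_dim=768):
        super().__init__()
        self.lm = BertModel.from_pretrained("bert-base-uncased")
        self.vibert = vibert_encoder  # assumed scene-aware text encoder (frozen or fine-tuned)
        self.scorer = nn.Linear(self.lm.config.hidden_size + vibert_dim, 1)

    def forward(self, input_ids, attention_mask):
        # [CLS] representation from the pre-trained language model
        lm_cls = self.lm(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        # Scene-knowledge representation from the ViBERT-style encoder
        # (assumed to accept the same tokenized input and return (batch, vibert_dim))
        vib_cls = self.vibert(input_ids, attention_mask)
        fused = torch.cat([lm_cls, vib_cls], dim=-1)
        return self.scorer(fused).squeeze(-1)  # one score per answer candidate
```

In this reading, the commonsense-QA or pronoun-resolution task is cast as scoring each candidate answer's text; the exact fusion and training details belong to the paper itself.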
Related papers
- WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge [12.034917651508524]
WAVER is a cross-domain knowledge distillation framework via vision-language models.
WAVER capitalizes on the open-vocabulary properties that lie in pre-trained vision-language models.
It can achieve state-of-the-art performance on the text-video retrieval task while handling writing-style variations.
arXiv Detail & Related papers (2023-12-15T03:17:37Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
Extensive experiments on two corresponding datasets show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it? [0.0]
We propose a method for evaluating visual knowledge transfer to text for uni- or multimodal language models.
We find that our method can successfully be used to measure visual knowledge transfer capabilities in models.
arXiv Detail & Related papers (2021-09-23T12:11:23Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
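The VidLanKD entry above describes a teacher-student recipe: train a multi-modal teacher on video-text data, then distill its knowledge into a text-only student language model. As a rough, hedged illustration (using generic soft-label distillation rather than the paper's exact objective, and assuming HuggingFace-style masked-LM models that expose `.logits`), a single distillation step could look like this:

```python
# Generic teacher-to-student distillation sketch (not VidLanKD's exact
# objective): the student matches the teacher's soft token-level
# distribution on a text-only corpus via temperature-scaled KL divergence.
import torch
import torch.nn.functional as F


def distill_step(teacher, student, input_ids, attention_mask, temperature=2.0):
    """One distillation step; `teacher` and `student` are assumed to be
    masked-LM-style models returning logits of shape (batch, seq, vocab)."""
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids,
                           attention_mask=attention_mask).logits
    s_logits = student(input_ids=input_ids,
                       attention_mask=attention_mask).logits
    # KL divergence between temperature-softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return loss
```

The teacher here would have been trained with video supervision, while the student sees only text, which is the general pattern the listed vision-to-language transfer papers share.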
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.