A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues
- URL: http://arxiv.org/abs/2305.04530v1
- Date: Mon, 8 May 2023 08:05:40 GMT
- Title: A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues
- Authors: Yunxin Li, Baotian Hu, Xinyu Chen, Yuxin Ding, Lin Ma, and Min Zhang
- Abstract summary: Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
Extensive experiments on two corresponding datasets show significantly improved performance.
- Score: 23.743431157431893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conditional inference on joint textual and visual clues is a multi-modal
reasoning task in which textual clues provide prior premises or external knowledge
that complement the visual content and are pivotal to deducing the correct option.
Previous methods utilizing pretrained vision-language models (VLMs) have achieved
impressive performance, yet they lack multimodal context reasoning capability,
especially for text-modal information. To address this issue, we propose a
Multi-modal Context Reasoning approach, named ModCR. Whereas VLMs perform reasoning
via cross-modal semantic alignment, ModCR regards the given abstract textual
semantics and objective image information as pre-context and embeds them into the
language model to perform context reasoning. Unlike recent vision-aided language
models used in natural language processing, ModCR incorporates multi-view semantic
alignment information between language and vision by introducing a learnable
alignment prefix between image and text in the pretrained language model. This makes
the language model well suited to multi-modal reasoning scenarios with joint textual
and visual clues. We conduct extensive experiments on two corresponding datasets,
and the results show significantly improved performance (an exact gain of 4.8% on
the PMR test set) over previous strong baselines. Code Link:
https://github.com/YunxinLi/Multimodal-Context-Reasoning
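
The abstract describes ModCR's mechanism only at a high level. The sketch below is one plausible reading of it, assuming a RoBERTa-style language model, pre-extracted image features of dimension image_dim, and a hypothetical class name ModCRSketch with prefix length prefix_len; none of these names or sizes come from the paper, and the authors' actual implementation is in the repository linked above.

```python
# Minimal sketch, NOT the authors' code: it only illustrates the idea of feeding
# projected image features plus a learnable image-text alignment prefix into a
# pretrained language model as pre-context for premise-based reasoning.
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification


class ModCRSketch(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, lm_name="roberta-base", image_dim=2048, prefix_len=4):
        super().__init__()
        # Scores one (image, premise, candidate answer) triple at a time.
        self.lm = AutoModelForSequenceClassification.from_pretrained(lm_name, num_labels=1)
        hidden = self.lm.config.hidden_size
        # Project pre-extracted visual features (e.g. region/patch vectors) into the LM space.
        self.visual_proj = nn.Linear(image_dim, hidden)
        # Learnable alignment prefix placed between the image and text pre-context.
        self.align_prefix = nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)

    def forward(self, image_feats, input_ids, attention_mask):
        # image_feats: (B, R, image_dim); input_ids/attention_mask: tokenized premise + candidate.
        text_emb = self.lm.get_input_embeddings()(input_ids)           # (B, T, H)
        vis_ctx = self.visual_proj(image_feats)                        # (B, R, H)
        prefix = self.align_prefix.unsqueeze(0).expand(text_emb.size(0), -1, -1)
        # Pre-context (image features + alignment prefix) precedes the textual clue.
        inputs_embeds = torch.cat([vis_ctx, prefix, text_emb], dim=1)
        ctx_mask = torch.ones(
            inputs_embeds.size(0), vis_ctx.size(1) + prefix.size(1),
            dtype=attention_mask.dtype, device=attention_mask.device)
        mask = torch.cat([ctx_mask, attention_mask], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds, attention_mask=mask)
        return out.logits  # (B, 1) plausibility score per candidate answer
```

In such a setup one would typically score each answer candidate separately and train with a cross-entropy loss over the candidates; how the alignment prefix is actually learned (the multi-view alignment objective) is specified in the paper and repository, not in this sketch.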
Related papers
- Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation [5.528860524494717]
This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation.
Representing text and audio as gloss notations that omit complex semantic nuances can potentially yield better alignment with images.
arXiv Detail & Related papers (2024-10-04T04:59:50Z)
- Improving Visual Commonsense in Language Models via Multiple Image Generation [41.565399860320966]
Existing large language models (LLMs) are primarily trained using textual data only.
Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning.
This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning.
arXiv Detail & Related papers (2024-06-19T15:17:10Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
- Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts.
Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head.
We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
- FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues [20.587249765287183]
Feature Swapping Multi-modal Reasoning (FSMR) model is designed to enhance multi-modal reasoning through feature swapping.
FSMR incorporates a multi-modal cross-attention mechanism, facilitating the joint modeling of textual and visual information.
Experiments on the PMR dataset demonstrate FSMR's superiority over state-of-the-art baseline models.
arXiv Detail & Related papers (2024-03-29T07:28:50Z)
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
The diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)