Chain-of-Thought Prompt Distillation for Multimodal Named Entity
Recognition and Multimodal Relation Extraction
- URL: http://arxiv.org/abs/2306.14122v3
- Date: Wed, 23 Aug 2023 05:04:58 GMT
- Title: Chain-of-Thought Prompt Distillation for Multimodal Named Entity
Recognition and Multimodal Relation Extraction
- Authors: Feng Chen and Yujian Feng
- Abstract summary: We generate a chain of thought (CoT) -- a sequence of intermediate reasoning steps.
We present a novel conditional prompt distillation method to assimilate the commonsense reasoning ability from large language models.
Our approach attains state-of-the-art accuracy and offers clear advantages in interpretability, data efficiency, and cross-domain generalization.
- Score: 8.169359626365619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction
(MRE) necessitate the fundamental reasoning capacity for intricate linguistic
and multimodal comprehension. In this study, we explore distilling the
reasoning ability of large language models (LLMs) into a more compact student
model by generating a chain of thought (CoT) -- a sequence of
intermediate reasoning steps. Specifically, we commence by exemplifying the
elicitation of such reasoning ability from LLMs through CoT prompts covering
multi-grain (noun, sentence, multimodality) and data-augmentation (style,
entity, image) dimensions. Subsequently, we present a novel conditional prompt
distillation method to assimilate the commonsense reasoning ability from LLMs,
thereby enhancing the utility of the student model in addressing text-only
inputs without the requisite addition of image and CoT knowledge. Extensive
experiments reveal that our approach attains state-of-the-art accuracy and
manifests a plethora of advantages concerning interpretability, data
efficiency, and cross-domain generalization on MNER and MRE datasets.
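The abstract describes eliciting CoT reasoning from an LLM with prompts spanning multi-grain dimensions (noun, sentence, multimodality). A minimal sketch of what such a prompt builder could look like is below; the templates, function name, and example inputs are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of multi-grain CoT prompt construction.
# The templates below are assumptions for illustration; the paper's
# actual prompt wording may differ.

GRAIN_TEMPLATES = {
    "noun": 'What does the noun "{target}" refer to in: "{sentence}"? '
            "Think step by step.",
    "sentence": 'Explain, step by step, what the sentence "{sentence}" means.',
    "multimodality": 'Given the image caption "{caption}", how does it '
                     'relate to: "{sentence}"? Think step by step.',
}

def build_cot_prompts(sentence, target=None, caption=None):
    """Return one CoT-eliciting prompt per applicable grain."""
    prompts = {"sentence": GRAIN_TEMPLATES["sentence"].format(sentence=sentence)}
    if target is not None:  # noun-level grain needs a target entity mention
        prompts["noun"] = GRAIN_TEMPLATES["noun"].format(
            target=target, sentence=sentence)
    if caption is not None:  # multimodal grain needs image-derived text
        prompts["multimodality"] = GRAIN_TEMPLATES["multimodality"].format(
            caption=caption, sentence=sentence)
    return prompts

prompts = build_cot_prompts(
    sentence="Messi lifts the trophy in Qatar.",
    target="Messi",
    caption="A soccer player holding a golden cup.",
)
for grain, prompt in prompts.items():
    print(grain, "->", prompt)
```

Each prompt would be sent to the teacher LLM, and the returned reasoning chains would then serve as distillation targets for the student model.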
Related papers
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.
EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs).
We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.
We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning [21.127950337002776]
Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities.
We propose a Hierarchical Representation Learning Framework (HRLF) for the task under uncertain missing modalities.
We show that HRLF significantly improves MSA performance under uncertain modality missing cases.
arXiv Detail & Related papers (2024-11-05T04:04:41Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently attracted substantial interest, showing emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models [28.712359821231182]
Large language models (LLMs) have made remarkable strides in such multi-step reasoning on the language modality solely by leveraging the chain of thought (CoT) to mimic human thinking.
The transfer of these advancements to multimodal contexts introduces heightened challenges, including but not limited to the impractical need for labor-intensive annotation.
This study proposes a novel DDCoT prompting that maintains a critical attitude through negative-space prompting and incorporates multimodality into reasoning.
arXiv Detail & Related papers (2023-10-25T08:03:10Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.