Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large
Language Models
- URL: http://arxiv.org/abs/2312.01714v2
- Date: Sun, 3 Mar 2024 06:12:44 GMT
- Title: Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large
Language Models
- Authors: Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su,
Longyue Wang
- Abstract summary: Chain of Thought (CoT) approaches can be used to enhance the capability of Large Language Models (LLMs) on complex reasoning tasks.
However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored.
We introduce a novel approach that addresses this challenge by using retrieval mechanisms to automatically select demonstration examples.
- Score: 56.256069117502385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of Large Language Models (LLMs) has brought substantial
attention to the Chain of Thought (CoT) approach, primarily due to its ability
to enhance the capability of LLMs on complex reasoning tasks. Moreover, the
significance of CoT approaches extends to the application of LLMs for
multi-modal tasks. However, the selection of optimal CoT demonstration examples
in multi-modal reasoning remains less explored for LLMs due to the inherent
complexity of multi-modal examples. In this paper, we introduce a novel
approach that addresses this challenge by using retrieval mechanisms to
dynamically and automatically select demonstration examples based on
cross-modal and intra-modal similarities. Furthermore, we employ a Stratified
Sampling method that categorises demonstration examples into groups based on
their types and then retrieves examples from different groups to promote the
diversity of demonstration examples. Through a series of
experiments on two popular benchmark datasets: ScienceQA and MathVista, we
demonstrate that our approach significantly improves the performance of GPT-4
by 6% on ScienceQA and 12.9% on MathVista, and enhances the performance of
GPT-4V on the two datasets by 2.7%, substantially improving the performance of the
most advanced LLMs and LMMs for complex multi-modal reasoning tasks.
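
As an illustration of the method described in the abstract, the sketch below shows one way retrieval-based demonstration selection with stratified sampling could look. It assumes pre-computed text and image embeddings in a shared space (e.g. from a CLIP-style encoder); the function names, weights, and `category` field are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of retrieval-based demonstration
# selection with stratified sampling. Assumes text and image embeddings live
# in a shared space (e.g. from a CLIP-style encoder), so both intra-modal
# (text-text, image-image) and cross-modal (text-image) similarities apply.
from dataclasses import dataclass
import numpy as np

@dataclass
class Example:
    text_emb: np.ndarray
    image_emb: np.ndarray
    category: str  # question type used for stratification (illustrative field)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score(query: Example, cand: Example,
          w_intra: float = 0.5, w_cross: float = 0.5) -> float:
    # Intra-modal: compare like with like; cross-modal: query text vs. candidate image.
    intra = 0.5 * (cosine(query.text_emb, cand.text_emb)
                   + cosine(query.image_emb, cand.image_emb))
    cross = cosine(query.text_emb, cand.image_emb)
    return w_intra * intra + w_cross * cross

def retrieve_demonstrations(query: Example, pool: list, k_per_group: int = 2) -> list:
    """Stratified retrieval: rank candidates within each category and keep the
    top-k per category so the selected demonstrations stay diverse."""
    groups = {}
    for ex in pool:
        groups.setdefault(ex.category, []).append((score(query, ex), ex))
    selected = []
    for scored in groups.values():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(ex for _, ex in scored[:k_per_group])
    return selected
```

In such a pipeline, the retrieved examples would then be formatted as CoT demonstrations and prepended to the prompt sent to the LLM or LMM.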
Related papers
- Utilizing Large Language Models for Event Deconstruction to Enhance Multimodal Aspect-Based Sentiment Analysis [2.1329326061804816]
This paper introduces Large Language Models (LLMs) for event decomposition and proposes a reinforcement learning framework for Multimodal Aspect-based Sentiment Analysis (MABSA-RL).
Experimental results show that MABSA-RL outperforms existing advanced methods on two benchmark datasets.
arXiv Detail & Related papers (2024-10-18T03:40:45Z)
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL outperforms state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
- Large Language Models Know What Makes Exemplary Contexts [42.90814615222177]
In-context learning (ICL) has proven to be a significant capability with the advancement of Large Language Models (LLMs).
This paper presents a unified framework for LLMs that allows them to self-select influential in-context examples to compose their contexts.
arXiv Detail & Related papers (2024-08-14T12:32:41Z)
- From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning [47.82447085244952]
We show that modalities matter differently across tasks in multimodal ICL.
Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance.
arXiv Detail & Related papers (2024-07-01T01:57:21Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
- Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: Deep-Thinking and test stages.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.