Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large
Language Models
- URL: http://arxiv.org/abs/2312.01714v2
- Date: Sun, 3 Mar 2024 06:12:44 GMT
- Title: Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large
Language Models
- Authors: Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su,
Longyue Wang
- Abstract summary: Chain of Thought (CoT) approaches can be used to enhance the capability of Large Language Models (LLMs) on complex reasoning tasks.
However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored.
We introduce a novel approach that addresses this challenge by using retrieval mechanisms to automatically select demonstration examples.
- Score: 56.256069117502385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advancement of Large Language Models (LLMs) has brought substantial
attention to the Chain of Thought (CoT) approach, primarily due to its ability
to enhance the capability of LLMs on complex reasoning tasks. Moreover, the
significance of CoT approaches extends to the application of LLMs for
multi-modal tasks. However, the selection of optimal CoT demonstration examples
in multi-modal reasoning remains less explored for LLMs due to the inherent
complexity of multi-modal examples. In this paper, we introduce a novel
approach that addresses this challenge by using retrieval mechanisms to
dynamically and automatically select demonstration examples based on
cross-modal and intra-modal similarities. Furthermore, we employ a Stratified
Sampling method that categorises demonstration examples into groups by type
and then retrieves examples from each group to promote the diversity of the
selected demonstrations. Through a series of experiments on two popular
benchmark datasets, ScienceQA and MathVista, we demonstrate that our approach
significantly improves the performance of GPT-4 by 6% on ScienceQA and 12.9%
on MathVista, and enhances the performance of GPT-4V by 2.7% on both datasets,
substantially improving the performance of the most advanced LLMs and LMMs on
complex multi-modal reasoning tasks.
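The selection step described in the abstract can be illustrated with a minimal
sketch (not the authors' code): assuming each candidate demonstration carries
precomputed text and image embeddings in a shared space (e.g., CLIP-style)
plus a coarse type label, intra-modal (text-text) and cross-modal (text-image)
cosine similarities are mixed and the top-scoring examples are drawn per type
group. The weighting `alpha`, the field names, and `k_per_group` are
illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' implementation): retrieval-based
# demonstration selection with stratified sampling. Candidates are dicts with
# precomputed "text_emb"/"image_emb" vectors and a coarse "type" label.
import numpy as np
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_demonstrations(query, candidates, k_per_group=1, alpha=0.5):
    """Rank candidates by a weighted mix of intra-modal (text-text) and
    cross-modal (text-image) similarity, then keep the top-k from each
    type group so the demonstration set stays diverse.

    `alpha` balances the two similarity terms; the actual weighting used in
    the paper is not reproduced here.
    """
    groups = defaultdict(list)
    for cand in candidates:
        intra = cosine(query["text_emb"], cand["text_emb"])
        cross = cosine(query["text_emb"], cand["image_emb"])
        score = alpha * intra + (1.0 - alpha) * cross
        groups[cand["type"]].append((score, cand))

    # Stratified sampling: retrieve the best-scoring examples from each group
    # rather than globally, which is what enforces diversity across types.
    selected = []
    for scored in groups.values():
        scored.sort(key=lambda x: x[0], reverse=True)
        selected.extend(cand for _, cand in scored[:k_per_group])
    return selected
```

Retrieving per group rather than from a single global ranking is the point of
the stratified step: a globally top-k selection could fill every slot with
examples of one question type.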
Related papers
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs).
We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs.
We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z) - Hint Marginalization for Improved Reasoning in Large Language Models [24.67507932821155]
We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of Large Language Models (LLMs).
Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers.
Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.
arXiv Detail & Related papers (2024-12-17T19:45:53Z) - FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [64.50893177169996]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) expands the scope of training data by including private data sources.
We introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios.
We develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z) - Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization process to enhance the multimodal reasoning capabilities of MLLMs.
We develop a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
Our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B.
arXiv Detail & Related papers (2024-11-15T18:59:27Z) - From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning [47.82447085244952]
We show that modalities matter differently across tasks in multimodal ICL.
Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance.
arXiv Detail & Related papers (2024-07-01T01:57:21Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm of model composition for existing MLLMs, creating a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning (ICL) in large language models (LLMs).
Specifically, our framework divides the ICL process into two distinct stages: a Deep-Thinking stage and a test stage.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z)