Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
- URL: http://arxiv.org/abs/2410.03062v1
- Date: Fri, 4 Oct 2024 00:55:15 GMT
- Title: Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
- Authors: Grant Wardle, Teo Susnjak,
- Abstract summary: This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs)
For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy.
In more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper examines how the sequencing of images and text within multi-modal prompts influences the reasoning performance of large language models (LLMs). We performed empirical evaluations using three commercial LLMs. Our results demonstrate that the order in which modalities are presented can significantly affect performance, particularly in tasks of varying complexity. For simpler tasks involving a single image, modality sequencing had a clear impact on accuracy. However, in more complex tasks involving multiple images and intricate reasoning steps, the effect of sequencing diminished, likely due to the increased cognitive demands of the task. Our findings also highlight the importance of question/prompt structure. In nested and multi-step reasoning tasks, modality sequencing played a key role in shaping model performance. While LLMs excelled in the initial stages of reasoning, they struggled to re-incorporate earlier information, underscoring the challenges of multi-hop reasoning within transformer architectures. This suggests that aligning the sequence of modalities with the logical flow of reasoning steps is more critical than modality order alone. These insights offer valuable implications for improving multi-modal prompt design, with broader applications across fields such as education, medical imaging, and cross-modal learning.
Related papers
- Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability [12.349247962800813]
Large language models (LLMs) have emerged as powerful tools for many AI problems.
They exhibit remarkable in-context learning (ICL) capabilities.
How they approach composite tasks remains an open and largely underexplored question.
arXiv Detail & Related papers (2024-07-22T15:22:34Z) - Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models [14.765057045747753]
Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks.
We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step.
IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs.
arXiv Detail & Related papers (2024-05-22T17:56:51Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Self-Convinced Prompting: Few-Shot Question Answering with Repeated
Introspection [13.608076739368949]
We introduce a novel framework that harnesses the potential of large-scale pre-trained language models.
Our framework processes the output of a typical few-shot chain-of-thought prompt, assesses the correctness of the response, scrutinizes the answer, and ultimately produces a new solution.
arXiv Detail & Related papers (2023-10-08T06:36:26Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z) - Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models [81.01397924280612]
Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations.
We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains.
arXiv Detail & Related papers (2023-04-23T13:54:39Z) - Active Prompting with Chain-of-Thought for Large Language Models [26.5029080638055]
This paper proposes a new method, Active-Prompt, to adapt large language models to different tasks.
By borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics to characterize the uncertainty.
Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art on eight complex reasoning tasks.
arXiv Detail & Related papers (2023-02-23T18:58:59Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal
Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented
Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS) that exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.