The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models
- URL: http://arxiv.org/abs/2401.12117v2
- Date: Tue, 13 Feb 2024 07:04:30 GMT
- Title: The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models
- Authors: Kian Ahrabian, Zhivar Sourati, Kexuan Sun, Jiarui Zhang, Yifan Jiang, Fred Morstatter, Jay Pujara
- Abstract summary: Multi-modal large language models (MLLMs) integrate verbal and visual information.
Despite the revolutionary prospect of MLLMs, our understanding of their reasoning abilities is limited.
- Score: 20.177263185773153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large language models (LLMs) are still being adopted in new domains and
utilized in novel applications, we are experiencing an influx of a new
generation of foundation models, namely multi-modal large language models
(MLLMs). These models integrate verbal and visual information, opening new
possibilities to demonstrate more complex reasoning abilities at the
intersection of the two modalities. However, despite the revolutionary
prospect of MLLMs, our understanding of their reasoning abilities is limited.
In this study, we assess the nonverbal abstract reasoning abilities of
open-source and closed-source MLLMs using variations of Raven's Progressive
Matrices. Our experiments expose the difficulty of solving such problems while
showcasing the immense gap between open-source and closed-source models. We
also reveal critical shortcomings with individual visual and textual modules,
subjecting the models to low-performance ceilings. Finally, to improve MLLMs'
performance, we experiment with various methods, such as Chain-of-Thought
prompting, resulting in a significant (up to 100%) boost in performance.
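To make the Chain-of-Thought intervention concrete, below is a minimal sketch of how one might score an MLLM on an RPM-style puzzle with and without a CoT prompt. The prompts, the `query_mllm` helper, and the answer-parsing heuristic are illustrative assumptions, not the authors' evaluation code.

```python
# Minimal sketch (not the authors' code): score an MLLM on RPM-style puzzles
# with and without Chain-of-Thought prompting. `query_mllm` is a hypothetical
# placeholder for whichever open- or closed-source model API is under test.

DIRECT_PROMPT = (
    "The image shows a 3x3 abstract pattern with the bottom-right cell missing, "
    "followed by eight candidate answers labeled 1-8. Reply with the number of "
    "the candidate that completes the pattern."
)

COT_PROMPT = (
    "The image shows a 3x3 abstract pattern with the bottom-right cell missing, "
    "followed by eight candidate answers labeled 1-8. First describe the rule "
    "that changes across each row and column, then reason step by step about "
    "which candidate satisfies it, and finish with a single number from 1 to 8."
)

def query_mllm(image, prompt):
    """Placeholder for a call to the MLLM under test; swap in a real client."""
    raise NotImplementedError

def evaluate(puzzles, use_cot=False):
    """Accuracy over (image, correct_choice) pairs, with correct_choice in 1..8."""
    prompt = COT_PROMPT if use_cot else DIRECT_PROMPT
    correct = 0
    for image, answer in puzzles:
        reply = query_mllm(image=image, prompt=prompt)
        # Crude heuristic: take the last digit-like token as the model's choice.
        digits = [t.strip(".,)") for t in reply.split() if t.strip(".,)").isdigit()]
        predicted = int(digits[-1]) if digits else -1
        correct += int(predicted == answer)
    return correct / len(puzzles)
```

Comparing `evaluate(puzzles)` against `evaluate(puzzles, use_cot=True)` is the kind of before/after measurement the abstract's reported boost refers to.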
Related papers
- Explaining Multi-modal Large Language Models by Analyzing their Vision Perception [4.597864989500202]
This research proposes a novel approach to enhance the interpretability of MLLMs by focusing on the image embedding component.
We combine an open-world localization model with an MLLM, thus creating a new architecture able to simultaneously produce text and object localization outputs from the same vision embedding.
arXiv Detail & Related papers (2024-05-23T14:24:23Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the vocabulary of large multi-modal models (LMMs).
We further explore the distribution of visual features in the semantic space within LMMs and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- Model Composition for Multimodal Large Language Models [73.70317850267149]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emergent in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute cost and performance when scaling vision-language models (a minimal sketch of a sparsely gated MoE layer follows this list).
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
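As a rough illustration of the sparsely gated mixture-of-experts idea referenced in the last entry above, here is a minimal PyTorch sketch of a top-k routed expert layer; the layer sizes, gating scheme, and loop-based dispatch are simplifications for clarity and are not the architecture from the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts feed-forward layer.

    Each token is routed to k of num_experts expert MLPs, so per-token compute
    stays close to a single dense MLP while total parameters grow with the
    number of experts.
    """

    def __init__(self, dim, hidden, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                   # x: (tokens, dim)
        scores = self.gate(x)                               # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens of width 512 routed through 8 experts with top-2 gating.
layer = SparseMoE(dim=512, hidden=2048)
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```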
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content above) and is not responsible for any consequences of its use.