Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- URL: http://arxiv.org/abs/2310.00647v2
- Date: Mon, 22 Jan 2024 18:53:48 GMT
- Title: Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord
- Abstract summary: We evaluate 10 recent open-source LMMs, from 3B up to 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
We explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
- Score: 105.77733287326308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success of Large Language Models (LLMs), Large Multimodal
Models (LMMs), such as the Flamingo model and its subsequent competitors, have
started to emerge as natural steps towards generalist agents. However,
interacting with recent LMMs reveals major limitations that are hardly captured
by the current evaluation benchmarks. Indeed, task performances (e.g., VQA
accuracy) alone do not provide enough clues to understand their real
capabilities and limitations, or the extent to which such models are aligned
with human expectations. To refine our understanding of those flaws, we deviate from the
current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from
3B up to 80B parameter scale, on 5 different axes: hallucinations, abstention,
compositionality, explainability and instruction following. Our evaluation on
these axes reveals major flaws in LMMs. While the current go-to solution to
align these models is based on training, such as instruction tuning or RLHF, we
instead (2) explore training-free in-context learning (ICL) as a solution,
and study how it affects these limitations. Based on our ICL study, (3) we push
ICL further and propose new multimodal ICL variants such as Multitask-ICL,
Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows.
(1) Despite their success, LMMs have flaws that remain unsolved with scaling
alone. (2) The effect of ICL on LMM flaws is nuanced: while ICL is effective for
improving explainability and answer abstention, it only slightly improves
instruction following, does not improve compositional abilities, and can even
amplify hallucinations. (3) The proposed ICL variants are
promising as post-hoc approaches to efficiently tackle some of those flaws. The
code is available here: https://github.com/mshukor/EvALign-ICL.
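To make the ICL setup concrete, here is a minimal, library-agnostic sketch of how interleaved image-text demonstrations can be assembled into a few-shot prompt for a Flamingo-style LMM. The Demo structure, the "<image>" placeholder, and the Question/Answer template are illustrative assumptions rather than the paper's exact prompt format; the official prompts and ICL variants are implemented in the EvALign-ICL repository linked above.

```python
# Minimal sketch of multimodal in-context learning (ICL): interleave a few
# (image, question, answer) demonstrations before the query.
# The "<image>" placeholder and the Question/Answer template are illustrative
# assumptions; the paper's exact prompts live in the EvALign-ICL repository.

from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    image_path: str  # path or URL of the demonstration image
    question: str    # question asked about that image
    answer: str      # gold answer shown to the model in-context


def build_icl_prompt(demos: List[Demo], query_question: str) -> str:
    """Build the text side of an interleaved image-text few-shot prompt.

    The images (demo images first, query image last) are passed to the LMM
    in the same order as the "<image>" placeholders appear in the text.
    """
    parts = [f"<image>Question: {d.question} Answer: {d.answer}" for d in demos]
    parts.append(f"<image>Question: {query_question} Answer:")  # query, answer left blank
    return "\n".join(parts)


if __name__ == "__main__":
    demos = [
        Demo("demo_dog.jpg", "What animal is shown?", "A dog."),
        Demo("demo_blur.jpg", "What brand is the car?", "I cannot tell from the image."),
    ]
    print(build_icl_prompt(demos, "How many cats are in the picture?"))
```

The second demonstration shows how an abstention-style answer can be injected in-context, which is the kind of behaviour the abstention axis probes.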
Related papers
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio [118.75449542080746]
This paper presents the first systematic investigation of hallucinations in large multimodal models (LMMs).
Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning.
arXiv Detail & Related papers (2024-10-16T17:59:02Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
- Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs [12.48241058167222]
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions.
However, studies reveal that they often struggle with tasks that require reasoning, such as math or physics.
This raises questions about whether LLMs truly comprehend embedded knowledge or merely learn to replicate the token distribution without a true understanding of the content.
We propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T13:17:09Z)
- MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning [18.152453141040464]
Unsupervised Few-Shot Learning seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases.
We first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks.
To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM).
arXiv Detail & Related papers (2024-08-23T21:32:53Z)
- ICLEval: Evaluating In-Context Learning Ability of Large Language Models [68.7494310749199]
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs.
Existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability.
We introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning.
arXiv Detail & Related papers (2024-06-21T08:06:10Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skills.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) a fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning (see the 4-bit loading sketch below).
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
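As a concrete aside to the quantization study above, the sketch below shows one common way to load a causal LLM with 4-bit weights using Hugging Face transformers and bitsandbytes. The model name and quantization settings are assumptions for illustration, not necessarily the configuration used in that paper.

```python
# Illustrative 4-bit loading of a causal LLM with Hugging Face transformers +
# bitsandbytes. This is NOT the cited paper's exact setup; the model name and
# quantization settings below are assumptions for demonstration purposes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"  # hypothetical choice of model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights (the regime reported to preserve emergent abilities)
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Q: What is 17 + 25? A:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

4-bit NF4 loading like this cuts weight memory to roughly a quarter of fp16 while, per the study above, largely preserving emergent abilities; severe degradation was reported only at 2-bit precision.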