Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- URL: http://arxiv.org/abs/2310.00647v2
- Date: Mon, 22 Jan 2024 18:53:48 GMT
- Title: Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord
- Abstract summary: We evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following.
We explore training-free in-context learning (ICL) as a solution and study how it affects these limitations.
Based on our ICL study, we push ICL further and propose new multimodal ICL variants: Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
- Score: 105.77733287326308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performance (e.g., VQA accuracy) alone does not provide enough clues to understand their real capabilities and limitations, or the extent to which such models are aligned with human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs, from 3B up to 80B parameters, on 5 different axes: hallucinations, abstention, compositionality, explainability, and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we instead (2) explore training-free in-context learning (ICL) as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants: Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved by scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced: while it is effective at improving explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.
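The abstract names the three variants but not their prompt formats. As a minimal sketch of how such multimodal ICL demonstrations might be assembled (the function names, prompt templates, and image-tag convention below are illustrative assumptions, not taken from the EvALign-ICL repository):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Demo:
    image: str      # placeholder reference to an image, e.g. a file path
    question: str
    answer: str

def icl_prompt(demos: List[Demo], query_image: str, query_question: str) -> str:
    """Interleave image placeholders and text, Flamingo-style."""
    parts = [f"<image:{d.image}> Question: {d.question} Answer: {d.answer}"
             for d in demos]
    parts.append(f"<image:{query_image}> Question: {query_question} Answer:")
    return "\n".join(parts)

def chain_of_hindsight_demo(d: Demo, bad_answer: str) -> Demo:
    """Pair a good and a bad answer inside one demonstration."""
    return Demo(d.image, d.question, f"Good: {d.answer} Bad: {bad_answer}")

def self_correcting_demo(d: Demo, first_try: str) -> Demo:
    """Show an initial answer followed by its correction."""
    return Demo(d.image, d.question,
                f"First answer: {first_try}. Corrected answer: {d.answer}")

# Multitask-ICL: mix demonstrations from several tasks in a single context.
vqa = Demo("cat.jpg", "What animal is shown?", "A cat.")
caption = Demo("beach.jpg", "Describe the image.", "A sandy beach at sunset.")
demos = [vqa, caption, self_correcting_demo(vqa, "A small dog.")]
print(icl_prompt(demos, "dog.jpg", "What animal is shown?"))
```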
Related papers
- Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies.
A key property of this approach is that the step 2 critique model can give fine-grained feedback about errors.
We show that models of different capabilities benefit from refining with this approach on the task of improving factual consistency of document grounded summaries.
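The summary does not spell out the three competencies; one plausible reading, sketched below with a hypothetical text-in/text-out `llm` callable, decomposes refinement into detect, critique, and refine steps:

```python
def refine_with_feedback(llm, document: str, summary: str) -> str:
    """Hypothetical three-step decomposition: detect, critique, refine.
    `llm` is assumed to be a text-in/text-out callable."""
    # Step 1: decide whether the summary contains factual errors at all.
    verdict = llm(f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
                  "Does the summary contain factual errors? Answer yes or no.")
    if verdict.strip().lower().startswith("no"):
        return summary
    # Step 2: a critique model names each error in fine-grained detail.
    critique = llm(f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
                   "List each factual error in the summary, one per line.")
    # Step 3: rewrite the summary so it is consistent with the document.
    return llm(f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
               f"Errors:\n{critique}\n\n"
               "Rewrite the summary, fixing only these errors.")
```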
arXiv Detail & Related papers (2024-07-02T16:15:01Z)
- ICLEval: Evaluating In-Context Learning Ability of Large Language Models [68.7494310749199]
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs.
Existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability.
We introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning.
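As a rough illustration of the two sub-abilities (these toy probes are assumptions, not items from the ICLEval benchmark):

```python
# Exact copying: can the model reproduce a string given in its context?
copy_prompt = (
    "ID: x7f3-22ab -> x7f3-22ab\n"
    "ID: q9d1-85cd -> q9d1-85cd\n"
    "ID: m4k2-90ef ->"          # expected completion: " m4k2-90ef"
)

# Rule learning: can the model induce a format rule from the demos alone?
rule_prompt = (
    "apple -> APPLE!\n"
    "river -> RIVER!\n"
    "stone ->"                  # expected completion: " STONE!"
)

def check(model, prompt: str, expected: str) -> bool:
    """`model` is assumed to be a plain text-completion callable."""
    return model(prompt).strip() == expected.strip()
```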
arXiv Detail & Related papers (2024-06-21T08:06:10Z)
- UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating uncertainty in large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the strongest.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- DEEP-ICL: Definition-Enriched Experts for Language Model In-Context Learning [37.22553531518853]
It has long been assumed that the sheer number of parameters in large language models (LLMs) drives in-context learning (ICL) capabilities.
We introduce DEEP-ICL, a novel task Definition Enriched ExPert Ensembling methodology for ICL.
We argue that improvement from ICL does not directly rely on model size, but essentially stems from understanding task definitions and task-guided learning.
arXiv Detail & Related papers (2024-03-07T05:26:41Z)
- Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models [20.28989820878285]
Large language models (LLMs) have achieved remarkable advancements in natural language processing.
The massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments.
arXiv Detail & Related papers (2023-11-15T18:56:23Z)
- Using Natural Language Explanations to Improve Robustness of In-context Learning [35.18010811754959]
Large language models (LLMs) can excel in many tasks via in-context learning (ICL).
We investigate whether augmenting ICL with natural language explanations (NLEs) improves the robustness of LLMs on adversarial datasets.
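A minimal sketch of what an NLE-augmented demonstration might look like (the NLI-style template is an assumption, not the paper's exact format):

```python
def nle_demo(premise: str, hypothesis: str, explanation: str, label: str) -> str:
    # Each demonstration carries a natural language explanation before its label.
    return (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
            f"Explanation: {explanation}\nLabel: {label}\n")

demos = [
    nle_demo("A man plays guitar on stage.", "A person is making music.",
             "Playing guitar is a way of making music.", "entailment"),
]
query = ("Premise: A dog sleeps on a rug.\nHypothesis: The dog is running.\n"
         "Explanation:")
prompt = "\n".join(demos) + "\n" + query  # model completes explanation, then label
```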
arXiv Detail & Related papers (2023-11-13T18:49:13Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal more comprehensively how well language models understand questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
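One hedged reading of this diagnostic setup: score a model's answers to clean and perturbed versions of the same question with a pretrained reward model and compare. The `llm` and `reward_model` callables below are assumptions:

```python
import random

def perturb(text: str, rng: random.Random) -> str:
    """Toy word-level perturbation: swap two adjacent words."""
    words = text.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def robustness_gap(llm, reward_model, question: str, seed: int = 0) -> float:
    """Reward drop under perturbation. `llm` (text -> text) and
    `reward_model` ((question, answer) -> float) are assumed callables."""
    rng = random.Random(seed)
    clean_score = reward_model(question, llm(question))
    noisy_question = perturb(question, rng)
    noisy_score = reward_model(noisy_question, llm(noisy_question))
    return clean_score - noisy_score
```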
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-grained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
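As a practical illustration of loading a model in 4-bit (using the Hugging Face transformers integration with bitsandbytes; the model name is a placeholder, and this is not the paper's exact experimental setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit setup; bitsandbytes does not cover the 2-bit regime,
# which the paper studies with other quantization methods.
name = "meta-llama/Llama-2-7b-hf"  # placeholder model name
cfg = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, quantization_config=cfg, device_map="auto")

# A toy probe in the spirit of emergent-ability tests (simple arithmetic).
prompt = "Q: What is 17 + 25?\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```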
This list is automatically generated from the titles and abstracts of the papers on this site.