Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- URL: http://arxiv.org/abs/2310.00647v2
- Date: Mon, 22 Jan 2024 18:53:48 GMT
- Title: Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning
- Authors: Mustafa Shukor, Alexandre Rame, Corentin Dancette, Matthieu Cord
- Abstract summary: We evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following.
We explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
- Score: 105.77733287326308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following the success of Large Language Models (LLMs), Large Multimodal
Models (LMMs), such as the Flamingo model and its subsequent competitors, have
started to emerge as natural steps towards generalist agents. However,
interacting with recent LMMs reveals major limitations that are hardly captured
by the current evaluation benchmarks. Indeed, task performances (e.g., VQA
accuracy) alone do not provide enough clues to understand their real
capabilities, limitations, and to which extent such models are aligned to human
expectations. To refine our understanding of those flaws, we deviate from the
current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from
3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention,
compositionality, explainability and instruction following. Our evaluation on
these axes reveals major flaws in LMMs. While the current go-to solution to
align these models is based on training, such as instruction tuning or RLHF, we
rather (2) explore the training-free in-context learning (ICL) as a solution,
and study how it affects these limitations. Based on our ICL study, (3) we push
ICL further and propose new multimodal ICL variants such as; Multitask-ICL,
Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows.
(1) Despite their success, LMMs have flaws that remain unsolved with scaling
alone. (2) The effect of ICL on LMMs flaws is nuanced; despite its
effectiveness for improved explainability, answer abstention, ICL only slightly
improves instruction following, does not improve compositional abilities, and
actually even amplifies hallucinations. (3) The proposed ICL variants are
promising as post-hoc approaches to efficiently tackle some of those flaws. The
code is available here: https://github.com/mshukor/EvALign-ICL.
Related papers
- LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.
We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.
LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z) - Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs [12.48241058167222]
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions.
But studies reveal that they often struggle with tasks requiring reasoning, such as math or physics limitation.
This raises questions about whether LLMs truly comprehend embedded knowledge or merely learn to replicate the token distribution without a true understanding of the content.
We propose Decon Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T13:17:09Z) - MICM: Rethinking Unsupervised Pretraining for Enhanced Few-shot Learning [18.152453141040464]
Unsupervised Few-Shot Learning seeks to bridge this divide by reducing reliance on annotated datasets during initial training phases.
We first quantitatively assess the impacts of Masked Image Modeling (MIM) and Contrastive Learning (CL) on few-shot learning tasks.
To address these trade-offs between generalization and discriminability in unsupervised pretraining, we introduce a novel paradigm named Masked Image Contrastive Modeling (MICM)
arXiv Detail & Related papers (2024-08-23T21:32:53Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Do Emergent Abilities Exist in Quantized Large Language Models: An
Empirical Study [90.34226812493083]
This work aims to investigate the impact of quantization on emphemergent abilities, which are important characteristics that distinguish LLMs from small language models.
Our empirical experiments show that these emergent abilities still exist in 4-bit quantization models, while 2-bit models encounter severe performance degradation.
To improve the performance of low-bit models, we conduct two special experiments: (1) fine-gained impact analysis that studies which components (or substructures) are more sensitive to quantization, and (2) performance compensation through model fine-tuning.
arXiv Detail & Related papers (2023-07-16T15:11:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.