Law of the Weakest Link: Cross Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2409.19951v2
- Date: Wed, 2 Oct 2024 22:24:44 GMT
- Title: Law of the Weakest Link: Cross Capabilities of Large Language Models
- Authors: Ming Zhong, Aston Zhang, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten
- Abstract summary: We show that Large Language Models (LLMs) exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component.
These results highlight the under-performance of LLMs in cross-capability tasks.
- Score: 102.91861246827797
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between the strong and weak components but sit closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
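To make the weakest-link analysis concrete, below is a minimal sketch (not the authors' released code) of how a cross-capability score could be classified against the two individual-capability scores it pairs, mirroring the 38-versus-20 breakdown reported in the abstract; the function name and the example scores are hypothetical.

```python
# Minimal sketch: classify a cross-capability score relative to the two
# individual-capability scores it combines, in the spirit of the
# "Law of the Weakest Link" analysis described above.

def classify_cross_score(cross: float, cap_a: float, cap_b: float) -> str:
    """Return where a cross-capability score falls relative to its components."""
    weak, strong = sorted((cap_a, cap_b))
    if cross < weak:
        return "below both components (weakest-link)"
    if cross > strong:
        return "above both components"
    # Otherwise it lies between the two; report which component it sits closer to.
    return ("between, closer to the weaker ability"
            if cross - weak <= strong - cross
            else "between, closer to the stronger ability")

# Illustrative scores on a 1-100 scale (made up for this example), as
# (cross score, individual capability A, individual capability B).
examples = {
    "Coding & Reasoning": (61.0, 68.5, 72.3),
    "Image Recognition & Coding": (63.0, 58.0, 68.5),
}
for name, (cross, cap_a, cap_b) in examples.items():
    print(f"{name}: {classify_cross_score(cross, cap_a, cap_b)}")
```

Tallying these labels over a full score table would reproduce the kind of breakdown reported above (38 of 58 cross scores below both components, 20 in between but nearer the weaker one).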
Related papers
- MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs [9.644229985340033]
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users.
We identify four categories of challenges in multi-turn conversations that are common and realistic among current human-LLM interactions.
Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge.
arXiv Detail & Related papers (2025-01-29T03:29:24Z) - LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models.
First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks.
Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps.
Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
arXiv Detail & Related papers (2025-01-10T18:59:51Z) - AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs [70.4578433679737]
We introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks.
Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs.
The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z) - Evaluating and Advancing Multimodal Large Language Models in Ability Lens [30.083110119139793]
We introduce AbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities.
We identify the strengths and weaknesses of current models, highlighting stability patterns and revealing a notable performance gap between open-source and closed-source models.
We also design a simple ability-specific model merging method that combines the best ability checkpoint from early training stages, effectively mitigating performance decline due to ability conflict.
arXiv Detail & Related papers (2024-11-22T04:41:20Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z) - GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models [14.108788704400643]
GroundCocoa is a lexically diverse benchmark connecting compositional and conditional reasoning skills to the real-world problem of flight booking.
Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format.
Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
arXiv Detail & Related papers (2024-04-05T17:36:26Z) - PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities [40.55743949223173]
The Pragmatics Understanding Benchmark (PUB) is a dataset consisting of fourteen tasks across four pragmatics phenomena.
PUB includes a total of 28k data points, 6.1k of which were created by us, with the rest adapted from existing datasets.
Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller language models.
arXiv Detail & Related papers (2024-01-13T13:46:14Z) - Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web [69.6913064185993]
Language model agents (LMAs) have emerged as a promising paradigm for multi-step decision-making tasks.
Despite the promise, their performance on real-world applications is still underexplored.
We show that while existing LMAs achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks.
arXiv Detail & Related papers (2023-11-30T17:50:47Z) - Better Zero-Shot Reasoning with Role-Play Prompting [10.90357246745529]
Role-play prompting consistently surpasses the standard zero-shot approach across most datasets.
This highlights its potential to augment the reasoning capabilities of large language models.
arXiv Detail & Related papers (2023-08-15T11:08:30Z)
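As a minimal, hypothetical illustration of the last entry, the sketch below contrasts a role-play prompt with a plain zero-shot prompt; it shows only the general prompt construction (the message format follows the common chat-completion convention), not the paper's exact setup, and `query_model` is a placeholder for whatever LLM client is in use.

```python
# Hypothetical sketch: role-play prompting vs. a standard zero-shot prompt.

QUESTION = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Standard zero-shot prompt: just the question.
zero_shot_messages = [
    {"role": "user", "content": QUESTION},
]

# Role-play prompt: assign the model a persona before posing the same question.
role_play_messages = [
    {"role": "system", "content": "You are a meticulous math teacher who reasons "
                                  "through problems step by step before answering."},
    {"role": "user", "content": QUESTION},
]

def query_model(messages):
    """Placeholder: swap in an actual LLM client (e.g., an OpenAI-compatible chat API)."""
    raise NotImplementedError

# answer = query_model(role_play_messages)
```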
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.