Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
- URL: http://arxiv.org/abs/2402.13887v2
- Date: Tue, 9 Jul 2024 10:46:29 GMT
- Title: Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
- Authors: Chenyang Lyu, Minghao Wu, Alham Fikri Aji
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications.
This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs).
Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction.
- Score: 24.445829787297658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of natural language processing (NLP) research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for predictions, primarily due to computational constraints, diverging from real-world LLM usage scenarios. While widely employed, the efficacy of these probability-based evaluation strategies remains an open research question. This study scrutinizes the validity of such probability-based evaluation methods in the context of using LLMs for Multiple Choice Questions (MCQs), highlighting their inherent limitations. Our empirical investigation reveals that the prevalent probability-based evaluation method aligns poorly with generation-based prediction: the answer option with the highest output probability often differs from the answer the model produces when asked to generate a response directly. The outcomes of our study can enhance the understanding of LLM evaluation methodologies and provide insights for future research in this domain.
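To make the contrast concrete, below is a minimal sketch (in Python, using Hugging Face transformers) of the two MCQ evaluation modes the abstract compares: probability-based option scoring versus free-form generation followed by answer parsing. The model name, prompt template, and answer-parsing rule are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (not the paper's exact protocol) contrasting the two MCQ
# evaluation modes: probability-based option scoring vs. free-form generation.
# Model name, prompt template, and answer-parsing rule are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing logits works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "Which planet is closest to the Sun?"
options = {"A": "Mercury", "B": "Venus", "C": "Earth", "D": "Mars"}
prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items()) + "\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# 1) Probability-based prediction: compare next-token log-probabilities of the
#    option labels and take the argmax (the cheap route most harnesses use).
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
log_probs = torch.log_softmax(logits, dim=-1)
prob_pred = max(options, key=lambda k: log_probs[tokenizer.encode(" " + k)[0]].item())

# 2) Generation-based prediction: decode freely, then parse the first option
#    label that appears in the continuation (closer to real-world usage).
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
gen_pred = next((k for k in options if k in continuation), None)

# The paper's observation: across benchmarks, these two predictions often disagree.
print("probability-based:", prob_pred, "| generation-based:", gen_pred)
```

Disagreement between the two predictions over a benchmark is the misalignment the study measures; the probability-based route is cheaper (one forward pass per question) but can diverge from how the model actually answers when allowed to generate.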
Related papers
- Bayesian Evaluation of Large Language Model Behavior [11.847752638476257]
It is increasingly important to evaluate how text generation systems based on large language models behave.
Existing approaches to evaluation often neglect statistical uncertainty quantification.
We present two case studies applying a Bayesian approach for quantifying uncertainty in binary evaluation metrics (a minimal illustrative sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-11-04T19:51:46Z) - Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs [47.20307724127832]
We present the first comprehensive study of the reasoning capabilities of large language models (LLMs).
We evaluate models on three carefully designed tasks: mode identification, maximum likelihood estimation, and sample generation.
Through empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models.
arXiv Detail & Related papers (2025-09-12T22:58:05Z) - Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting.
ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context.
CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines.
IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
arXiv Detail & Related papers (2025-08-13T16:02:55Z) - A comparative analysis of machine learning algorithms for predicting probabilities of default [1.534667887016089]
Predicting the probability of default (PD) of prospective loans is a critical objective for financial institutions.
In recent years, machine learning (ML) algorithms have achieved remarkable success across a wide variety of prediction tasks.
This paper highlights the opportunities that ML algorithms offer to this field by comparing the performance of five predictive models.
arXiv Detail & Related papers (2025-06-24T16:56:07Z) - Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges [13.526258635654882]
This study introduces a Bayesian approach for large language models (LLMs) capability assessment.
We treat model capabilities as latent variables and leverage a curated query set to induce discriminative responses.
Experimental evaluations with GPT-series models demonstrate that the proposed method achieves superior discrimination compared to conventional evaluation methods.
arXiv Detail & Related papers (2025-04-30T04:24:50Z) - A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
We derive novel metrics with high-probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers [13.644277507363036]
We investigate whether these abilities are measurable outside of tailored prompting and MCQ.
Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer.
As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture.
arXiv Detail & Related papers (2024-06-21T08:56:35Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language [35.84181171987974]
Our goal is to build a regression model that can process numerical data and make probabilistic predictions at arbitrary locations.
We start by exploring strategies for eliciting explicit, coherent numerical predictive distributions from Large Language Models.
We demonstrate the ability to usefully incorporate text into numerical predictions, improving predictive performance and giving quantitative structure that reflects qualitative descriptions.
arXiv Detail & Related papers (2024-05-21T15:13:12Z) - Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method ProbDiff for assessing the efficacy of various Large Language Models (LLMs).
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions (a minimal illustrative sketch of this idea appears after this list).
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
arXiv Detail & Related papers (2024-05-17T03:50:28Z) - Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks.
In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention.
We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types.
These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from an intervention from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z) - Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z) - Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models [21.438427686724932]
We introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios.
Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction.
Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods.
arXiv Detail & Related papers (2023-10-20T13:14:38Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
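As referenced in the "Bayesian Evaluation of Large Language Model Behavior" entry above, here is a minimal sketch of Bayesian uncertainty quantification for a binary evaluation metric. The Beta-Binomial model, uniform prior, and counts below are standard illustrative assumptions, not necessarily that paper's exact formulation.

```python
# Minimal sketch: Bayesian uncertainty for a binary evaluation metric
# (e.g., per-example pass/fail accuracy) via a conjugate Beta-Binomial model.
# Counts and prior below are hypothetical, for illustration only.
from scipy import stats

successes, trials = 78, 100          # hypothetical evaluation outcomes
alpha0, beta0 = 1.0, 1.0             # uniform Beta(1, 1) prior
posterior = stats.beta(alpha0 + successes, beta0 + trials - successes)

point_estimate = posterior.mean()
low, high = posterior.interval(0.95)  # 95% credible interval
print(f"accuracy ~ {point_estimate:.3f}, 95% credible interval [{low:.3f}, {high:.3f}]")
```

Reporting the credible interval alongside the point estimate is the kind of statistical uncertainty quantification that the entry notes existing evaluations often neglect.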
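Similarly, for the "Language Models can Evaluate Themselves via Probability Discrepancy" entry, here is a sketch of a probability-discrepancy style score: the model's own length-normalized log-likelihood of an initial response is compared against that of a revised version. The helper function, normalization, and sign convention are assumptions for illustration, not ProbDiff's exact definition.

```python
# Minimal sketch of a probability-discrepancy style self-evaluation score:
# compare the model's own (length-normalized) log-likelihood of an initial
# response against a revised version. Normalization and sign convention are
# illustrative assumptions, not the cited method's exact definition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_log_likelihood(prompt: str, response: str) -> float:
    """Mean per-token log-probability the model assigns to `response` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    resp_lp = token_lp[:, prompt_ids.shape[1] - 1:]   # keep only response tokens
    return resp_lp.mean().item()

prompt = "Q: Why is the sky blue?\nA:"
initial = " Because of Rayleigh scattering of sunlight."
revised = " Because shorter wavelengths scatter more strongly in the atmosphere."

# A larger gap in favor of the revision suggests the initial answer was weaker.
discrepancy = avg_log_likelihood(prompt, revised) - avg_log_likelihood(prompt, initial)
print(f"probability discrepancy: {discrepancy:.4f}")
```

Intuitively, if the model assigns much higher likelihood to the revision than to its initial answer, the initial answer was likely weak; aggregating this discrepancy over a query set yields a self-evaluation signal without an external judge model.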
This list is automatically generated from the titles and abstracts of the papers in this site.