Large Language Model Evaluation via Matrix Entropy
- URL: http://arxiv.org/abs/2401.17139v1
- Date: Tue, 30 Jan 2024 16:19:55 GMT
- Title: Large Language Model Evaluation via Matrix Entropy
- Authors: Lai Wei, Zhiquan Tan, Chenghai Li, Jindong Wang, Weiran Huang
- Abstract summary: We introduce matrix entropy, a novel metric rooted in information theory and geometry principles to quantify the data compression proficiency of large language models (LLMs).
For language models, our findings reveal that the matrix entropy of representations follows a scaling-law-type reduction as the model scales up, serving as a complement to the traditional loss scaling law.
For the multi-modal setting, we also propose an evaluation method based on matrix entropy for assessing alignment quality.
- Score: 11.455818555226942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have revolutionized the field of natural language processing, extending their strong capabilities into multi-modal domains. Thus, it is vital to define proper and diversified metrics for the evaluation of LLMs.
In this paper, we introduce matrix entropy, a novel metric rooted in information theory and geometry principles to quantify the data compression proficiency of LLMs. It reflects the model's ability to extract relevant information and eliminate unnecessary elements, thereby providing insight into the language model's intrinsic capability. Specifically, we demonstrate its applicability in both single-modal (language) and multi-modal settings. For language models, our findings reveal that the matrix entropy of representations follows a scaling-law-type reduction as the model scales up, serving as a complement to the traditional loss scaling law. For the multi-modal setting, we also propose an evaluation method based on matrix entropy for assessing alignment quality, and we find that modern large multi-modal models exhibit strong alignment performance.
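As a rough illustration of the metric, the following NumPy sketch computes one plausible reading of matrix entropy: the von Neumann entropy of the trace-normalized covariance of centered, L2-normalized hidden states, scaled by log d. The normalization choices here are our assumptions, not necessarily the paper's exact definition.

```python
import numpy as np

def matrix_entropy(Z: np.ndarray) -> float:
    """Von Neumann entropy of the trace-normalized covariance of Z (n x d).

    Lower values mean the representations are compressed into fewer
    directions, which the paper ties to stronger models.
    """
    Z = Z - Z.mean(axis=0, keepdims=True)             # center each dimension
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # L2-normalize each row
    K = (Z.T @ Z) / Z.shape[0]                        # d x d covariance (PSD)
    K = K / np.trace(K)                               # eigenvalues now sum to 1
    lam = np.linalg.eigvalsh(K)                       # real spectrum of symmetric K
    lam = lam[lam > 1e-12]                            # drop numerical zeros
    return float(-(lam * np.log(lam)).sum() / np.log(K.shape[0]))

# Stand-in for hidden states: 1024 tokens from a 64-dim model.
hidden = np.random.default_rng(0).normal(size=(1024, 64))
print(matrix_entropy(hidden))  # near 1 for unstructured random data
```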
Related papers
- Discrete Diffusion in Large Language and Multimodal Models: A Survey [56.31088116526825]
We provide a systematic survey of Discrete Diffusion Language Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs). Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token, parallel decoding paradigm. We trace the historical development of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and categorize representative models.
arXiv Detail & Related papers (2025-06-16T17:59:08Z)
- CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships? [5.246809683975664]
This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs.
Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity.
arXiv Detail & Related papers (2025-02-16T22:54:44Z)
- RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering [9.915889321513678]
RAMQA is a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques.
Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations.
arXiv Detail & Related papers (2025-01-23T00:50:33Z)
- Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications.
Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics.
We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
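As a toy illustration of the uncertainty-signal idea in the entry above, the sketch below ranks hypothetical models by the mean entropy of their softmax outputs over unlabeled questions; all names are placeholders, and the paper studies richer signals than this.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> float:
    """Entropy of one softmax distribution over candidate answers."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def rank_models(model_probs: dict) -> list:
    """Rank models by mean predictive entropy over unlabeled inputs;
    lower mean entropy (more confident) ranks first."""
    scores = {name: float(np.mean([predictive_entropy(p) for p in outs]))
              for name, outs in model_probs.items()}
    return sorted(scores, key=scores.get)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Two hypothetical models answering three unlabeled questions each;
# model_a's sharper (scaled) logits make it more confident.
rng = np.random.default_rng(0)
probs = {"model_a": [softmax(3 * rng.normal(size=32)) for _ in range(3)],
         "model_b": [softmax(rng.normal(size=32)) for _ in range(3)]}
print(rank_models(probs))  # likely ['model_a', 'model_b']
```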
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) tasks or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from a massive pool, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench [17.73279547506514]
We introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning.
MLLMU-Bench consists of 500 fictitious profiles and 153 profiles of public celebrities; each profile features over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives.
Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
arXiv Detail & Related papers (2024-10-29T15:07:23Z)
- LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models [1.3108652488669736]
We propose a novel pruning method utilising centrality measures from graph theory, reducing both the computational requirements and the memory footprint of these models.
We call this pruning method MLPRank. Furthermore, we introduce an extension to decoder-only transformer models and call it LLMRank.
arXiv Detail & Related papers (2024-10-17T07:55:47Z)
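The entry above describes pruning via graph centrality. Below is a hedged reconstruction of that general idea: plain weighted PageRank over a toy two-layer MLP's connectivity graph, followed by structured removal of the least central hidden units. It is not the authors' exact MLPRank/LLMRank formulation.

```python
import numpy as np

def pagerank(A: np.ndarray, damping: float = 0.85, iters: int = 100) -> np.ndarray:
    """Power-iteration PageRank where A[i, j] is the |weight| of edge i -> j."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)                     # outgoing weight per node
    P = np.where(out > 0, A / np.where(out > 0, out, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (P.T @ r)        # teleport + walk step
    return r

# Toy 2-layer MLP: 4 inputs -> 8 hidden -> 3 outputs; nodes = 4 + 8 + 3 = 15.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))  # (out, in) layout
n_in, n_hid = 4, 8
A = np.zeros((15, 15))
A[:n_in, n_in:n_in + n_hid] = np.abs(W1).T                 # input -> hidden edges
A[n_in:n_in + n_hid, n_in + n_hid:] = np.abs(W2).T         # hidden -> output edges
scores = pagerank(A)[n_in:n_in + n_hid]                    # centrality of hidden units
keep = np.sort(np.argsort(scores)[-4:])                    # keep 4 most central units
W1_pruned, W2_pruned = W1[keep], W2[:, keep]               # structured pruning
print(keep, W1_pruned.shape, W2_pruned.shape)
```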
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) can produce inaccurate outputs, known as hallucinations.
This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators.
The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
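In the spirit of the entry above, a minimal sketch: four probability-derived features per generation feeding a simple supervised classifier. The specific feature set and the synthetic labels here are our illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probability_features(logprobs: np.ndarray, entropies: np.ndarray) -> np.ndarray:
    """Four scalar features for one generation: mean/min token log-probability
    and mean/max next-token entropy (an illustrative feature set)."""
    return np.array([logprobs.mean(), logprobs.min(),
                     entropies.mean(), entropies.max()])

# Synthetic training set: 200 generations of 20 tokens each; label 1 = hallucination.
rng = np.random.default_rng(0)
X = np.stack([probability_features(rng.normal(-2.0, 1.0, 20), rng.gamma(2.0, 1.0, 20))
              for _ in range(200)])
y = (X[:, 1] < -3.5).astype(int)      # toy labels tied to low min log-probability
clf = LogisticRegression().fit(X, y)  # the lightweight supervised detector
print(clf.predict(X[:5]))             # 1 = flagged as likely hallucination
```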
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents MME, the first comprehensive MLLM evaluation benchmark.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
- LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion [33.73671362609599]
Our framework consists of two modules: PairRanker and GenFuser.
PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs.
GenFuser aims to merge the top-ranked candidates, generating an improved output.
arXiv Detail & Related papers (2023-06-05T03:32:26Z)
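A minimal sketch of the two-stage ensembling pattern described above: round-robin pairwise ranking, then a fusion prompt over the top candidates. Here `compare` and `generate` are hypothetical stand-ins for PairRanker's learned comparator and GenFuser's generator, and the prompt format is our assumption.

```python
from itertools import combinations
from typing import Callable

def pair_rank(candidates: list, compare: Callable[[str, str], int]) -> list:
    """Rank candidate outputs by round-robin pairwise comparisons.
    compare(a, b) returns 1 if a is better, -1 if b is better, 0 for a tie."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        outcome = compare(a, b)
        if outcome > 0:
            wins[a] += 1
        elif outcome < 0:
            wins[b] += 1
    return sorted(candidates, key=wins.get, reverse=True)

def fuse(top_candidates: list, generate: Callable[[str], str], question: str) -> str:
    """GenFuser-style step: prompt a generator to merge the top-ranked
    candidates into one improved answer."""
    prompt = f"Question: {question}\n" + "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(top_candidates)
    ) + "\nFuse the candidates into a single best answer:"
    return generate(prompt)

# Usage: ranked = pair_rank(outputs, compare); answer = fuse(ranked[:3], llm_generate, q)
```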
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.