HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
- URL: http://arxiv.org/abs/2412.14613v1
- Date: Thu, 19 Dec 2024 08:03:16 GMT
- Title: HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model
- Authors: Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
- Abstract summary: Vision-language models (VLMs) have shown impressive abilities in text and image understanding.
Existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality.
We propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner.
- Score: 42.62148712511799
- Abstract: Vision-language models (VLMs) have shown impressive abilities in text and image understanding. However, existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality, leading to two limitations: 1) it is challenging to identify which aspects of the text need improvement from the overall score; 2) metrics may overlook specific evaluation criteria when predicting an overall score. To address these limitations, we propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four vision-language tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.
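To make the bottom-up aggregation concrete, here is a minimal Python sketch: a VLM judge is assumed to return a 1-5 score for each criterion, and the criterion-wise scores are then combined into a single overall score. The criterion names and the harmonic-mean aggregation are illustrative assumptions suggested by the metric's name, not necessarily the exact weighting scheme used in the paper.

```python
from statistics import harmonic_mean

# Hypothetical criterion-wise scores (1-5) that a VLM judge might assign to one
# generated caption. Criterion names and the harmonic-mean aggregation are
# illustrative assumptions, not the paper's exact formulation.
criterion_scores = {
    "fluency": 5.0,
    "relevance": 4.0,
    "descriptiveness": 3.0,
    "factual_correctness": 4.0,
}


def aggregate_overall(scores: dict[str, float]) -> float:
    """Bottom-up aggregation: combine per-criterion scores into one overall score."""
    return harmonic_mean(scores.values())


print(f"Overall score: {aggregate_overall(criterion_scores):.2f}")
```

In this sketch the criterion-wise scores remain available for inspection, which is the point of the bottom-up design: the overall score is derived from, rather than predicted alongside, the per-criterion judgments.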
Related papers
- Multilingual European Language Models: Benchmarking Approaches and Challenges [2.413212225810367]
Generative large language models (LLMs) can solve different tasks through chat interaction.
This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks.
We discuss potential solutions to improve translation quality and reduce cultural biases, including human-in-the-loop verification and iterative translation ranking.
arXiv Detail & Related papers (2025-02-18T14:32:17Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Large Language Models Are Active Critics in NLG Evaluation [9.932334723464129]
Active-Critic is a novel evaluator that transforms large language models (LLMs) into "active critics".
Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria.
arXiv Detail & Related papers (2024-10-14T17:04:41Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1,002 human-annotated data samples, covering 12 domains and 46 tasks based on the HHH (helpful, honest, harmless) principle.
arXiv Detail & Related papers (2024-03-26T16:10:21Z) - Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to human evaluators.
This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning).
We find that LLM evaluators can generate unnecessary criteria or omit crucial ones, resulting in slight deviations from expert judgments.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - Perturbation CheckLists for Evaluating NLG Evaluation Metrics [16.20764980129339]
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria.
Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated.
This suggests that the current recipe of proposing new automatic evaluation metrics for NLG is inadequate.
arXiv Detail & Related papers (2021-09-13T08:26:26Z)