The Eval4NLP 2023 Shared Task on Prompting Large Language Models as
Explainable Metrics
- URL: http://arxiv.org/abs/2310.19792v1
- Date: Mon, 30 Oct 2023 17:55:08 GMT
- Title: The Eval4NLP 2023 Shared Task on Prompting Large Language Models as
Explainable Metrics
- Authors: Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror,
Steffen Eger
- Abstract summary: Generative large language models (LLMs) have shown remarkable capabilities to solve tasks with minimal or no task-related examples.
We introduce the Eval4NLP 2023 shared task that asks participants to explore prompting and score extraction for machine translation (MT) and summarization evaluation.
We present an overview of participants' approaches and evaluate them on a new reference-free test set spanning three language pairs for MT and a summarization dataset.
- Score: 36.52897053496835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With an increasing number of parameters and pre-training data, generative
large language models (LLMs) have shown remarkable capabilities to solve tasks
with minimal or no task-related examples. Notably, LLMs have been successfully
employed as evaluation metrics in text generation tasks. Within this context,
we introduce the Eval4NLP 2023 shared task that asks participants to explore
prompting and score extraction for machine translation (MT) and summarization
evaluation. Specifically, we propose a novel competition setting in which we
select a list of allowed LLMs and disallow fine-tuning to ensure a focus on
prompting. We present an overview of participants' approaches and evaluate them
on a new reference-free test set spanning three language pairs for MT and a
summarization dataset. Notably, despite the task's restrictions, the
best-performing systems achieve results on par with or even surpassing recent
reference-free metrics developed using larger models, including GEMBA and
Comet-Kiwi-XXL. Finally, as a separate track, we perform a small-scale human
evaluation of the plausibility of explanations given by the LLMs.
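Concretely, a participating system has to do two things with one of the allowed LLMs: build an evaluation prompt for a source/hypothesis pair and extract a numeric score from the model's free-form reply. The sketch below illustrates that loop under stated assumptions; the prompt wording, the 0-100 scale, and the `llm` callable are illustrative placeholders, not the shared task's official baseline.

```python
import re
from typing import Callable, Optional

def build_prompt(source: str, hypothesis: str) -> str:
    # Illustrative reference-free MT prompt; the shared task leaves the
    # exact wording and rating scale up to each participant.
    return (
        "Rate how well the translation conveys the meaning of the source "
        "on a scale from 0 (worst) to 100 (best). Reply with a number only.\n"
        f"Source: {source}\nTranslation: {hypothesis}\nScore:"
    )

def extract_score(answer: str) -> Optional[float]:
    # Take the first number in the model's reply; return None if no number is found.
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(match.group()) if match else None

def score_pair(llm: Callable[[str], str], source: str, hypothesis: str) -> Optional[float]:
    # `llm` stands in for a text-generation call to one of the task's allowed models.
    return extract_score(llm(build_prompt(source, hypothesis)))

if __name__ == "__main__":
    dummy_llm = lambda prompt: " 87 "  # placeholder instead of a real model call
    print(score_pair(dummy_llm, "Der Hund schläft.", "The dog is sleeping."))
```

In practice, participants varied exactly these two pieces (the prompt format and the extraction rule), which is what the task's "prompting and score extraction" framing refers to.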
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Meta-Task Prompting Elicits Embeddings from Large Language Models [54.757445048329735]
We introduce a new unsupervised text embedding method, Meta-Task Prompting with Explicit One-Word Limitation.
We generate high-quality sentence embeddings from Large Language Models without the need for model fine-tuning.
Our findings suggest a new scaling law, offering a versatile and resource-efficient approach for embedding generation across diverse scenarios.
arXiv Detail & Related papers (2024-02-28T16:35:52Z)
- Exploring Prompting Large Language Models as Explainable Metrics [0.0]
We propose a zero-shot, prompt-based strategy for explainable evaluation of the summarization task using Large Language Models (LLMs).
The conducted experiments demonstrate the promising potential of LLMs as evaluation metrics in Natural Language Processing (NLP).
Our best prompts achieved a Kendall correlation of 0.477 with human evaluations on the text summarization test data (a brief correlation sketch follows this list).
arXiv Detail & Related papers (2023-11-20T06:06:22Z)
- Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task [53.163534619649866]
This paper focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation.
We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting.
Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.
arXiv Detail & Related papers (2023-11-01T17:44:35Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
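Several of the entries above report segment-level Kendall correlations between metric scores and human judgments (e.g., the 0.477 mentioned for the explainable-metrics paper). A minimal sketch of that computation with `scipy.stats.kendalltau` is shown below; the score lists are illustrative placeholders, not values from any of the papers.

```python
from scipy.stats import kendalltau

# Toy values for illustration only; real use pairs one metric score and one
# human judgment per evaluated segment.
metric_scores = [72.0, 65.5, 90.0, 40.0, 81.0]
human_scores = [70.0, 60.0, 95.0, 45.0, 78.0]

tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau: {tau:.3f} (p = {p_value:.3f})")
```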