BLESS: Benchmarking Large Language Models on Sentence Simplification
- URL: http://arxiv.org/abs/2310.15773v1
- Date: Tue, 24 Oct 2023 12:18:17 GMT
- Title: BLESS: Benchmarking Large Language Models on Sentence Simplification
- Authors: Tannon Kew, Alison Chi, Laura Vásquez-Rodríguez, Sweta Agrawal,
Dennis Aumiller, Fernando Alva-Manchego, Matthew Shardlow
- Abstract summary: We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
- Score: 55.461555829492866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BLESS, a comprehensive performance benchmark of the most recent
state-of-the-art large language models (LLMs) on the task of text
simplification (TS). We examine how well off-the-shelf LLMs can solve this
challenging task, assessing a total of 44 models, differing in size,
architecture, pre-training methods, and accessibility, on three test sets from
different domains (Wikipedia, news, and medical) under a few-shot setting. Our
analysis considers a suite of automatic metrics as well as a large-scale
quantitative investigation into the types of common edit operations performed
by the different models. Furthermore, we perform a manual qualitative analysis
on a subset of model outputs to better gauge the quality of the generated
simplifications. Our evaluation indicates that the best LLMs, despite not being
trained on TS, perform comparably with state-of-the-art TS baselines.
Additionally, we find that certain LLMs demonstrate a greater range and
diversity of edit operations. Our performance benchmark will be available as a
resource for the development of future TS methods and evaluation metrics.
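The edit-operation analysis mentioned in the abstract can be made concrete with a short sketch. The following Python snippet is an assumed, minimal illustration rather than the paper's actual analysis pipeline: it tokenizes on whitespace and uses Python's standard difflib to count keep/replace/insert/delete operations between a source sentence and a candidate simplification, whereas BLESS carries out a more fine-grained study of edit types across models and test sets.

```python
# Minimal sketch (assumed, not the BLESS pipeline): count token-level edit
# operations between a source sentence and its simplification using difflib.
from collections import Counter
from difflib import SequenceMatcher


def count_edit_operations(source: str, simplification: str) -> Counter:
    """Return counts of keep/replace/insert/delete token operations."""
    src, simp = source.split(), simplification.split()
    ops = Counter()
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=src, b=simp).get_opcodes():
        if tag == "equal":
            ops["keep"] += i2 - i1                    # tokens copied unchanged
        elif tag == "replace":
            ops["replace"] += max(i2 - i1, j2 - j1)   # substituted spans
        elif tag == "insert":
            ops["insert"] += j2 - j1                  # tokens added by the model
        elif tag == "delete":
            ops["delete"] += i2 - i1                  # tokens dropped from the source
    return ops


if __name__ == "__main__":
    src = "The physician administered the medication intravenously ."
    simp = "The doctor gave the medicine through a vein ."
    print(count_edit_operations(src, simp))  # prints a Counter of edit-operation counts
```

In this hypothetical setup, aggregating such counts over a system's outputs gives a rough profile of how heavily it paraphrases, deletes, or copies, which is the kind of comparison the paper's edit-operation investigation targets.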
Related papers
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs [47.94710556156627]
MIA-Bench is a benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions.
Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions.
arXiv Detail & Related papers (2024-07-01T17:53:35Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models that can use the original formulation.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Multi-Conditional Ranking with Large Language Models [4.390998479503661]
Using large language models to rank a set of items has become a common approach in recommendation and retrieval systems.
However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions.
We propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items.
arXiv Detail & Related papers (2024-03-30T01:26:05Z)
- T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step [69.64348626180623]
Large language models (LLMs) have achieved remarkable performance on various NLP tasks.
However, how to evaluate and analyze the tool-utilization capability of LLMs remains under-explored.
We introduce T-Eval to evaluate the tool utilization capability step by step.
arXiv Detail & Related papers (2023-12-21T17:02:06Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the state of the art (SotA) when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Scaling Sentence Embeddings with Large Language Models [43.19994568210206]
In this work, we propose an in-context learning-based method aimed at improving sentence embedding performance.
Our approach involves adapting the previous prompt-based representation method for autoregressive models.
By scaling model size, we find that scaling to more than tens of billions of parameters harms performance on semantic textual similarity tasks.
arXiv Detail & Related papers (2023-07-31T13:26:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.