Related papers: Unraveling the Capabilities of Language Models in News Summarization

URL: http://arxiv.org/abs/2501.18128v1
Date: Thu, 30 Jan 2025 04:20:16 GMT
Title: Unraveling the Capabilities of Language Models in News Summarization
Authors: Abdurrahman Odabaşı, Göksel Biricik
Abstract summary: This work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task. We focus in this study on zero-shot and few-shot learning settings and we apply a robust evaluation methodology. We highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given the recent introduction of multiple language models and the ongoing demand for improved Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task. In this work, we systematically test the capabilities and effectiveness of these models in summarizing news article texts which are written in different styles and presented in three distinct datasets. Specifically, we focus in this study on zero-shot and few-shot learning settings and we apply a robust evaluation methodology that combines different evaluation concepts including automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot learning setting did not enhance models' performance and, in some cases, even led to worse quality of the generated summaries. This issue arises mainly due to the poor quality of the gold summaries that have been used as reference summaries, which negatively impacts the models' performance. Furthermore, our study's results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities. However, among the public models evaluated, certain models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B and Zephyr-7B-Beta demonstrated promising results. These models showed significant potential, positioning them as competitive alternatives to large models for the task of news summarization.

Related papers

EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving [61.99289768925256]
EvaLearn is a benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks. We benchmark nine frontier models and observe varied performance profiles. We observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks.
arXiv Detail & Related papers (2025-06-03T09:18:33Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
Small Language Models are Good Too: An Empirical Study of Zero-Shot Classification [4.4467858321751015]
We benchmark language models from 77M to 40B parameters using different architectures and scoring functions. Our findings reveal that small models can effectively classify texts, getting on par with or surpassing their larger counterparts. This research underscores the notion that bigger isn't always better, suggesting that resource-efficient small models may offer viable solutions for specific data classification challenges.
arXiv Detail & Related papers (2024-04-17T07:10:28Z)
A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check. Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models. The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance. We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios. We extend the vocabulary of LLaMA - the model with the closest open-source performance to proprietary language models like GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z)
Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting. The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models [17.914521288548844]
We argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted. Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality.
arXiv Detail & Related papers (2022-11-19T06:39:43Z)
On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models. It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
A Systematic Investigation of Commonsense Understanding in Large Language Models [23.430757316504316]
Large language models have shown impressive performance on many natural language processing (NLP) tasks in a zero-shot setting. We ask whether these models exhibit commonsense understanding by evaluating models against four commonsense benchmarks.
arXiv Detail & Related papers (2021-10-31T22:20:36Z)
Comparative Study of Language Models on Cross-Domain Data with Model Agnostic Explainability [0.0]
The study compares state-of-the-art language models: BERT, ELECTRA, and their derivatives, which include RoBERTa, ALBERT and DistilBERT. The experimental results establish a new state of the art for the 2013 rating classification task and the Financial Phrasebank sentiment detection task, with 69% and 88.2% accuracy respectively.
arXiv Detail & Related papers (2020-09-09T04:31:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.