Benchmarking LLMs on the Semantic Overlap Summarization Task
- URL: http://arxiv.org/abs/2402.17008v1
- Date: Mon, 26 Feb 2024 20:33:50 GMT
- Title: Benchmarking LLMs on the Semantic Overlap Summarization Task
- Authors: John Salvador, Naman Bansal, Mousumi Akter, Souvika Sarkar, Anupam
Das, and Shubhra Kanti Karmaker ("Santu")
- Abstract summary: This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task.
We report well-established metrics like ROUGE, BERTscore, and SEM-F1 on two different datasets of alternative narratives.
- Score: 9.656095701778975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic Overlap Summarization (SOS) is a constrained multi-document
summarization task, where the constraint is to capture the common/overlapping
information between two alternative narratives. While recent advancements in
Large Language Models (LLMs) have achieved superior performance in numerous
summarization tasks, a benchmarking study of the SOS task using LLMs is yet to
be performed. As LLMs' responses are sensitive to slight variations in prompt
design, a major challenge in conducting such a benchmarking study is to
systematically explore a variety of prompts before drawing a reliable
conclusion. Fortunately, very recently, the TELeR taxonomy has been proposed
which can be used to design and explore various prompts for LLMs. Using this
TELeR taxonomy and 15 popular LLMs, this paper comprehensively evaluates LLMs
on the SOS Task, assessing their ability to summarize overlapping information
from multiple alternative narratives. For evaluation, we report
well-established metrics like ROUGE, BERTscore, and SEM-F1 on two different
datasets of alternative narratives. We conclude the paper by analyzing the
strengths and limitations of various LLMs in terms of their capabilities in
capturing overlapping information. The code and datasets used to conduct this
study are available at https://anonymous.4open.science/r/llm_eval-E16D.
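As a rough illustration of the kind of overlap metric the abstract mentions, the following is a minimal sketch of a simplified ROUGE-1 F1 score (unigram overlap between a generated summary and a reference). It is not the paper's evaluation code: the function name `rouge1_f1`, the whitespace tokenization, and the example sentences are assumptions made for illustration, and standard ROUGE implementations add further options such as stemming.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization, no stemming or stopword removal.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences only; they are not taken from the paper's datasets.
reference = "the senate passed the budget bill on tuesday"
candidate = "the senate passed the bill"
score = rouge1_f1(candidate, reference)
print(round(score, 3))
```

In this example every candidate token appears in the reference, so precision is 1.0 while recall is 5/8, which shows how a short overlap summary can be precise yet still miss shared content.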
Related papers
- Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Revisiting Large Language Models as Zero-shot Relation Extractors [8.953462875381888]
Relation extraction (RE) consistently involves a certain degree of labeled or unlabeled data, even in the zero-shot setting.
Recent studies have shown that large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt.
This work explores LLMs as zero-shot relation extractors.
arXiv Detail & Related papers (2023-10-08T06:17:39Z)
- Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
- Large Language Models for Software Engineering: A Systematic Literature Review [34.12458948051519]
Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE).
We select and analyze 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs).
From the answers to these RQs, we discuss the current state-of-the-art and trends, identifying gaps in existing research, and flagging promising areas for future study.
arXiv Detail & Related papers (2023-08-21T10:37:49Z)
- Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.