Zero-shot Conversational Summarization Evaluations with small Large
Language Models
- URL: http://arxiv.org/abs/2311.18041v1
- Date: Wed, 29 Nov 2023 19:34:34 GMT
- Title: Zero-shot Conversational Summarization Evaluations with small Large
Language Models
- Authors: Ramesh Manuvinakurike, Saurav Sahay, Sangeeta Manepalli, Lama Nachman
- Abstract summary: Large Language Models (LLMs) exhibit powerful summarization abilities.
We evaluate LLMs on conversational summarization and showcase their performance on various prompts.
We also evaluate the models with human evaluations and discuss the limitations of the models on conversational summarization.
- Score: 7.525771026977357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit powerful summarization abilities.
However, their capabilities on conversational summarization remains under
explored. In this work we evaluate LLMs (approx. 10 billion parameters) on
conversational summarization and showcase their performance on various prompts.
We show that the summaries generated by models depend on the instructions and
the performance of LLMs vary with different instructions sometimes resulting
steep drop in ROUGE scores if prompts are not selected carefully. We also
evaluate the models with human evaluations and discuss the limitations of the
models on conversational summarization
Related papers
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Information-Theoretic Distillation for Reference-less Summarization [67.51150817011617]
We present a novel framework to distill a powerful summarizer based on the information-theoretic objective for summarization.
We start off from Pythia-2.8B as the teacher model, which is not yet capable of summarization.
We arrive at a compact but powerful summarizer with only 568M parameters that performs competitively against ChatGPT.
arXiv Detail & Related papers (2024-03-20T17:42:08Z) - SemScore: Automated Evaluation of Instruction-Tuned LLMs based on
Semantic Textual Similarity [3.3162484539136416]
We propose a simple but remarkably effective evaluation metric called SemScore.
We compare model outputs to gold target responses using semantic textual similarity (STS)
We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation.
arXiv Detail & Related papers (2024-01-30T14:52:50Z) - Exploring Prompting Large Language Models as Explainable Metrics [0.0]
We propose a zero-shot prompt-based strategy for explainable evaluation of the summarization task using Large Language Models (LLMs)
The conducted experiments demonstrate the promising potential of LLMs as evaluation metrics in Natural Language Processing (NLP)
The performance of our best provided prompts achieved a Kendall correlation of 0.477 with human evaluations in the text summarization task on the test data.
arXiv Detail & Related papers (2023-11-20T06:06:22Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Revisiting Large Language Models as Zero-shot Relation Extractors [8.953462875381888]
Relation extraction (RE) consistently involves a certain degree of labeled or unlabeled data even if under zero-shot setting.
Recent studies have shown that large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt.
This work focuses on the study of exploring LLMs as zero-shot relation extractors.
arXiv Detail & Related papers (2023-10-08T06:17:39Z) - Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs)
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.