Which is better? Exploring Prompting Strategy For LLM-based Metrics
- URL: http://arxiv.org/abs/2311.03754v1
- Date: Tue, 7 Nov 2023 06:36:39 GMT
- Title: Which is better? Exploring Prompting Strategy For LLM-based Metrics
- Authors: Joonghoon Kim, Saeran Park, Kiyoon Jeong, Sangmin Lee, Seung Hun Han,
Jiyoon Lee, Pilsung Kang
- Abstract summary: This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task.
Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks.
- Score: 6.681126871165601
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper describes the DSBA submissions to the Prompting Large Language
Models as Explainable Metrics shared task, where systems were submitted to two
tracks: the small and the large summarization track. With advanced Large Language
Models (LLMs) such as GPT-4, evaluating the quality of Natural Language
Generation (NLG) has become increasingly important. Traditional
similarity-based metrics such as BLEU and ROUGE have been shown to misalign
with human evaluation and are ill-suited for open-ended generation tasks. To
address this issue, we explore the potential of LLM-based metrics, especially
those leveraging open-source LLMs. In this study, a wide range of prompts and
prompting techniques is systematically analyzed along three dimensions:
prompting strategy, score aggregation, and explainability. Our research focuses
on formulating effective prompt templates, determining the granularity of NLG
quality scores, and assessing the impact of in-context examples on LLM-based
evaluation. Furthermore, three aggregation strategies are compared to identify
the most reliable method for aggregating NLG quality scores. To examine
explainability, we devise a strategy that generates rationales for the scores
and analyze the characteristics of the explanations produced by open-source
LLMs. Extensive experiments provide insights into the evaluation capabilities
of open-source LLMs and suggest effective prompting strategies.
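
To make the setup described in the abstract concrete, the sketch below shows the general shape of an LLM-as-judge pipeline: a prompt template asks for a rationale plus a 1-5 score, several samples are drawn, and three candidate aggregation strategies are compared. The template, the score scale, the `query_llm` stub, and the particular aggregation rules (mean, majority vote, confidence-weighted average) are illustrative assumptions, not the specific choices evaluated in the paper.

```python
# Minimal sketch of an LLM-as-judge evaluator with score aggregation.
# The prompt template, the 1-5 scale, and the three aggregation strategies
# are illustrative assumptions; `query_llm` is a hypothetical stand-in for
# sampling an open-source LLM and returns canned outputs so the script runs.
import re
import random
from collections import Counter

PROMPT_TEMPLATE = (
    "You are evaluating a summary of a source document for overall quality.\n"
    "Source: {source}\n"
    "Summary: {summary}\n"
    "Give a one-sentence rationale, then output 'Score: X' where X is an integer from 1 to 5."
)

def query_llm(prompt: str, n_samples: int = 5):
    """Hypothetical LLM call; returns (generated text, confidence) samples."""
    canned = [
        ("The summary covers the main facts accurately. Score: 4", 0.62),
        ("Minor omissions, but faithful to the source. Score: 4", 0.55),
        ("Contains one unsupported claim. Score: 3", 0.48),
        ("Mostly consistent with the source. Score: 4", 0.60),
        ("Faithful and complete. Score: 5", 0.41),
    ]
    return random.sample(canned, k=min(n_samples, len(canned)))

def parse_score(text: str):
    """Extract the integer score from the model's free-form output."""
    match = re.search(r"Score:\s*([1-5])", text)
    return int(match.group(1)) if match else None

def aggregate(samples):
    """Compare three candidate strategies for aggregating sampled judgments."""
    scored = [(parse_score(text), conf) for text, conf in samples]
    scored = [(s, c) for s, c in scored if s is not None]
    scores = [s for s, _ in scored]
    mean_score = sum(scores) / len(scores)                                # strategy 1: plain average
    majority = Counter(scores).most_common(1)[0][0]                       # strategy 2: majority vote
    weighted = sum(s * c for s, c in scored) / sum(c for _, c in scored)  # strategy 3: confidence-weighted
    return {"mean": round(mean_score, 2), "majority": majority, "weighted": round(weighted, 2)}

if __name__ == "__main__":
    prompt = PROMPT_TEMPLATE.format(source="<source document>", summary="<candidate summary>")
    print(aggregate(query_llm(prompt)))
```

In practice, the choice among such aggregation rules is exactly the kind of comparison the paper reports; the stub here only illustrates how the three rules differ mechanically.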
Related papers
- Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method, ProbDiff, for assessing the efficacy of various Large Language Models (LLMs).
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions.
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4 (see the sketch after this list).
arXiv Detail & Related papers (2024-05-17T03:50:28Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction-controllable text summarization.
Our study reveals that instruction-controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims to extract structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
- A Survey on Large Language Models for Recommendation [77.91673633328148]
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP).
This survey presents a taxonomy that categorizes these models into two major paradigms: Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec).
arXiv Detail & Related papers (2023-05-31T13:51:26Z)
- Information Extraction in Low-Resource Scenarios: Survey and Perspective [56.5556523013924]
Information Extraction seeks to derive structured information from unstructured texts.
This paper presents a review of neural approaches to low-resource IE from traditional and LLM-based perspectives.
arXiv Detail & Related papers (2022-02-16T13:44:00Z)
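
As referenced in the ProbDiff entry above, here is a rough sketch of the probability-discrepancy idea: the model being evaluated assigns likelihoods to its own initial response and to a revised version, and the gap is used as a self-evaluation signal. The `log_prob` helper and the toy numbers are hypothetical stand-ins, and the scoring rule is only a sketch of the general idea, not the method as published.

```python
# Sketch of a probability-discrepancy self-evaluation signal in the spirit of
# ProbDiff: the model under test scores its own initial response against a
# revised version, and the gap in log-likelihood serves as an evaluation
# signal. `log_prob` and the toy lookup table are made up for illustration.

def log_prob(model: dict, prompt: str, response: str) -> float:
    """Hypothetical: total log-probability the model assigns to `response` given `prompt`."""
    return model.get((prompt, response), -50.0)

def prob_discrepancy(model: dict, prompt: str, initial: str, revised: str) -> float:
    """A positive gap means the model prefers the revision, hinting the initial answer was weaker."""
    return log_prob(model, prompt, revised) - log_prob(model, prompt, initial)

if __name__ == "__main__":
    # Toy lookup table standing in for a real model's scoring interface.
    toy_model = {
        ("Q: What is the capital of France?", "Paris."): -2.1,
        ("Q: What is the capital of France?", "Paris is the capital of France."): -3.0,
    }
    gap = prob_discrepancy(
        toy_model,
        "Q: What is the capital of France?",
        "Paris.",
        "Paris is the capital of France.",
    )
    print(f"probability discrepancy (log scale): {gap:.2f}")
```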
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.