Evaluation Metrics of Language Generation Models for Synthetic Traffic
Generation Tasks
- URL: http://arxiv.org/abs/2311.12534v1
- Date: Tue, 21 Nov 2023 11:26:26 GMT
- Title: Evaluation Metrics of Language Generation Models for Synthetic Traffic
Generation Tasks
- Authors: Simone Filice, Jason Ingyu Choi, Giuseppe Castellucci, Eugene
Agichtein, Oleg Rokhlenko
- Abstract summary: We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
- Score: 22.629816738693254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many Natural Language Generation (NLG) tasks aim to generate a single output
text given an input prompt. Other settings require the generation of multiple
texts, e.g., for Synthetic Traffic Generation (STG). This generation task is
crucial for training and evaluating QA systems as well as conversational
agents, where the goal is to generate multiple questions or utterances
resembling the linguistic variability of real users. In this paper, we show
that common NLG metrics, like BLEU, are not suitable for evaluating STG. We
propose and evaluate several metrics designed to compare the generated traffic
to the distribution of real user texts. We validate our metrics with an
automatic procedure to verify whether they capture different types of quality
issues of generated data; we also run human annotations to verify the
correlation with human judgements. Experiments on three tasks, i.e., Shopping
Utterance Generation, Product Question Generation and Query Auto Completion,
demonstrate that our metrics are effective for evaluating STG tasks, and
improve agreement with human judgement by up to 20% compared to common NLG
metrics. We believe these findings can pave the way towards better solutions
for estimating the representativeness of synthetic text data.
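The abstract does not spell out the proposed metrics, but the core idea of comparing a whole set of generated traffic against the distribution of real user texts, rather than scoring each output against a reference as BLEU does, can be illustrated with a minimal sketch. The example below is an assumption-laden illustration, not the paper's method: it compares unigram frequency distributions of a generated set and a real set via Jensen-Shannon divergence, using tiny made-up example texts.

```python
from collections import Counter
from math import log2

def unigram_dist(texts):
    """Unigram frequency distribution over a collection of texts."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two unigram distributions (0 = identical)."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a, b):
        return sum(a[t] * log2(a[t] / b[t]) for t in a if a.get(t, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up example traffic; in practice these would be real user utterances
# and the output of an utterance/question generator.
real_traffic = ["play some jazz music", "play relaxing jazz", "turn up the volume"]
generated_traffic = ["play some jazz music", "play some jazz music", "play jazz"]

score = js_divergence(unigram_dist(real_traffic), unigram_dist(generated_traffic))
print(f"JS divergence (lower = generated set closer to real distribution): {score:.3f}")
```

In this toy case a text-level metric such as BLEU would reward each generated utterance for overlapping with a reference, while a distribution-level comparison also penalizes the generated set for failing to cover the variability of the real traffic.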
Related papers
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
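As a generic illustration of this kind of distributional precision/recall evaluation (precision as quality, recall as diversity), and not the exact framework of the paper above, the sketch below estimates both with k-nearest-neighbour support sets in an embedding space. The random vectors are placeholders for sentence embeddings of real and generated texts.

```python
import numpy as np

def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)                 # column 0 is the point itself (distance 0)
    return d[:, k]

def coverage(queries, support, radii):
    """Fraction of query points lying inside at least one support ball."""
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real_emb = rng.normal(size=(200, 16))        # placeholder for embeddings of real texts
gen_emb = 0.5 * rng.normal(size=(200, 16))   # placeholder for embeddings of generated texts

k = 3
precision = coverage(gen_emb, real_emb, knn_radius(real_emb, k))  # are generations realistic?
recall = coverage(real_emb, gen_emb, knn_radius(gen_emb, k))      # do they cover real variety?
print(f"precision={precision:.2f}  recall={recall:.2f}")
```

High precision with low recall would flag generated traffic that looks realistic but is under-diverse, a failure mode that text-level metrics such as BLEU do not surface.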
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist [20.448405494617397]
Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks.
Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as a training objective.
We show that automatic metrics provide better guidance than humans for discriminating system-level performance in Text Summarization and Controlled Generation tasks.
arXiv Detail & Related papers (2023-05-15T11:51:55Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
- Data-QuestEval: A Referenceless Metric for Data to Text Semantic Evaluation [33.672301484161416]
QuestEval is a metric that compares predictions directly to structured input data by automatically asking and answering questions.
We build synthetic multi-modal corpora that enable training multi-modal QG/QA systems.
The resulting metric is reference-less and multi-modal; it obtains state-of-the-art correlations with human judgement on the E2E and WebNLG benchmarks.
arXiv Detail & Related papers (2021-04-15T16:10:46Z)
- GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z)
- How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics [47.20761880464552]
Generative dialogue modeling is widely seen as a language modeling task.
The task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user.
The automatic metrics used evaluate the quality of the generated text as a proxy for the agent's holistic interaction.
arXiv Detail & Related papers (2020-08-24T13:28:35Z)