Related papers: SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation

SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation

URL: http://arxiv.org/abs/2405.10650v8
Date: Mon, 15 Jul 2024 11:47:29 GMT
Title: SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation
Authors: Ziyao Xu, Houfeng Wang,
Abstract summary: We propose SPOR, a comprehensive and practical evaluation method for compositional generalization in data-to-text generation. We demonstrate SPOR on two different datasets and evaluate some existing language models including LLMs.
Score: 21.68354181391989
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compositional generalization is an important ability of language models and has many different manifestations. For data-to-text generation, previous research on this ability is limited to a single manifestation called Systematicity and lacks consideration of large language models (LLMs), which cannot fully cover practical application scenarios. In this work, we propose SPOR, a comprehensive and practical evaluation method for compositional generalization in data-to-text generation. SPOR includes four aspects of manifestations (Systematicity, Productivity, Order invariance, and Rule learnability) and allows high-quality evaluation without additional manual annotations based on existing datasets. We demonstrate SPOR on two different datasets and evaluate some existing language models including LLMs. We find that the models are deficient in various aspects of the evaluation and need further improvement. Our work shows the necessity for comprehensive research on different manifestations of compositional generalization in data-to-text generation and provides a framework for evaluation.

Related papers

Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation [56.82274763974443]
ICAT is an evaluation framework for measuring coverage of diverse factual information in long-form text generation. It computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. Our framework offers interpretable and fine-grained analysis of diversity and coverage.
arXiv Detail & Related papers (2025-01-07T05:43:23Z)
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models [33.488331159912136]
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. We present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs.
arXiv Detail & Related papers (2024-08-04T16:50:07Z)
Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored. Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges. We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization. We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
Improving Compositional Generalization with Self-Training for Data-to-Text Generation [36.973617793800315]
We study the compositional generalization of current generation models in data-to-text tasks. By simulating structural shifts in the compositional Weather dataset, we show that T5 models fail to generalize to unseen structures. We propose an approach based on self-training using finetuned BLEURT for pseudo-response selection.
arXiv Detail & Related papers (2021-10-16T04:26:56Z)
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets [17.13484629172643]
We develop a framework to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. We propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
arXiv Detail & Related papers (2021-06-16T18:20:58Z)
TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint) It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis. TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.