Can LMs Generalize to Future Data? An Empirical Analysis on Text
Summarization
- URL: http://arxiv.org/abs/2305.01951v3
- Date: Thu, 2 Nov 2023 12:07:48 GMT
- Title: Can LMs Generalize to Future Data? An Empirical Analysis on Text
Summarization
- Authors: Chi Seng Cheang, Hou Pong Chan, Derek F. Wong, Xuebo Liu, Zhaocong Li,
Yanming Sun, Shudong Liu, Lidia S. Chao
- Abstract summary: Recent pre-trained language models (PLMs) achieve promising results on existing abstractive summarization datasets.
Existing summarization benchmarks overlap in time with the standard pre-training corpora and fine-tuning datasets.
We show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data.
- Score: 50.20034493626049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent pre-trained language models (PLMs) achieve promising results on
existing abstractive summarization datasets. However, existing summarization
benchmarks overlap in time with the standard pre-training corpora and
fine-tuning datasets. Hence, the strong performance of PLMs may rely on the
parametric knowledge that is memorized during pre-training and fine-tuning.
Moreover, the knowledge memorized by PLMs may quickly become outdated, which
affects the generalization performance of PLMs on future data. In this work, we
propose TempoSum, a novel benchmark that contains data samples from 2010 to
2022, to understand the temporal generalization ability of abstractive
summarization models. Through extensive human evaluation, we show that
parametric knowledge stored in summarization models significantly affects the
faithfulness of the generated summaries on future data. Moreover, existing
faithfulness enhancement methods cannot reliably improve the faithfulness of
summarization models on future data. Finally, we offer several
recommendations to the research community on how to evaluate and improve the
temporal generalization capability of text summarization models.
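To make the evaluation setup concrete, below is a minimal sketch of the kind of temporal-split evaluation TempoSum enables; the record format, cutoff handling, and function names are illustrative assumptions, not the benchmark's actual interface.

```python
from datetime import date

# Assumed record format: (publication_date, article, reference_summary).
Sample = tuple[date, str, str]

def temporal_split(samples: list[Sample], cutoff: date):
    # Articles published before the cutoff may overlap with the model's
    # pre-training/fine-tuning corpora; articles after it approximate
    # "future" data the model cannot have memorized.
    past = [s for s in samples if s[0] < cutoff]
    future = [s for s in samples if s[0] >= cutoff]
    return past, future

def mean_score(summarize, samples: list[Sample], score) -> float:
    # Average a quality or faithfulness score over one split; comparing
    # the two splits exposes the temporal generalization gap.
    return sum(score(summarize(article), reference)
               for _, article, reference in samples) / len(samples)
```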
Related papers
- Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting.
Most existing methods rely on a centralized training paradigm, where large amounts of data are transferred from distributed devices to a central cloud server.
We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z)
- Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle [13.192628306219248]
We propose using future event prediction as a continuous evaluation method to assess Large Language Models' temporal generalization abilities.
Our benchmark, Daily Oracle, automatically generates question-answer pairs from daily news, challenging LLMs to predict "future" event outcomes.
arXiv Detail & Related papers (2024-11-13T04:20:20Z)
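As a rough illustration of this cutoff-aware evaluation idea, here is a minimal sketch; the record format, function names, and scoring are assumptions for illustration, not the Daily Oracle pipeline itself.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ForecastQA:
    question: str        # e.g. "Will <event from a news article> happen by <date>?"
    answer: str          # resolved outcome, e.g. "yes" or "no"
    resolution_date: date

def future_questions(pairs: list[ForecastQA], training_cutoff: date) -> list[ForecastQA]:
    # Keep only questions that resolved after the model's training cutoff,
    # so answering correctly requires forecasting rather than recall.
    return [qa for qa in pairs if qa.resolution_date > training_cutoff]

def forecast_accuracy(ask_model, pairs: list[ForecastQA]) -> float:
    # Fraction of future events whose outcome the model predicts correctly.
    hits = sum(ask_model(qa.question).strip().lower() == qa.answer for qa in pairs)
    return hits / len(pairs)
```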
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate the surprisingly strong performance of LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Model-based Preference Optimization in Abstractive Summarization without Human Feedback [5.438770095369458]
We introduce Model-based Preference Optimization (MPO) to fine-tune Large Language Models for improved summarization abilities without any human feedback.
Our experiments on standard summarization datasets and various metrics demonstrate that our proposed MPO significantly enhances the quality of generated summaries without relying on human feedback.
arXiv Detail & Related papers (2024-09-27T10:35:45Z)
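The abstract does not spell out the training objective; as one concrete reference point, here is the standard direct-preference-optimization (DPO) loss that human-feedback-free preference methods of this kind typically build on. Treating this as MPO's exact recipe would be an assumption, and in particular how "chosen"/"rejected" summary pairs are constructed is model-specific.

```python
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # logp_* : summed token log-probabilities of the preferred ("chosen")
    # and dispreferred ("rejected") summaries under the policy model;
    # ref_logp_* : the same quantities under a frozen reference model.
    # No human labels are used: the chosen/rejected pairing comes from the
    # model itself (assumed here, e.g. by contrasting decoding strategies).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```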
- A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
- Benchmarking Benchmark Leakage in Large Language Models [24.015208839742343]
We introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark.
We reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons.
We propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization.
arXiv Detail & Related papers (2024-04-29T16:05:36Z)
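A minimal sketch of one plausible reading of these two leakage signals; the paper's exact metric definitions may differ, and `predict_next` stands in for an assumed greedy-decoding call.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # Perplexity from the per-token log-probabilities a model assigns to a
    # benchmark sample; abnormally low values on test data suggest leakage.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ngram_accuracy(predict_next, tokens: list[str], n: int = 5) -> float:
    # For every n-gram in the sample, ask the model to complete the final
    # token given the preceding n-1; near-perfect completion suggests the
    # sample was memorized during training.
    spans = range(len(tokens) - n + 1)
    hits = sum(predict_next(tokens[i:i + n - 1]) == tokens[i + n - 1] for i in spans)
    return hits / len(spans) if spans else 0.0
```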
- Continual Learning with Pre-Trained Models: A Survey [61.97613090666247]
Continual Learning (CL) aims to overcome catastrophic forgetting of previously acquired knowledge when learning new tasks.
This paper presents a comprehensive survey of the latest advancements in pre-trained model (PTM)-based CL.
arXiv Detail & Related papers (2024-01-29T18:27:52Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and discuss the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
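A minimal sketch of the two MI signals as described, assuming a generic text-similarity function `sim` (e.g. ROUGE-L) and a perturbation function; the names are illustrative, not the paper's API.

```python
def similarity_signal(model_summary: str, reference_summary: str, sim) -> float:
    # Text-similarity signal: summaries of documents seen in training tend
    # to match their reference summaries more closely than summaries of
    # unseen documents.
    return sim(model_summary, reference_summary)

def robustness_signal(summarize, document: str, perturb, sim, k: int = 5) -> float:
    # Resistance-to-modification signal: if small input edits barely change
    # the output, the model may be reproducing a memorized member summary
    # rather than summarizing the (modified) input.
    base = summarize(document)
    return sum(sim(summarize(perturb(document)), base) for _ in range(k)) / k
```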
- Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data.
Our results indicate that learning summary features from data can compete with and even outperform likelihood-free inference (LFI) methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent LM, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)