NumHG: A Dataset for Number-Focused Headline Generation
- URL: http://arxiv.org/abs/2309.01455v1
- Date: Mon, 4 Sep 2023 09:03:53 GMT
- Title: NumHG: A Dataset for Number-Focused Headline Generation
- Authors: Jian-Tao Huang, Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen
- Abstract summary: Headline generation, a key task in abstractive summarization, strives to condense a full-length article into a succinct, single line of text.
We introduce a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news articles for detailed investigation.
We evaluate five well-performing models from previous headline generation tasks using human evaluation in terms of numerical accuracy, reasonableness, and readability.
- Score: 28.57003500212883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Headline generation, a key task in abstractive summarization, strives to
condense a full-length article into a succinct, single line of text. Notably,
while contemporary encoder-decoder models excel based on the ROUGE metric, they
often falter when it comes to the precise generation of numerals in headlines.
We identify the lack of datasets providing fine-grained annotations for
accurate numeral generation as a major roadblock. To address this, we introduce
a new dataset, the NumHG, and provide over 27,000 annotated numeral-rich news
articles for detailed investigation. Further, we evaluate five well-performing
models from previous headline generation tasks using human evaluation in terms
of numerical accuracy, reasonableness, and readability. Our study reveals a
need for improvement in numerical accuracy, demonstrating the potential of the
NumHG dataset to drive progress in number-focused headline generation and
stimulate further discussions in numeral-focused text generation.
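The "numerical accuracy" criterion above can be pictured with a small automatic check: extract the numerals from the gold headline and count how many reappear in the generated one. The regex, the exact-string matching rule, and the example below are illustrative assumptions, not the evaluation protocol defined in the NumHG paper.
```python
# Minimal sketch of a numeral-accuracy check for generated headlines.
# The regex and exact-match rule are assumptions for illustration only.
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def extract_numerals(text: str) -> list[str]:
    """Return every numeral string found in the text."""
    return NUM_RE.findall(text)

def numeral_accuracy(generated: str, reference: str) -> float:
    """Fraction of reference-headline numerals reproduced in the generation."""
    gold = extract_numerals(reference)
    if not gold:
        return 1.0  # no numerals to get right
    pred = extract_numerals(generated)
    return sum(1 for n in gold if n in pred) / len(gold)

print(numeral_accuracy("Sales up 12% to $3.4B",
                       "Sales rise 12 percent to $3.4 billion"))  # 1.0
```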
Related papers
- Teaching Large Language Models Number-Focused Headline Generation With Key Element Rationales [11.428237505896218]
Number-focused headline generation is a unique challenge for Large Language Models (LLMs).
We propose a novel chain-of-thought framework that uses rationales comprising key elements of the Topic, Entities, and Numerical reasoning (TEN) in news articles.
Our approach teaches the student LLM automatic generation of rationales with enhanced capability for numerical reasoning and topic-aligned numerical headline generation.
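One concrete way to picture the TEN framework is as a prompt that elicits the rationale before the headline. The template below is a hedged illustration; the paper's actual prompts and teacher-student distillation setup are not reproduced here.
```python
# Illustrative prompt in the spirit of TEN rationales (Topic, Entities,
# Numerical reasoning). The wording is an assumption, not the paper's prompt.
TEN_PROMPT = """Article:
{article}

Before writing the headline, produce a rationale with three parts:
- Topic: the main subject of the article.
- Entities: the key people, organizations, or objects involved.
- Numerical reasoning: which numbers matter and how any derived value
  (sum, difference, rounding) is computed from them.

Then write one headline that uses the correct number.
Rationale:"""

def build_prompt(article: str) -> str:
    """Fill the TEN-style template with a news article."""
    return TEN_PROMPT.format(article=article)
```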
arXiv Detail & Related papers (2025-02-05T12:39:07Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
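As a rough mental model, generation-by-retrieval scores candidate phrases from the supporting documents against the current context and emits the best one. The hash-seeded "encoder" below is a toy stand-in for a trained context/phrase encoder, not the paper's model.
```python
# Toy sketch of phrase retrieval as a generation step.
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)  # toy encoder
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def next_phrase(context: str, candidates: list[str]) -> str:
    """Pick the candidate phrase whose embedding best matches the context."""
    ctx = embed(context)
    scores = [float(ctx @ embed(p)) for p in candidates]
    return candidates[int(np.argmax(scores))]

docs = ["the stock rose 12 percent", "earnings beat expectations"]
print(next_phrase("Quarterly report:", docs))
```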
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
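One common way to realize precision and recall for generative models is via nearest-neighbor support estimates in an embedding space: precision asks whether generations land near real data, recall asks whether real data is covered by generations. The sketch below follows that style and may differ from the paper's exact estimator.
```python
# Embedding-space precision/recall sketch (nearest-neighbor support
# estimates). The estimator details are assumptions, not the paper's.
import numpy as np

def knn_radius(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point in X to its k-th nearest neighbor in X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself

def coverage(A: np.ndarray, B: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of points in A inside some k-NN ball centered on B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

real = np.random.randn(200, 16)  # embeddings of human-written text
fake = np.random.randn(200, 16)  # embeddings of model generations
precision = coverage(fake, real, knn_radius(real))  # quality of samples
recall = coverage(real, fake, knn_radius(fake))     # diversity of samples
print(precision, recall)
```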
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references to enhance factuality, but they often struggle when irrelevant references get mixed into the context.
We present DKGen, which divides text generation into an iterative process.
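The iterative scheme can be pictured as a loop that re-selects references before each generation step, so irrelevant passages drop out as the draft grows. Both helpers below, `rank_references` and `generate_sentence`, are hypothetical stand-ins for a retriever and a conditional language model; this is not the authors' implementation.
```python
# Hedged sketch of iterative, knowledge-selecting generation in the
# spirit of DKGen.
def rank_references(draft: str, references: list[str], top_k: int = 2) -> list[str]:
    """Toy relevance: prefer references sharing words with the draft so far."""
    overlap = lambda r: len(set(draft.lower().split()) & set(r.lower().split()))
    return sorted(references, key=overlap, reverse=True)[:top_k]

def generate_sentence(draft: str, refs: list[str]) -> str:
    """Placeholder for a conditional LM call; here we just quote a reference."""
    return refs[0]

def iterative_generate(prompt: str, references: list[str], steps: int = 3) -> str:
    draft = prompt
    for _ in range(steps):
        refs = rank_references(draft, references)  # re-select each step
        draft += " " + generate_sentence(draft, refs)
    return draft
```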
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning [23.274139396706264]
Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on modern and historical manuscripts.
However, these models struggle to match that performance on manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting.
In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model.
We give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as few as five real fine-tuning lines.
arXiv Detail & Related papers (2023-05-04T07:00:28Z)
- Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description of an entity given a set of guiding keys and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
arXiv Detail & Related papers (2022-12-04T23:59:41Z)
- NumGPT: Improving Numeracy Ability of Generative Pre-trained Models [59.931394234642816]
We propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts.
Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number.
A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT.
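The mantissa/exponent decomposition can be made concrete with a small module: each numeral x is written as m x 10^e, the mantissa m is softly assigned to learnable prototypes, and the integer exponent e indexes its own embedding table. The dimensions, ranges, and softmax assignment below are illustrative assumptions, not NumGPT's published architecture.
```python
# Sketch of a prototype-based numeral embedding in the spirit of NumGPT.
import math
import torch
import torch.nn as nn

class NumeralEmbedding(nn.Module):
    def __init__(self, dim: int = 64, n_prototypes: int = 10,
                 min_exp: int = -8, max_exp: int = 8):
        super().__init__()
        self.min_exp = min_exp
        self.exp_emb = nn.Embedding(max_exp - min_exp + 1, dim)  # one per power of ten
        self.protos = nn.Parameter(torch.linspace(1.0, 10.0, n_prototypes))
        self.proto_emb = nn.Embedding(n_prototypes, dim)

    def forward(self, x: float) -> torch.Tensor:
        # Assumes x != 0 and the exponent fits the table; real use would
        # need a zero token and clamping.
        exp = int(math.floor(math.log10(abs(x))))
        mantissa = abs(x) / 10 ** exp          # in [1, 10)
        w = torch.softmax(-(self.protos - mantissa) ** 2, dim=0)  # soft assignment
        m_vec = w @ self.proto_emb.weight      # prototype mixture for mantissa
        e_vec = self.exp_emb(torch.tensor(exp - self.min_exp))
        return m_vec + e_vec

emb = NumeralEmbedding()
print(emb(3.4e3).shape)  # torch.Size([64])
```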
arXiv Detail & Related papers (2021-09-07T15:06:12Z)
- Introducing a new high-resolution handwritten digits data set with writer characteristics [0.0]
We introduce a new handwritten digit data set that we collected.
It contains high-resolution images of handwritten digits together with various writer characteristics.
The multiple writer characteristics we gathered are a novelty of our data set and create new research opportunities.
arXiv Detail & Related papers (2020-11-04T18:18:43Z)