Generating Representative Headlines for News Stories
- URL: http://arxiv.org/abs/2001.09386v4
- Date: Mon, 13 Apr 2020 21:47:52 GMT
- Title: Generating Representative Headlines for News Stories
- Authors: Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, Hongkun Yu, You Wu,
Cong Yu, Daniel Finnie, Jiaqi Zhai, Nicholas Zukoski
- Abstract summary: Grouping articles that are reporting the same event into news stories is a common way of assisting readers in their news consumption.
It remains a challenging research problem to efficiently and effectively generate a representative headline for each story.
We develop a distant supervision approach to train large-scale generation models without any human annotation.
- Score: 31.67864779497127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Millions of news articles are published online every day, which can be
overwhelming for readers to follow. Grouping articles that are reporting the
same event into news stories is a common way of assisting readers in their news
consumption. However, it remains a challenging research problem to efficiently
and effectively generate a representative headline for each story. Automatic
summarization of a document set has been studied for decades, while few studies
have focused on generating representative headlines for a set of articles.
Unlike summaries, which aim to capture most information with least redundancy,
headlines aim to capture information jointly shared by the story articles in
short length, and exclude information that is too specific to each individual
article. In this work, we study the problem of generating representative
headlines for news stories. We develop a distant supervision approach to train
large-scale generation models without any human annotation. This approach
centers on two technical components. First, we propose a multi-level
pre-training framework that incorporates massive unlabeled corpora with a
different quality-vs.-quantity balance at each level. We show that models
trained within this framework outperform those trained on a purely human-curated
corpus. Second, we propose a novel self-voting-based article attention layer to
extract salient information shared by multiple articles. We show that models
that incorporate this layer are robust to potential noise in news stories and
outperform existing baselines with or without noise. We can further enhance
our model by incorporating human labels, and we show our distant supervision
approach significantly reduces the demand on labeled data.
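The abstract names a self-voting-based article attention layer but does not spell out its computation. The following is only a minimal sketch of one plausible reading, not the authors' implementation: each article is scored by how much the other articles in the story agree with it, so an off-topic article receives a low weight. The function names (`self_voting_weights`, `pool_story`), the dot-product similarity, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_voting_weights(article_vecs):
    """Hypothetical scoring: each article is voted on by all *other* articles.

    The vote for article i is its average dot-product similarity to the
    remaining articles; votes are normalized with a softmax.
    Requires at least two articles.
    """
    n = len(article_vecs)
    votes = []
    for i in range(n):
        others = [dot(article_vecs[i], article_vecs[j]) for j in range(n) if j != i]
        votes.append(sum(others) / len(others))
    return softmax(votes)


def pool_story(article_vecs):
    """Weighted average of article vectors using the self-voting weights."""
    weights = self_voting_weights(article_vecs)
    dim = len(article_vecs[0])
    return [sum(w * vec[d] for w, vec in zip(weights, article_vecs)) for d in range(dim)]


# Toy story: two articles that agree and one off-topic (noisy) article.
story = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],  # off-topic article
]
weights = self_voting_weights(story)
story_vec = pool_story(story)
```

In this toy story the third article gets the smallest weight because the other two articles do not "vote" for it, which is the intuition behind being robust to noisy articles in a story.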
Related papers
- A Novel Method for News Article Event-Based Embedding [8.183446952097528]
We propose a novel lightweight method that optimizes news embedding generation by focusing on entities and themes mentioned in articles.
We leveraged over 850,000 news articles and 1,000,000 events from the GDELT project to test and evaluate our method.
Our experiments demonstrate that our approach can both improve and outperform state-of-the-art methods on shared event detection tasks.
arXiv Detail & Related papers (2024-05-20T20:55:07Z)
- SCStory: Self-supervised and Continual Online Story Discovery [53.72745249384159]
SCStory helps people digest rapidly published news article streams in real-time without human annotations.
SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams.
arXiv Detail & Related papers (2023-11-27T04:50:01Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding [37.62597275581973]
Unsupervised discovery of stories with correlated news articles in real-time helps people digest massive news streams without expensive human annotations.
We propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories.
A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines.
arXiv Detail & Related papers (2023-04-08T20:41:15Z)
- NEWTS: A Corpus for News Topic-Focused Summarization [9.872518517174498]
This paper introduces the first topical summarization corpus, based on the well-known CNN/Dailymail dataset.
We evaluate a range of existing techniques and analyze the effectiveness of different prompting methods.
arXiv Detail & Related papers (2022-05-31T10:01:38Z)
- "Don't quote me on that": Finding Mixtures of Sources in News Articles [85.92467549469147]
We construct an ontological labeling system for sources based on each source's affiliation and role.
We build a probabilistic model to infer these attributes for named sources and to describe news articles as mixtures of these sources.
arXiv Detail & Related papers (2021-04-19T21:57:11Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles [63.32111010686954]
We propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO).
The main challenge in this task is to jointly model the temporal dependency of video with semantic meaning of article.
We propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and multimodal generator.
arXiv Detail & Related papers (2020-10-12T02:19:16Z)
- Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [57.9843300852526]
We introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions.
To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles.
In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies.
arXiv Detail & Related papers (2020-09-16T14:13:15Z)
- Zero-shot topic generation [10.609815608017065]
We present an approach to generating topics using a model trained only for document title generation.
We leverage features that capture the relevance of a candidate span in a document for the generation of a title for that document.
The output is a weighted collection of the phrases that are most relevant for describing the document and distinguishing it within a corpus.
arXiv Detail & Related papers (2020-04-29T04:39:28Z)
- BaitWatcher: A lightweight web interface for the detection of incongruent news headlines [27.29585619643952]
BaitWatcher is a lightweight web interface that guides readers in estimating the likelihood of incongruence in news articles before clicking on the headlines.
BaitWatcher utilizes a hierarchical recurrent encoder that efficiently learns complex textual representations of a news headline and its associated body text.
arXiv Detail & Related papers (2020-03-23T23:43:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.