Modeling "Newsworthiness" for Lead-Generation Across Corpora
- URL: http://arxiv.org/abs/2104.09653v1
- Date: Mon, 19 Apr 2021 21:48:15 GMT
- Authors: Alexander Spangher, Nanyun Peng, Jonathan May and Emilio Ferrara
- Abstract summary: We train models on automatically labeled corpora to predict whether each article was a front-page article, then rank documents in unlabeled corpora on "newsworthiness". A fine-tuned RoBERTa model achieves .93 AUC on held-out labeled documents, and .88 AUC on expert-validated unlabeled corpora.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Journalists obtain "leads", or story ideas, by reading large corpora of
government records: court cases, proposed bills, etc. However, only a small
percentage of such records are interesting documents. We propose a model of
"newsworthiness" aimed at surfacing interesting documents. We train models on
automatically labeled corpora -- published newspaper articles -- to predict
whether each article was a front-page article (i.e., newsworthy) or
not (i.e., less newsworthy). We transfer these models to unlabeled
corpora -- court cases, bills, city-council meeting minutes -- to rank
documents in these corpora on "newsworthiness". A fine-tuned RoBERTa model
achieves .93 AUC on held-out labeled documents, and .88 AUC on
expert-validated unlabeled corpora. We provide interpretation and visualization
for our models.
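The transfer setup the abstract describes can be sketched in a few lines: train a binary "front-page" classifier on labeled news, then rank an unlabeled corpus by the predicted newsworthiness probability. This is a minimal sketch only; a TF-IDF plus logistic-regression model stands in for the paper's fine-tuned RoBERTa, and all example texts are invented.

```python
# Sketch of the transfer-ranking setup: a classifier trained on a
# labeled news corpus scores documents in an unlabeled corpus.
# TF-IDF + logistic regression is a stand-in for fine-tuned RoBERTa.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Automatically labeled training corpus: front-page (1) vs. not (0).
train_texts = [
    "City hall corruption probe widens after leaked records",   # front page
    "Mayor indicted on federal bribery charges",                # front page
    "Weekly farmers market returns to Main Street",             # inside page
    "Local library extends weekend opening hours",              # inside page
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Transfer: rank an unlabeled corpus (e.g. council minutes) by the
# predicted probability of the "newsworthy" class.
unlabeled = [
    "Routine approval of sidewalk repair contract",
    "Audit finds missing funds in police department records",
]
scores = model.predict_proba(unlabeled)[:, 1]
ranked = sorted(zip(scores, unlabeled), reverse=True)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

In the paper's actual pipeline, the ranking scores on held-out and expert-validated data are evaluated with AUC; with a toy model like this, any probabilistic classifier's scores can be ranked and evaluated the same way.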
Related papers
- TLDR: Token-Level Detective Reward Model for Large Vision Language Models (2024-10-07)
  Existing reward models mimic human annotation by assigning a single binary feedback to an entire text.
  We propose a Token-Level Detective Reward (TLDR) model that provides fine-grained annotations for each text token.
- Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text (2024-09-03)
  Large language models can annotate documents without supervised training, an ability known as zero-shot learning.
  This paper introduces the Political DEBATE language models for zero-shot and few-shot classification of political documents.
  We release the PolNLI dataset used to train these models: a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
- Trustless Audits without Revealing Data or Models (2024-04-06)
  We show that model providers can keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties.
  We do this with a protocol called ZkAudit, in which model providers publish cryptographic commitments of datasets and model weights.
- Tracking the Newsworthiness of Public Documents (2023-11-16)
  This work focuses on news coverage of local public policy in the San Francisco Bay Area by the San Francisco Chronicle.
  First, we gather news articles, public policy documents, and meeting recordings and link them using probabilistic relational modeling.
  Second, we define a new task, newsworthiness prediction: predicting whether a policy item will get covered.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2023-06-01)
  We show that properly filtered and deduplicated web data alone can lead to powerful models.
  We release a 600-billion-token extract of our RefinedWeb dataset, and 1.3B/7.5B-parameter language models trained on it.
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media (2023-05-23)
  We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
  To study this task, we propose a data collection schema and curate a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
  Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
- Quantifying Political Bias in News Articles (2022-10-07)
  We aim to establish an automated model for evaluating ideological bias in online news articles.
  Current results show that the model is not yet capable enough to annotate the documents automatically.
- OpenFraming: We brought the ML; you bring the data. Interact with your data and discover its frames (2020-08-16)
  We introduce a Web-based system for analyzing and classifying frames in text documents.
  We provide both state-of-the-art pre-trained frame classification models on various issues and a user-friendly pipeline for training novel classification models.
  The code making up our system is open-sourced and well documented, making the system transparent and expandable.
- Zero-shot topic generation (2020-04-29)
  We present an approach to generating topics using a model trained only for document title generation.
  We leverage features that capture the relevance of a candidate span in a document to the generation of a title for that document.
  The output is a weighted collection of the phrases most relevant for describing the document and distinguishing it within a corpus.
- Generating Representative Headlines for News Stories (2020-01-26)
  Grouping articles that report the same event into news stories is a common way of assisting readers in their news consumption.
  Efficiently and effectively generating a representative headline for each story remains a challenging research problem.
  We develop a distant-supervision approach to train large-scale generation models without any human annotation.
This list is automatically generated from the titles and abstracts of the papers in this site.