Cross-Register Projection for Headline Part of Speech Tagging
- URL: http://arxiv.org/abs/2109.07483v1
- Date: Wed, 15 Sep 2021 18:00:02 GMT
- Title: Cross-Register Projection for Headline Part of Speech Tagging
- Authors: Adrian Benton, Hanyang Li, Igor Malioutov
- Abstract summary: We train a multi-domain POS tagger on both long-form and headline text.
We show that our model yields a 23% relative error reduction per token and 19% per headline.
We make POSH, the POS-tagged Headline corpus, available to encourage research in improved NLP models for news headlines.
- Score: 3.5455943749695034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Part of speech (POS) tagging is a familiar NLP task. State of the art taggers
routinely achieve token-level accuracies of over 97% on news body text,
evidence that the problem is well understood. However, the register of English
news headlines, "headlinese", is very different from the register of long-form
text, causing POS tagging models to underperform on headlines. In this work, we
automatically annotate news headlines with POS tags by projecting predicted
tags from corresponding sentences in news bodies. We train a multi-domain POS
tagger on both long-form and headline text and show that joint training on both
registers improves over training on just one or naively concatenating training
sets. We evaluate on a newly-annotated corpus of over 5,248 English news
headlines from the Google sentence compression corpus, and show that our model
yields a 23% relative error reduction per token and 19% per headline. In
addition, we demonstrate that better headline POS tags can improve the
performance of a syntax-based open information extraction system. We make POSH,
the POS-tagged Headline corpus, available to encourage research in improved NLP
models for news headlines.
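The projection idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how headline tokens are aligned to body-sentence tokens, so the version below assumes a simple left-to-right, case-insensitive exact match; the function name `project_tags`, the example sentence, and the Universal Dependencies-style tags are all hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

TaggedToken = Tuple[str, str]


def project_tags(
    headline: List[str], body_tagged: List[TaggedToken]
) -> List[Tuple[str, Optional[str]]]:
    """Copy predicted tags from a body sentence onto headline tokens.

    Assumption (not from the paper): tokens are aligned greedily, left to
    right, by case-insensitive exact string match. Headline tokens with no
    match in the body sentence are left untagged (None).
    """
    projected: List[Tuple[str, Optional[str]]] = []
    cursor = 0  # index of the first body token not yet consumed by a match
    for token in headline:
        tag: Optional[str] = None
        for i in range(cursor, len(body_tagged)):
            body_token, body_tag = body_tagged[i]
            if body_token.lower() == token.lower():
                tag = body_tag
                cursor = i + 1
                break
        projected.append((token, tag))
    return projected


# Hypothetical headline and its corresponding (already tagged) body sentence.
headline = ["Mayor", "opens", "new", "bridge"]
body_tagged = [
    ("The", "DET"), ("mayor", "NOUN"), ("opened", "VERB"), ("a", "DET"),
    ("new", "ADJ"), ("bridge", "NOUN"), ("on", "ADP"), ("Monday", "PROPN"),
]

result = project_tags(headline, body_tagged)
print(result)
```

In this toy case "opens" receives no tag because the body sentence uses "opened"; a real system would need some way to handle such tense and wording differences between the two registers.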
Related papers
- Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial Persian Part of Speech Tagging [0.9843385481559193]
This paper introduces a novel corpus, "Colloquial Persian POS" (CPPOS), specifically designed to support colloquial Persian text.
The corpus includes formal and informal text collected from various domains such as political, social, and commercial on Telegram, Twitter, and Instagram.
arXiv Detail & Related papers (2023-10-01T05:06:33Z)
- Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets new states-of-the-art for few-shot fake news detection performance by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z)
- Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z)
- Weakly Supervised Headline Dependency Parsing [20.246696104447985]
English news headlines form a register with unique syntactic properties that have been documented in literature since the 1930s.
We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies syntactic dependency trees.
arXiv Detail & Related papers (2023-01-25T01:00:16Z)
- Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging [0.44798341036073835]
Part-of-Speech (POS) tagging is an important component of the NLP pipeline.
Many low-resource languages lack labeled data for training.
We propose a novel method for transferring labels from multiple high-resource source to low-resource target languages.
arXiv Detail & Related papers (2022-10-18T13:26:09Z)
- Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis [64.70116276295609]
SentiWSP is a Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction [21.67895423776014]
We consider POS tagging within the framework of set-valued prediction.
We find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings.
arXiv Detail & Related papers (2020-08-04T07:21:36Z)
- Adversarial Transfer Learning for Punctuation Restoration [58.2201356693101]
Adversarial multi-task learning is introduced to learn task invariant knowledge for punctuation prediction.
Experiments are conducted on IWSLT2011 datasets.
arXiv Detail & Related papers (2020-04-01T06:19:56Z)
- Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [22.93722845643562]
We show that POS tagging can still significantly improve parsing performance when using the Stack joint framework.
Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data.
arXiv Detail & Related papers (2020-03-06T13:47:30Z)