UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
- URL: http://arxiv.org/abs/2105.14078v1
- Date: Fri, 28 May 2021 19:44:24 GMT
- Title: UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
- Authors: Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han,
Jingbo Shang
- Abstract summary: UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
- Score: 63.86606855524567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying and understanding quality phrases from context is a fundamental
task in text mining. The most challenging part of this task arguably lies in
uncommon, emerging, and domain-specific phrases. The infrequent nature of these
phrases significantly hurts the performance of phrase mining methods that rely
on sufficient phrase occurrences in the input corpus. Context-aware tagging
models, though not restricted by frequency, heavily rely on domain experts for
either massive sentence-level gold labels or handcrafted gazetteers. In this
work, we propose UCPhrase, a novel unsupervised context-aware quality phrase
tagger. Specifically, we induce high-quality phrase spans as silver labels from
consistently co-occurring word sequences within each document. Compared with
typical context-agnostic distant supervision based on existing knowledge bases
(KBs), our silver labels are rooted deeply in the input domain and context, thus
having unique advantages in preserving contextual completeness and capturing
emerging, out-of-KB phrases. Training a conventional neural tagger based on
silver labels usually risks overfitting to phrase surface names.
Alternatively, we observe that the contextualized attention maps generated from
a transformer-based neural language model effectively reveal the connections
between words in a surface-agnostic way. Therefore, we pair such attention maps
with the silver labels to train a lightweight span prediction model, which can
be applied to new input to recognize (unseen) quality phrases regardless of
their surface names or frequency. Thorough experiments on various tasks and
datasets, including corpus-level phrase ranking, document-level keyphrase
extraction, and sentence-level phrase tagging, demonstrate the superiority of
our design over state-of-the-art pre-trained, unsupervised, and distantly
supervised methods.
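The abstract describes two concrete steps: mining word sequences that repeatedly co-occur within a document as silver labels, and featurizing candidate spans with the language model's attention maps. The sketch below illustrates both steps; it is not the authors' released code. The Hugging Face transformers library, the roberta-base checkpoint, the whitespace tokenization, and the mining thresholds are assumptions made for illustration.

```python
from collections import Counter

import torch
from transformers import AutoModel, AutoTokenizer


def mine_silver_spans(doc_sentences, max_len=4, min_freq=2):
    """Collect word n-grams that repeat within a single document; these
    'consistently co-occurring word sequences' serve as silver labels.
    The length cap and frequency threshold are illustrative choices."""
    counts = Counter()
    for sent in doc_sentences:
        words = sent.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {gram for gram, freq in counts.items() if freq >= min_freq}


def span_attention_features(sentence, span, tokenizer, model):
    """Stack every layer's and head's attention weights restricted to a
    candidate token span: a surface-agnostic 'image' on which a small
    span classifier can be trained with the silver labels."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    attn = torch.stack(out.attentions).squeeze(1)  # (layers, heads, seq, seq)
    start, end = span  # token positions of the candidate phrase, end exclusive
    return attn[:, :, start:end, start:end]


if __name__ == "__main__":
    doc = [
        "Quality phrase mining finds informative phrases in raw text.",
        "Unsupervised quality phrase mining needs no human annotation.",
    ]
    print(mine_silver_spans(doc))  # e.g. ('quality', 'phrase', 'mining')

    tok = AutoTokenizer.from_pretrained("roberta-base")
    lm = AutoModel.from_pretrained("roberta-base")
    feats = span_attention_features(doc[0], (1, 4), tok, lm)
    print(feats.shape)  # torch.Size([12, 12, 3, 3]) for roberta-base
```

In the paper's pipeline, such silver spans and span features would be paired to train a lightweight span prediction model; the sketch stops at producing the two ingredients.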
Related papers
- Language Model as an Annotator: Unsupervised Context-aware Quality
Phrase Generation [20.195149109523314]
We propose LMPhrase, a novel unsupervised quality phrase mining framework built upon large pre-trained language models (LMs).
Specifically, we first mine quality phrases as silver labels by employing a parameter-free probing technique called Perturbed Masking on the pre-trained language model BERT (a minimal sketch of this probe appears after the related-papers list).
In contrast to typical statistics-based or distantly supervised methods, our silver labels, derived from large pre-trained language models, take into account the rich contextual information contained in the LMs.
arXiv Detail & Related papers (2023-12-28T20:32:44Z) - Towards Open Vocabulary Learning: A Survey [146.90188069113213]
Deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection.
Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training.
This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2023-06-28T02:33:06Z) - Sentiment-Aware Word and Sentence Level Pre-training for Sentiment
Analysis [64.70116276295609]
SentiWSP is a Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z) - Keywords and Instances: A Hierarchical Contrastive Learning Framework
Unifying Hybrid Granularities for Text Generation [59.01297461453444]
We propose a hierarchical contrastive learning mechanism, which can unify the semantic meaning of hybrid granularities in the input text.
Experiments demonstrate that our model outperforms competitive baselines on paraphrasing, dialogue generation, and storytelling tasks.
arXiv Detail & Related papers (2022-05-26T13:26:03Z) - Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt
Verbalizer for Text Classification [68.3291372168167]
We focus on incorporating external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT).
We expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded label word space with the PLM itself before using it for prediction.
Experiments on zero and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning.
arXiv Detail & Related papers (2021-08-04T13:00:16Z) - Unsupervised Deep Keyphrase Generation [14.544869226959612]
Keyphrase generation aims to summarize long documents with a collection of salient phrases.
Deep neural models have demonstrated remarkable success in this task, capable of predicting keyphrases that are even absent from a document.
We present a novel method for keyphrase generation, AutoKeyGen, without the supervision of any human annotation.
arXiv Detail & Related papers (2021-04-18T05:53:19Z) - Controlling Hallucinations at Word Level in Data-to-Text Generation [10.59137381324694]
State-of-the-art neural models include misleading statements in their outputs.
We propose a Multi-Branch Decoder which is able to leverage word-level labels to learn the relevant parts of each training instance.
Our model is able to reduce and control hallucinations, while keeping fluency and coherence in generated texts.
arXiv Detail & Related papers (2021-02-04T18:58:28Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct keywords from the rest of the words and to make low-confidence predictions when there is not enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - Fast and Robust Unsupervised Contextual Biasing for Speech Recognition [16.557586847398778]
We propose an alternative approach that does not entail an explicit contextual language model.
We derive the bias score for every word in the system vocabulary from the training corpus.
We show significant improvement in recognition accuracy when the relevant context is available.
arXiv Detail & Related papers (2020-05-04T17:29:59Z)
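The LMPhrase entry above mines silver labels with Perturbed Masking, a parameter-free probe that measures how strongly one token's presence influences BERT's representation of another token. The sketch below is an illustrative reconstruction of that probe, not the LMPhrase implementation; the bert-base-uncased checkpoint and the Euclidean distance are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def impact_matrix(sentence):
    """impact[i, j]: how much additionally masking token j changes the
    contextual representation BERT assigns to the (already masked) token i."""
    enc = tokenizer(sentence, return_tensors="pt")
    ids = enc["input_ids"][0]
    mask_id = tokenizer.mask_token_id
    n = ids.size(0)
    impact = torch.zeros(n, n)
    with torch.no_grad():
        for i in range(1, n - 1):              # skip [CLS] and [SEP]
            masked_i = ids.clone()
            masked_i[i] = mask_id              # mask position i
            h_i = model(masked_i.unsqueeze(0)).last_hidden_state[0, i]
            for j in range(1, n - 1):
                if j == i:
                    continue
                masked_ij = masked_i.clone()
                masked_ij[j] = mask_id         # additionally mask position j
                h_ij = model(masked_ij.unsqueeze(0)).last_hidden_state[0, i]
                impact[i, j] = torch.dist(h_i, h_ij)  # Euclidean distance
    return impact


# Adjacent tokens with high mutual impact are evidence of belonging to the
# same phrase; thresholding or segmenting this matrix yields candidate spans.
print(impact_matrix("unsupervised quality phrase mining finds salient phrases"))
```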