Language Model as an Annotator: Unsupervised Context-aware Quality
Phrase Generation
- URL: http://arxiv.org/abs/2312.17349v1
- Date: Thu, 28 Dec 2023 20:32:44 GMT
- Title: Language Model as an Annotator: Unsupervised Context-aware Quality
Phrase Generation
- Authors: Zhihao Zhang, Yuan Zuo, Chenghua Lin, Junjie Wu
- Abstract summary: We propose LMPhrase, a novel unsupervised quality phrase mining framework built upon large pre-trained language models (LMs).
Specifically, we first mine quality phrases as silver labels by employing a parameter-free probing technique called Perturbed Masking on the pre-trained language model BERT.
In contrast to typical statistics-based or distantly-supervised methods, our silver labels, derived from large pre-trained language models, take into account the rich contextual information captured by the LMs.
- Score: 20.195149109523314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phrase mining is a fundamental text mining task that aims to identify
quality phrases from context. Nevertheless, the scarcity of large gold-labeled
datasets, which demand substantial annotation effort from experts, makes this
task exceptionally challenging. Furthermore, the emerging, infrequent, and
domain-specific nature of quality phrases poses additional difficulties. In
this paper, we propose LMPhrase, a novel
unsupervised context-aware quality phrase mining framework built upon large
pre-trained language models (LMs). Specifically, we first mine quality phrases
as silver labels by employing a parameter-free probing technique called
Perturbed Masking on the pre-trained language model BERT (coined as Annotator).
In contrast to typical statistics-based or distantly-supervised methods, our
silver labels, derived from large pre-trained language models, take into
account the rich contextual information captured by the LMs. As a result, they
bring distinct advantages in preserving informativeness, concordance, and
completeness of quality phrases. Secondly, because training a discriminative
span prediction model relies heavily on massive annotated data and risks
overfitting the silver labels, we instead formalize the phrase tagging task as
a sequence generation problem by directly fine-tuning the sequence-to-sequence
pre-trained language model BART on the silver labels (coined as Generator).
Finally, we merge the quality phrases from both the Annotator and the Generator
as the final predictions, given their complementary nature and distinct
characteristics. Extensive experiments show that LMPhrase consistently
outperforms all existing competitors on two phrase mining tasks of different
granularity, each evaluated on datasets from two different domains.
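To make the two-stage pipeline more concrete, below is a minimal sketch of the Annotator step under stated assumptions: it computes Perturbed Masking impact scores with an off-the-shelf BERT and scores candidate spans by their average internal impact. The span-scoring heuristic, the candidate enumeration, and all function names are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical sketch of Perturbed Masking-style impact scoring with BERT.
# Assumption: silver phrases are contiguous spans whose tokens strongly
# influence each other; the exact selection rule in LMPhrase may differ.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def token_impact(token_ids, i, j):
    """Impact of token j on token i: distance between BERT's representation
    of position i with token i masked vs. with tokens i and j both masked."""
    mask_id = tokenizer.mask_token_id
    seq = [tokenizer.cls_token_id] + list(token_ids) + [tokenizer.sep_token_id]

    one_masked = list(seq)
    one_masked[i + 1] = mask_id            # +1 offset for the [CLS] token
    two_masked = list(one_masked)
    two_masked[j + 1] = mask_id

    with torch.no_grad():
        h1 = model(torch.tensor([one_masked])).last_hidden_state[0, i + 1]
        h2 = model(torch.tensor([two_masked])).last_hidden_state[0, i + 1]
    return torch.dist(h1, h2).item()       # Euclidean distance

def span_score(token_ids, start, end):
    """Assumed heuristic: average pairwise impact inside a candidate span
    [start, end); high-scoring spans are kept as silver phrase labels."""
    pairs = [(i, j) for i in range(start, end)
                    for j in range(start, end) if i != j]
    return sum(token_impact(token_ids, i, j) for i, j in pairs) / max(len(pairs), 1)

tokens = tokenizer.tokenize("quality phrase mining with pre-trained language models")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(span_score(ids, 0, 2))               # score for the span "quality phrase"
```

The Generator step can be sketched in a similarly hedged way: once silver phrases are available, phrase tagging is cast as sequence generation by fine-tuning BART to emit the phrases found in each sentence. The target serialization below (phrases joined by a separator) is an assumed format for illustration, not necessarily the one used in the paper.

```python
# Hypothetical sketch of fine-tuning BART as the Generator on silver labels.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One (sentence, silver phrases) pair; in practice these come from the Annotator.
sentence = "quality phrase mining with pre-trained language models"
silver_phrases = ["quality phrase mining", "pre-trained language models"]

inputs = tokenizer(sentence, return_tensors="pt")
labels = tokenizer(" ; ".join(silver_phrases), return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()

# At inference time, generated phrases would be merged with the Annotator's
# spans to form the final predictions, as described above.
model.eval()
generated = model.generate(**inputs, max_length=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```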
Related papers
- Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings [5.257719744958367]
This thesis explores three challenging settings in text classification by leveraging the intrinsic knowledge of pretrained language models (PLMs).
We develop models that utilize features based on contextualized word representations from PLMs, achieving performance that rivals or surpasses human accuracy.
Lastly, we tackle the sensitivity of large language models to in-context learning prompts by selecting effective demonstrations.
arXiv Detail & Related papers (2024-08-28T09:07:30Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its powerful language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Leveraging Natural Supervision for Language Representation Learning and Generation [8.083109555490475]
We describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.
We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks.
We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations.
arXiv Detail & Related papers (2022-07-21T17:26:03Z)
- Learning to Selectively Learn for Weakly-supervised Paraphrase Generation [81.65399115750054]
We propose a novel approach to generate high-quality paraphrases with weak supervision data.
Specifically, we tackle the weakly-supervised paraphrase generation problem by obtaining abundant weakly-labeled parallel sentences via retrieval-based pseudo paraphrase expansion.
We demonstrate that our approach achieves significant improvements over existing unsupervised approaches and is even comparable in performance with supervised state-of-the-art methods.
arXiv Detail & Related papers (2021-09-25T23:31:13Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Unsupervised Paraphrase Generation using Pre-trained Language Models [0.0]
OpenAI's GPT-2 is notable for its capability to generate fluent, well-formulated, grammatically consistent text.
We leverage this generation capability of GPT-2 to generate paraphrases without any supervision from labelled data.
Our experiments show that paraphrases generated with our model are of good quality, are diverse, and improve downstream task performance when used for data augmentation.
arXiv Detail & Related papers (2020-06-09T19:40:19Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
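As a loose illustration of the twin (Siamese) encoder setup described in the BURT entry above, the sketch below encodes two texts with a shared BERT encoder, mean-pools the token representations into fixed-size vectors, and compares them with cosine similarity. The pooling choice and checkpoint name are assumptions; BURT's actual training objectives and architectural details are not reproduced here.

```python
# Hypothetical Siamese-style encoder: a shared BERT produces fixed-size
# representations for inputs of any granularity (word, phrase, or sentence).
# Mean pooling and the checkpoint name are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased").eval()

def embed(text):
    """Encode a text of any granularity into a fixed-size vector via mean pooling."""
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # [1, seq_len, dim]
    mask = batch["attention_mask"].unsqueeze(-1)          # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)           # [1, dim]

# Both sides of the twin structure share the same encoder weights.
a, b = embed("quality phrase"), embed("high-quality phrases")
print(torch.nn.functional.cosine_similarity(a, b).item())
```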