Weakly Supervised Text Classification using Supervision Signals from a
Language Model
- URL: http://arxiv.org/abs/2205.06604v1
- Date: Fri, 13 May 2022 12:57:15 GMT
- Title: Weakly Supervised Text Classification using Supervision Signals from a
Language Model
- Authors: Ziqian Zeng, Weimin Ni, Tianqing Fang, Xiang Li, Xinran Zhao and
Yangqiu Song
- Abstract summary: We design a prompt which combines the document itself and "this article is talking about [MASK]"
A masked language model can generate words for the [MASK] token.
The generated words which summarize the content of a document can be utilized as supervision signals.
- Score: 33.5830441120473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solving text classification in a weakly supervised manner is important for
real-world applications where human annotations are scarce. In this paper, we
propose to query a masked language model with cloze style prompts to obtain
supervision signals. We design a prompt which combines the document itself and
"this article is talking about [MASK]." A masked language model can generate
words for the [MASK] token. The generated words which summarize the content of
a document can be utilized as supervision signals. We propose a latent variable
model to learn a word distribution learner which associates generated words to
pre-defined categories and a document classifier simultaneously without using
any annotated data. Evaluation on three datasets, AGNews, 20Newsgroups, and
UCINews, shows that our method can outperform baselines by 2%, 4%, and 3%.
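As a concrete illustration of the querying step, the sketch below feeds the paper's cloze prompt to a masked language model via the HuggingFace fill-mask pipeline. The choice of bert-base-uncased and the example document are assumptions for illustration; the abstract does not name a specific model.

```python
# Minimal sketch of the supervision-signal step described in the abstract,
# assuming bert-base-uncased as the masked language model. Not the authors'
# implementation.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

document = "Oil prices fell sharply after OPEC announced higher output."
prompt = document + " this article is talking about [MASK]."

# The top-k words predicted for [MASK] summarize the document's content
# and serve as weak supervision signals for the classifier.
for pred in fill_mask(prompt, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```

The top-ranked words would then be mapped to the pre-defined categories by the word distribution learner in the latent variable model described above.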
Related papers
- NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents [17.94934249657174]
NextLevelBERT is a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings.
We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperform much larger embedding models as long as the required level of semantic detail is not too fine.
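The core idea lends itself to a rough sketch: embed chunks of a long document, mask one chunk embedding, and train a small transformer to reconstruct it. The chunk embedder and tiny encoder below are stand-ins, not the paper's actual architecture or objective.

```python
# Rough sketch of masked modeling over chunk embeddings instead of tokens.
# The embedder ("all-MiniLM-L6-v2", 384-dim) and the small encoder are
# illustrative stand-ins; NextLevelBERT's real setup differs in detail.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_chunks(text, words_per_chunk=64):
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

long_document = " ".join(["lorem ipsum"] * 300)       # stand-in long document
chunks = split_into_chunks(long_document)
emb = torch.tensor(embedder.encode(chunks))           # (num_chunks, 384)

next_level = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True),
    num_layers=2,
)

# Mask one chunk embedding and train the next-level model to reconstruct it.
masked = emb.clone()
masked[1] = 0.0                                       # the "masked" position
pred = next_level(masked.unsqueeze(0)).squeeze(0)
loss = nn.functional.mse_loss(pred[1], emb[1])
loss.backward()                                       # one reconstruction step
```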
arXiv Detail & Related papers (2024-02-27T16:56:30Z)
- Token Prediction as Implicit Classification to Identify LLM-Generated Text [37.89852204279844]
This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation.
Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task.
We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments.
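A hedged sketch of this reframing: the class label is produced as generated text by the T5 decoder rather than read off a dedicated classification head. The task prefix and usage below are illustrative, not taken from the paper.

```python
# Sketch of classification as next-token prediction with T5: the label is
# emitted as text by the decoder instead of a classification layer. The
# prompt format and fine-tuning recipe here are assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# After fine-tuning on (text, label-string) pairs, the generated tokens act
# as the predicted class (e.g. which LLM produced the text).
text = "The generated passage exhibits unusually uniform sentence lengths."
inputs = tokenizer("classify source: " + text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```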
arXiv Detail & Related papers (2023-11-15T06:33:52Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying the classifier: language descriptions, image exemplars, or a combination of the two.
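One way to picture these three options is with CLIP embeddings, sketched below; CLIP and the simple embedding averaging are assumptions for illustration, not necessarily the paper's exact recipe.

```python
# Sketch of building an open-vocabulary classifier vector from a language
# description, from image exemplars, or from both, using CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# (1) classifier from a language description
text_in = processor(text=["a small bird with a red chest"], return_tensors="pt")
text_vec = model.get_text_features(**text_in)

# (2) classifier from image exemplars (blank stand-ins; use real crops)
exemplars = [Image.new("RGB", (224, 224)) for _ in range(2)]
img_in = processor(images=exemplars, return_tensors="pt")
img_vec = model.get_image_features(**img_in).mean(dim=0, keepdim=True)

# (3) combination of the two: average the L2-normalized embeddings
norm = lambda v: v / v.norm(dim=-1, keepdim=True)
classifier = norm(norm(text_vec) + norm(img_vec))
# Detector region features are then scored against `classifier`.
```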
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification [108.83932812826521]
Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
arXiv Detail & Related papers (2022-12-05T14:11:36Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
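The underlying self-training loop can be sketched generically: train on the few labeled examples, pseudo-label the unlabeled data, keep only confident predictions, and retrain. SFLM applies this idea to prompt-based fine-tuning of a language model; the scikit-learn sketch below only demonstrates the pseudo-labeling mechanics.

```python
# Generic self-training loop (illustrative, not SFLM's implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great movie", "terrible plot"]
labels = np.array([1, 0])
unlabeled_texts = ["loved every minute", "waste of time", "it was fine"]

vec = TfidfVectorizer().fit(labeled_texts + unlabeled_texts)
X_lab, X_unlab = vec.transform(labeled_texts), vec.transform(unlabeled_texts)

clf = LogisticRegression().fit(X_lab, labels)
for _ in range(3):                       # a few self-training rounds
    probs = clf.predict_proba(X_unlab)
    confident = probs.max(axis=1) > 0.8  # confidence threshold (a choice)
    if not confident.any():
        break
    X_aug = np.vstack([X_lab.toarray(), X_unlab[confident].toarray()])
    y_aug = np.concatenate([labels, probs[confident].argmax(axis=1)])
    clf = LogisticRegression().fit(X_aug, y_aug)
```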
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a substantial number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Watch, read and lookup: learning to spot signs from multiple supervisors [99.50956498009094]
Given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video.
We train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles which provide additional weak-supervision; and (3) looking up words in visual sign language dictionaries.
These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning.
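A toy sketch of how those two principles can combine: with no window-level labels (the multiple-instance setting), the best-matching temporal window stands in as the positive of a contrastive, NCE-style objective. The dimensions and temperature below are illustrative choices, not the paper's framework.

```python
# MIL + NCE sketch: the best-scoring window is treated as the positive
# because no individual window carries a label.
import torch
import torch.nn.functional as F

query = torch.randn(256)            # embedding of the isolated sign clip
windows = torch.randn(50, 256)      # embeddings of continuous-video windows

sims = F.cosine_similarity(query.unsqueeze(0), windows) / 0.07  # temperature
positive = sims.max()               # MIL: best window stands in as positive
loss = -(positive - torch.logsumexp(sims, dim=0))               # NCE-style loss
```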
arXiv Detail & Related papers (2020-10-08T14:12:56Z)
- Keyphrase Extraction with Span-based Feature Representations [13.790461555410747]
Keyphrases are capable of providing semantic metadata characterizing documents.
Three approaches address keyphrase extraction: (i) the traditional two-step ranking method, (ii) sequence labeling, and (iii) generation using neural networks.
In this paper, we propose a novel Span Keyphrase Extraction model that extracts span-based feature representations of keyphrases directly from all the content tokens.
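A toy sketch of span-based scoring: enumerate candidate spans over contextual token embeddings and score each with a small feed-forward head. The encoder choice and endpoint-concatenation features are assumptions, and the scorer shown is untrained; in practice it would be trained on labeled keyphrase spans.

```python
# Toy span-based keyphrase scoring sketch (not the paper's exact design).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")
scorer = nn.Sequential(nn.Linear(2 * 768, 256), nn.ReLU(), nn.Linear(256, 1))

text = "Keyphrases provide semantic metadata characterizing documents."
inputs = tok(text, return_tensors="pt")
hidden = enc(**inputs).last_hidden_state.squeeze(0)     # (seq_len, 768)

spans, scores = [], []
for i in range(1, hidden.size(0) - 1):                  # skip [CLS]/[SEP]
    for j in range(i, min(i + 4, hidden.size(0) - 1)):  # spans up to 4 tokens
        feat = torch.cat([hidden[i], hidden[j]])        # endpoint features
        spans.append((i, j))
        scores.append(scorer(feat))

best = max(range(len(spans)), key=lambda k: scores[k].item())
i, j = spans[best]
print(tok.decode(inputs["input_ids"][0][i:j + 1]))
```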
arXiv Detail & Related papers (2020-02-13T09:48:31Z)
- Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text [1.6752182911522517]
We present a labeled dataset called MultiSenti for sentiment classification of code-switched informal short text.
We propose a deep learning-based model for sentiment classification of code-switched informal short text.
arXiv Detail & Related papers (2020-01-04T06:31:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.