Weakly-Supervised Scientific Document Classification via
Retrieval-Augmented Multi-Stage Training
- URL: http://arxiv.org/abs/2306.07193v1
- Date: Mon, 12 Jun 2023 15:50:13 GMT
- Authors: Ran Xu, Yue Yu, Joyce C. Ho, Carl Yang
- Abstract summary: We propose a weakly-supervised approach for scientific document classification using label names only.
In scientific domains, label names often include domain-specific concepts that may not appear in the document corpus.
We show that WANDER outperforms the best baseline by 11.9% on average.
- Score: 24.2734548438594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific document classification is a critical task for a wide range of
applications, but the cost of obtaining massive amounts of human-labeled data
can be prohibitive. To address this challenge, we propose a weakly-supervised
approach for scientific document classification using label names only. In
scientific domains, label names often include domain-specific concepts that may
not appear in the document corpus, making it difficult to match labels and
documents precisely. To tackle this issue, we propose WANDER, which leverages
dense retrieval to perform matching in the embedding space to capture the
semantics of label names. We further design the label name expansion module to
enrich the label name representations. Lastly, a self-training step is used to
refine the predictions. The experiments on three datasets show that WANDER
outperforms the best baseline by 11.9% on average. Our code will be published
at https://github.com/ritaranx/wander.
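The abstract describes matching label names to documents in an embedding space and then enriching the label representations. The following is a minimal sketch of that general idea, not the authors' implementation: the toy 2-D vectors stand in for the output of a dense encoder, and the helper names (`cosine`, `retrieve`, `pseudo_label`, `expand_label`) and the averaging rule used for expansion are assumptions made for illustration. The self-training step is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(doc_vecs, label_vec, top_k):
    """Indices of the top_k documents most similar to one label vector."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(doc_vecs[i], label_vec),
        reverse=True,
    )
    return ranked[:top_k]


def pseudo_label(doc_vecs, label_vecs, top_k):
    """Assign each retrieved document the label it is most similar to.

    Documents that no label retrieves stay unlabeled.
    """
    best = {}  # doc index -> (similarity, label index)
    for label_idx, label_vec in enumerate(label_vecs):
        for doc_idx in retrieve(doc_vecs, label_vec, top_k):
            sim = cosine(doc_vecs[doc_idx], label_vec)
            if doc_idx not in best or sim > best[doc_idx][0]:
                best[doc_idx] = (sim, label_idx)
    return {doc_idx: label_idx for doc_idx, (_, label_idx) in best.items()}


def expand_label(doc_vecs, label_vec, top_k):
    """Hypothetical expansion rule: average the label vector with the
    vectors of its top_k retrieved documents."""
    vecs = [label_vec] + [doc_vecs[i] for i in retrieve(doc_vecs, label_vec, top_k)]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


# Toy embeddings (assumed; a real system would use a dense text encoder).
label_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]

if __name__ == "__main__":
    print(pseudo_label(doc_vecs, label_vecs, top_k=1))
    print(expand_label(doc_vecs, label_vecs[0], top_k=2))
```

Because matching happens on vectors rather than on surface strings, a label name can still retrieve relevant documents when its exact wording never appears in the corpus, which is the difficulty the abstract highlights.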
Related papers
- Open-world Multi-label Text Classification with Extremely Weak Supervision [30.85235057480158]
We study open-world multi-label text classification under extremely weak supervision (XWS).
We first utilize the user description to prompt a large language model (LLM) for dominant keyphrases of a subset of raw documents, and then construct a label space via clustering.
We then apply a zero-shot multi-label classifier to locate the documents with small top predicted scores, so we can revisit their dominant keyphrases for more long-tail labels.
X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets.
(2024-07-08T04:52:49Z)
- Towards Imbalanced Large Scale Multi-label Classification with Partially
Annotated Labels [8.977819892091]
Multi-label classification is a widely encountered problem in daily life, where an instance can be associated with multiple classes.
In this work, we address the issue of label imbalance and investigate how to train neural networks using partial labels.
(2023-07-31T21:50:48Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with
Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
(2023-03-23T12:39:20Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
(2023-03-07T15:07:57Z)
- Adopting the Multi-answer Questioning Task with an Auxiliary Metric for
Extreme Multi-label Text Classification Utilizing the Label Hierarchy [10.87653109398961]
This paper adopts the multi-answer questioning task for extreme multi-label classification.
This study applies the proposed method and the evaluation metric to the legal domain.
(2023-03-02T08:40:31Z)
- Label Semantic Aware Pre-training for Few-shot Text Classification [53.80908620663974]
We propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems.
LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains.
(2022-04-14T17:33:34Z)
- Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
(2021-11-24T21:01:25Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
(2021-02-15T05:23:08Z)
- Does Head Label Help for Long-Tailed Multi-Label Text Classification [45.762555329467446]
In real applications, the distribution of label frequency often exhibits a long tail, i.e., a few labels are associated with a large number of documents.
We propose a Head-to-Tail Network (HTTN) to transfer the meta-knowledge from the data-rich head labels to data-poor tail labels.
(2021-01-24T12:31:39Z)
- Text Classification Using Label Names Only: A Language Model
Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
(2020-10-14T17:06:41Z)
- Label-Wise Document Pre-Training for Multi-Label Text Classification [14.439051753832032]
This paper develops a Label-Wise Pre-Training (LW-PT) method to obtain a document representation with label-aware information.
The basic idea is that a multi-label document can be represented as a combination of multiple label-wise representations, and that correlated labels always co-occur in the same or similar documents.
(2020-08-15T10:34:27Z)