MatchXML: An Efficient Text-label Matching Framework for Extreme
Multi-label Text Classification
- URL: http://arxiv.org/abs/2308.13139v2
- Date: Mon, 11 Mar 2024 14:50:03 GMT
- Title: MatchXML: An Efficient Text-label Matching Framework for Extreme
Multi-label Text Classification
- Authors: Hui Ye, Rajshekhar Sunderraman, Shihao Ji
- Abstract summary: eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns a text sample relevant labels from a large-scale label set.
We propose MatchXML, an efficient text-label matching framework for XMC.
Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets.
- Score: 13.799733640048672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: eXtreme Multi-label text Classification (XMC) refers to training a
classifier that assigns a text sample relevant labels from an extremely
large-scale label set (e.g., millions of labels). We propose MatchXML, an
efficient text-label matching framework for XMC. We observe that the label
embeddings generated from the sparse Term Frequency-Inverse Document
Frequency (TF-IDF) features have several limitations. We thus propose label2vec
to effectively train the semantic dense label embeddings by the Skip-gram
model. The dense label embeddings are then used to build a Hierarchical Label
Tree by clustering. In fine-tuning the pre-trained encoder Transformer, we
formulate the multi-label text classification as a text-label matching problem
in a bipartite graph. We then extract the dense text representations from the
fine-tuned Transformer. Besides the fine-tuned dense text embeddings, we also
extract the static dense sentence embeddings from a pre-trained Sentence
Transformer. Finally, a linear ranker is trained by utilizing the sparse TF-IDF
features, the fine-tuned dense text representations and static dense sentence
features. Experimental results demonstrate that MatchXML achieves
state-of-the-art accuracy on five out of six datasets. In terms of training
speed, MatchXML outperforms the competing methods on all six datasets. Our source
code is publicly available at https://github.com/huiyegit/MatchXML.
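One step of the pipeline above is clustering the dense label embeddings into a Hierarchical Label Tree. The following is a minimal sketch of that idea, not the paper's implementation: it recursively bisects the label set with a plain 2-means over toy embeddings (all function names, the `max_leaf` parameter, and the toy data are illustrative assumptions).

```python
import numpy as np

def kmeans2(X, iters=20, seed=0):
    """Plain 2-means: returns a 0/1 cluster assignment for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):
        # Distance of every point to each of the two centers, then reassign.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for k in (0, 1):
            if (assign == k).any():
                centers[k] = X[assign == k].mean(axis=0)
    return assign

def build_label_tree(embeddings, label_ids, max_leaf=4, seed=0):
    """Recursively bisect the label set until each cluster is small enough.

    Returns a nested list: internal nodes are [left, right]; leaves are
    lists of label ids.
    """
    if len(label_ids) <= max_leaf:
        return list(label_ids)
    assign = kmeans2(embeddings, seed=seed)
    if assign.all() or not assign.any():  # degenerate split: stop here
        return list(label_ids)
    left = assign == 0
    return [build_label_tree(embeddings[left], label_ids[left], max_leaf, seed + 1),
            build_label_tree(embeddings[~left], label_ids[~left], max_leaf, seed + 1)]

# Toy dense label embeddings: two well-separated groups of eight labels each.
rng = np.random.default_rng(42)
emb = np.vstack([rng.normal(0, 0.1, (8, 16)), rng.normal(5, 0.1, (8, 16))])
tree = build_label_tree(emb, np.arange(16), max_leaf=4)
```

The resulting tree is what partition-based XMC methods use to narrow millions of candidate labels down to a small shortlist per text sample.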
Related papers
- Modeling Text-Label Alignment for Hierarchical Text Classification [12.579592946863762]
Hierarchical Text Classification (HTC) aims to categorize text data based on a structured label hierarchy, resulting in predicted labels forming a sub-hierarchy tree.
With the sub-hierarchy changing for each sample, the dynamic nature of text-label alignment poses challenges for existing methods.
We propose a Text-Label Alignment (TLA) loss specifically designed to model the alignment between text and labels.
arXiv Detail & Related papers (2024-09-01T17:48:29Z)
- Learning label-label correlations in Extreme Multi-label Classification via Label Features [44.00852282861121]
Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input a subset of the most relevant labels from millions of label choices.
Short-text XMC with label features has found numerous applications in areas such as query-to-ad-phrase matching in search ads, title-based product recommendation, and prediction of related searches.
We propose Gandalf, a novel approach which makes use of a label co-occurrence graph to leverage label features as additional data points to supplement the training distribution.
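Gandalf's central data structure is a label co-occurrence graph built from the training data. A minimal sketch of how such a graph could be counted from per-sample label sets follows (the function name and toy label strings are illustrative assumptions; Gandalf itself additionally exploits label features on top of the graph).

```python
from collections import defaultdict
from itertools import combinations

def label_cooccurrence_graph(label_sets):
    """Count how often each pair of labels is assigned to the same sample.

    Returns {label: {neighbor: count}} -- an adjacency view of the
    undirected co-occurrence graph.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for labels in label_sets:
        for a, b in combinations(sorted(set(labels)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

# Toy training data: each sample is tagged with a set of label ids.
samples = [{"shoes", "sneakers"}, {"shoes", "boots"}, {"shoes", "sneakers", "running"}]
g = label_cooccurrence_graph(samples)
# g["shoes"]["sneakers"] == 2  (they co-occur in the first and third samples)
```

Edges with high counts identify labels that behave like near-duplicates or strong correlates, which is the signal a method like Gandalf can feed back into training as additional supervision.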
arXiv Detail & Related papers (2024-05-03T21:18:43Z)
- Exploiting Dynamic and Fine-grained Semantic Scope for Extreme Multi-label Text Classification [12.508006325140949]
Extreme multi-label text classification (XMTC) refers to the problem of tagging a given text with the most relevant subset of labels from a large label set.
Most existing XMTC methods take advantage of fixed label clusters obtained in early stage to balance performance on tail labels and head labels.
We propose a novel framework TReaderXML for XMTC, which adopts dynamic and fine-grained semantic scope from teacher knowledge.
arXiv Detail & Related papers (2022-05-24T11:15:35Z)
- Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
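Both TCM and MatchXML score a text representation against label representations. One common instantiation of such a matching score is cosine similarity between the text embedding and every label embedding; the sketch below is an illustrative assumption, not necessarily either paper's exact scoring function.

```python
import numpy as np

def match_scores(text_emb, label_embs):
    """Cosine similarity between one text embedding and every label embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return L @ t

# Toy 2-d embeddings: one text vector scored against three label vectors.
text = np.array([1.0, 0.0])
labels = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = match_scores(text, labels)
# scores ~ [1.0, 0.0, 0.707]; taking the top-k of this ranking yields the predicted labels
```

Formulating classification as matching like this lets the label text itself contribute fine-grained semantics, instead of treating labels as anonymous output indices.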
arXiv Detail & Related papers (2022-05-23T15:51:19Z)
- Label Disentanglement in Partition-based Extreme Multilabel Classification [111.25321342479491]
We show that the label assignment problem in partition-based XMC can be formulated as an optimization problem.
We show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
arXiv Detail & Related papers (2021-06-24T03:24:18Z)
- HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization [75.45291796263103]
The current state-of-the-art model for hierarchical text classification, HiAGM, has two limitations.
It correlates each text sample with all labels in the dataset, which introduces irrelevant information.
We propose HTCInfoMax to address these issues by introducing information maximization, which includes two modules.
arXiv Detail & Related papers (2021-04-12T06:04:20Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification [27.80266694835677]
Extreme Multi-label text Classification (XMC) is the task of finding the most relevant labels from a large label set.
We propose LightXML, which adopts end-to-end training and dynamic negative labels sampling.
In experiments, LightXML outperforms state-of-the-art methods on five extreme multi-label datasets.
arXiv Detail & Related papers (2021-01-09T07:04:18Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification [68.15015032551214]
MixText is a semi-supervised learning method for text classification.
TMix creates a large amount of augmented training samples by interpolating text in hidden space.
We leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data.
arXiv Detail & Related papers (2020-04-25T21:37:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.