Hierarchical Text Classification (HTC) vs. eXtreme Multilabel Classification (XML): Two Sides of the Same Medal
- URL: http://arxiv.org/abs/2411.13687v1
- Date: Wed, 20 Nov 2024 20:07:25 GMT
- Authors: Nerijus Bertalis, Paul Granse, Ferhat Gül, Florian Hauss, Leon Menkel, David Schüler, Tom Speier, Lukas Galke, Ansgar Scherp
- Abstract: Assigning a subset of labels from a fixed pool of labels to a given input text is a text classification problem with many real-world applications, such as in recommender systems. Two separate research streams address this problem. Hierarchical Text Classification (HTC) focuses on datasets with smaller label pools of hundreds of entries, accompanied by a semantic label hierarchy. In contrast, eXtreme Multi-Label Text Classification (XML) considers very large label pools with up to millions of entries, in which the labels are not arranged in any particular manner. In XML, however, a common approach is to construct an artificial hierarchy, without any semantic information, before or during the training process. Here, we investigate how state-of-the-art models from one domain perform when trained and tested on datasets from the other domain. The HBGL and HGCLR models from the HTC domain are trained and tested on the datasets Wiki10-31K, AmazonCat-13K, and Amazon-670K from the XML domain. Conversely, the XML models CascadeXML and XR-Transformer are trained and tested on the datasets Web of Science, The New York Times Annotated Corpus, and RCV1-V2 from the HTC domain. The HTC models, by contrast, are not equipped to handle the size of the XML datasets and achieve poor transfer results. The code and all files needed to reproduce our results are available at https://github.com/FloHauss/XMC_HTC
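The artificial label hierarchy that XML methods build can be illustrated with a toy sketch. The code below is a hypothetical, heavily simplified illustration, not the paper's implementation: real systems such as XR-Transformer cluster high-dimensional label embeddings with balanced k-means, whereas here each label gets a single scalar feature and the label set is recursively split in half to form a binary tree.

```python
def build_label_tree(labels, feat, max_leaf=2):
    """Recursively split a flat label set into a binary tree (nested lists).

    A 1-D miniature of the artificial-hierarchy construction used in XML:
    instead of balanced k-means over label embeddings, we sort labels by a
    scalar feature and split the ordered list in half at every level.
    """
    if len(labels) <= max_leaf:
        return list(labels)          # leaf: a small group of similar labels
    ordered = sorted(labels, key=lambda lbl: feat[lbl])
    mid = len(ordered) // 2
    return [build_label_tree(ordered[:mid], feat, max_leaf),
            build_label_tree(ordered[mid:], feat, max_leaf)]

# Toy label features, e.g. the mean feature value of documents tagged
# with each label (purely illustrative values).
feat = {"a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8, "e": 0.3, "f": 0.7}
tree = build_label_tree(list(feat), feat)
# tree == [[["a"], ["c", "e"]], [["f"], ["d", "b"]]]
```

Note that the resulting tree carries no semantic information; it only groups labels that happen to be close in feature space, which is exactly the contrast the paper draws with the semantic hierarchies used in HTC.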
Related papers
- Utilizing Local Hierarchy with Adversarial Training for Hierarchical Text Classification (2024-02-29)
  Hierarchical text classification (HTC) is a challenging subtask due to its complex taxonomic structure. We propose HiAdv, a framework that can fit nearly all HTC models and optimize them with the local hierarchy as auxiliary information.
- LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories (2023-11-20)
  This work introduces a fully automated 2D/3D labeling framework that can generate labels for RGB-D scans at an equal (or better) level of accuracy. We demonstrate the effectiveness of the LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labeling the previously unlabeled ARKitScenes dataset.
- MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification (2023-08-25)
  eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns a text sample the relevant labels from a large-scale label set. We propose MatchXML, an efficient text-label matching framework for XMC. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets.
- A Survey on Extreme Multi-label Learning (2022-10-08)
  Multi-label learning has attracted significant attention from both academia and industry in recent decades. Conventional methods are infeasible to adapt directly to extremely large label spaces because of the compute and memory overhead. eXtreme Multi-label Learning (XML) is therefore becoming an important task, and many effective approaches have been proposed.
- Extreme Zero-Shot Learning for Extreme Text Classification (2021-12-16)
  Extreme Zero-Shot XMC (EZ-XMC) and Few-Shot XMC (FS-XMC) are investigated. We propose to pre-train Transformer-based encoders with self-supervised contrastive losses and develop the pre-training method MACLR, which thoroughly leverages the raw text with techniques including Multi-scale Adaptive Clustering, Label Regularization, and self-training with pseudo positive pairs.
- DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents (2021-11-12)
  This paper develops the DeepXML framework, which decomposes the deep extreme multi-label task into four simpler sub-tasks, each of which can be trained accurately and efficiently. DeepXML yields the Astec algorithm, which can be 2-12% more accurate and 5-30x faster to train than leading deep extreme classifiers on publicly available short-text datasets. Astec can also efficiently train on Bing short-text datasets containing up to 62 million labels while making predictions for billions of users and data points per day on commodity hardware.
- HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization (2021-04-12)
  The current state-of-the-art model for hierarchical text classification, HiAGM, has two limitations; among them, it correlates each text sample with all labels in the dataset, which introduces irrelevant information. We propose HTCInfoMax, which addresses these issues by introducing information maximization with two modules.
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks (2021-02-23)
  We propose a novel framework for minimally supervised categorization by learning from a text-rich network. Specifically, we jointly train two modules with different inductive biases: a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Our experiments show that, given only three seed documents per category, the framework can achieve an accuracy of about 92%.
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy (2021-02-15)
  MATCH is an end-to-end framework that leverages both metadata and hierarchy information. We propose different ways to regularize the parameters and output probability of each child label by its parents. Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
- LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification (2021-01-09)
  Extreme Multi-label text Classification (XMC) is the task of finding the most relevant labels from a large label set. We propose LightXML, which adopts end-to-end training and dynamic negative label sampling. In experiments, LightXML outperforms state-of-the-art methods on five extreme multi-label datasets.
This list is automatically generated from the titles and abstracts of the papers on this site.