MATCH: Metadata-Aware Text Classification in A Large Hierarchy
- URL: http://arxiv.org/abs/2102.07349v1
- Date: Mon, 15 Feb 2021 05:23:08 GMT
- Title: MATCH: Metadata-Aware Text Classification in A Large Hierarchy
- Authors: Yu Zhang, Zhihong Shen, Yuxiao Dong, Kuansan Wang, Jiawei Han
- Abstract summary: MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
- Score: 60.59183151617578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-label text classification refers to the problem of assigning each given
document its most relevant labels from the label set. Commonly, the metadata of
the given documents and the hierarchy of the labels are available in real-world
applications. However, most existing studies focus on only modeling the text
information, with a few attempts to utilize either metadata or hierarchy
signals, but not both of them. In this paper, we bridge the gap by formalizing
the problem of metadata-aware text classification in a large label hierarchy
(e.g., with tens of thousands of labels). To address this problem, we present
the MATCH solution -- an end-to-end framework that leverages both metadata and
hierarchy information. To incorporate metadata, we pre-train the embeddings of
text and metadata in the same space and use fully-connected attention to capture
their interrelations. To leverage the label
hierarchy, we propose different ways to regularize the parameters and output
probability of each child label by its parents. Extensive experiments on two
massive text datasets with large-scale label hierarchies demonstrate the
effectiveness of MATCH over state-of-the-art deep learning baselines.
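To make the hierarchy-regularization idea concrete, here is a minimal sketch, not the authors' implementation: it assumes per-label sigmoid outputs and uses illustrative names (`parent_of`, `lambda_param`, `lambda_out`) to show the two kinds of regularization the abstract describes, keeping each child label's classifier weights close to its parent's and penalizing any document for which a child's predicted probability exceeds its parent's.

```python
# Illustrative sketch only; not MATCH's actual code. Assumes a per-label linear
# classifier (label_weights) and sigmoid outputs (probs) for multi-label prediction.
import torch

def hierarchy_regularizers(label_weights: torch.Tensor,  # [L, d] per-label classifier weights
                           probs: torch.Tensor,          # [B, L] sigmoid outputs per document
                           parent_of: dict,              # child label id -> parent label id
                           lambda_param: float = 1e-3,
                           lambda_out: float = 1e-2) -> torch.Tensor:
    children = torch.tensor(list(parent_of.keys()))
    parents = torch.tensor(list(parent_of.values()))

    # Parameter regularization: keep each child's classifier weights close to its parent's.
    param_reg = ((label_weights[children] - label_weights[parents]) ** 2).sum()

    # Output regularization: hinge penalty whenever P(child) > P(parent) for a document.
    output_reg = torch.clamp(probs[:, children] - probs[:, parents], min=0).sum()

    return lambda_param * param_reg + lambda_out * output_reg
```

In training, such a term would simply be added to the usual binary cross-entropy loss over the labels.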
Related papers
- Open-world Multi-label Text Classification with Extremely Weak Supervision [30.85235057480158]
We study open-world multi-label text classification under extremely weak supervision (XWS).
We first utilize the user description to prompt a large language model (LLM) for dominant keyphrases of a subset of raw documents, and then construct a label space via clustering.
We then apply a zero-shot multi-label classifier to locate the documents with small top predicted scores, so we can revisit their dominant keyphrases for more long-tail labels.
X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets.
arXiv Detail & Related papers (2024-07-08T04:52:49Z)
- Semi-Supervised Hierarchical Multi-Label Classifier Based on Local Information [1.6574413179773761]
The semi-supervised hierarchical multi-label classifier based on local information (SSHMC-BLI) builds pseudo-labels for each unlabeled instance from the label paths of its labeled neighbors.
Experiments on 12 challenging datasets from functional genomics show that using unlabeled data along with labeled data helps improve a supervised hierarchical classifier trained only on labeled data.
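A simplified sketch of this pseudo-labeling step follows; it is not the authors' algorithm, and the Euclidean nearest-neighbor search and majority-vote threshold are assumptions made for illustration.

```python
# Simplified stand-in for neighbor-based pseudo-labeling: each unlabeled instance
# inherits the labels that most of its k nearest labeled neighbors share.
import numpy as np

def build_pseudo_labels(x_unlabeled: np.ndarray,   # [m, d] unlabeled feature vectors
                        x_labeled: np.ndarray,     # [n, d] labeled feature vectors
                        y_labeled: np.ndarray,     # [n, L] binary label-path indicators
                        k: int = 5,
                        vote_threshold: float = 0.5) -> np.ndarray:
    # Distances from every unlabeled instance to every labeled instance.
    dists = np.linalg.norm(x_unlabeled[:, None, :] - x_labeled[None, :, :], axis=-1)
    neighbors = np.argsort(dists, axis=1)[:, :k]       # [m, k] nearest labeled indices
    votes = y_labeled[neighbors].mean(axis=1)          # [m, L] fraction of neighbors with each label
    return (votes >= vote_threshold).astype(int)       # keep labels most neighbors agree on
```

The pseudo-labeled instances would then be added to the training set of the hierarchical classifier; the cited work additionally checks that the neighbors' label paths are similar before combining them, a step the simple vote above only approximates.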
arXiv Detail & Related papers (2024-04-30T20:16:40Z)
- SEAL: Simultaneous Label Hierarchy Exploration And Learning [9.701914280306118]
We propose a new framework that explores the label hierarchy by augmenting the observed labels with latent labels that follow a prior hierarchical structure.
Our approach uses a 1-Wasserstein metric over the tree metric space as an objective function, which enables us to simultaneously learn a data-driven label hierarchy and perform (semi-supervised) learning.
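For context, the 1-Wasserstein distance under a tree metric has a well-known closed form: sum, over every edge, the edge weight times the absolute net probability mass that must cross that edge. The sketch below only illustrates that formula, it is not the paper's implementation, and it assumes nodes are numbered so that every parent index precedes its children.

```python
# Closed-form W1 on a tree metric (illustration only).
import numpy as np

def tree_wasserstein(p: np.ndarray, q: np.ndarray,
                     parent: list, weight: np.ndarray) -> float:
    """p, q   : [N] probability masses on the N tree nodes
    parent : parent[i] is the parent of node i; node 0 is the root (parent[0] = -1)
    weight : [N] weight of edge (i, parent[i]); weight[0] is unused
    Assumes parent[i] < i for every non-root node."""
    subtree = (p - q).astype(float)
    for i in range(len(parent) - 1, 0, -1):    # visit children before parents
        subtree[parent[i]] += subtree[i]       # subtree[i] = net mass difference below edge i
    return float(np.sum(weight[1:] * np.abs(subtree[1:])))
```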
arXiv Detail & Related papers (2023-04-26T08:31:59Z)
- Exploiting Dynamic and Fine-grained Semantic Scope for Extreme Multi-label Text Classification [12.508006325140949]
Extreme multi-label text classification (XMTC) refers to the problem of tagging a given text with the most relevant subset of labels from a large label set.
Most existing XMTC methods rely on fixed label clusters obtained at an early stage to balance performance between head and tail labels.
We propose a novel framework TReaderXML for XMTC, which adopts dynamic and fine-grained semantic scope from teacher knowledge.
arXiv Detail & Related papers (2022-05-24T11:15:35Z)
- Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework [75.79736930414715]
We present a hierarchical multi-label representation learning framework that can leverage all available labels and preserve the hierarchical relationship between classes.
We introduce novel hierarchy-preserving losses, which jointly apply a hierarchical penalty to the contrastive loss and enforce the hierarchy constraint.
arXiv Detail & Related papers (2022-04-27T21:41:44Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances.
arXiv Detail & Related papers (2021-11-07T07:39:10Z)
- HTCInfoMax: A Global Model for Hierarchical Text Classification via Information Maximization [75.45291796263103]
The current state-of-the-art model for hierarchical text classification, HiAGM, has two limitations.
It correlates each text sample with all labels in the dataset, which introduces irrelevant information.
We propose HTCInfoMax to address these issues by introducing information maximization, which consists of two modules.
arXiv Detail & Related papers (2021-04-12T06:04:20Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
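To make the scoring step concrete, the sketch below (an illustration under assumptions, not the cited paper's method) scores one document against every label description by cosine similarity of pre-computed embeddings in a shared space; how those embeddings are produced is left open, and the quality of the scores depends entirely on the label descriptions, which is exactly the sensitivity the summary points out.

```python
# Minimal dataless scoring: one score per label, computed only from a label description.
import numpy as np

def dataless_scores(doc_vec: np.ndarray, label_desc_vecs: np.ndarray) -> np.ndarray:
    """doc_vec: [d] document embedding; label_desc_vecs: [L, d] label-description embeddings."""
    doc = doc_vec / np.linalg.norm(doc_vec)
    labels = label_desc_vecs / np.linalg.norm(label_desc_vecs, axis=1, keepdims=True)
    return labels @ doc   # cosine similarity with each label; argmax or a threshold gives the prediction
```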
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- Exploring the Hierarchy in Relation Labels for Scene Graph Generation [75.88758055269948]
Experiments show that the proposed simple yet effective method improves several state-of-the-art baselines by a large margin (up to 33% relative gain) in terms of Recall@50.
arXiv Detail & Related papers (2020-09-12T17:36:53Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.