Seeded Hierarchical Clustering for Expert-Crafted Taxonomies
- URL: http://arxiv.org/abs/2205.11602v1
- Date: Mon, 23 May 2022 19:58:06 GMT
- Title: Seeded Hierarchical Clustering for Expert-Crafted Taxonomies
- Authors: Anish Saha, Amith Ananthram, Emily Allaway, Heng Ji, Kathleen McKeown
- Abstract summary: We propose HierSeed, a weakly supervised algorithm for fitting unlabeled corpora to expert-crafted taxonomies.
It is both data and computationally efficient.
It outperforms both unsupervised and supervised baselines for the SHC task on three real-world datasets.
- Score: 48.10324642720299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Practitioners from many disciplines (e.g., political science) use
expert-crafted taxonomies to make sense of large, unlabeled corpora. In this
work, we study Seeded Hierarchical Clustering (SHC): the task of automatically
fitting unlabeled data to such taxonomies using only a small set of labeled
examples. We propose HierSeed, a novel weakly supervised algorithm for this
task that uses only a small set of labeled seed examples. It is both data and
computationally efficient. HierSeed assigns documents to topics by weighing
document density against topic hierarchical structure. It outperforms both
unsupervised and supervised baselines for the SHC task on three real-world
datasets.
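The abstract describes assigning documents to taxonomy topics by balancing document density against the topic hierarchy, starting from a few labeled seeds. The following is a minimal illustrative sketch of that seeded-assignment idea, not the authors' HierSeed implementation: the taxonomy, the toy embeddings, and the blended scoring rule (topic-center similarity plus a parent-similarity term) are all assumptions made for demonstration.

```python
import numpy as np

# Toy taxonomy: child -> parent (None for root-level topics).
# Topic names and vectors are illustrative, not from the paper.
parent = {"politics": None, "elections": "politics",
          "economy": None, "inflation": "economy"}

# Seed examples: a few labeled document embeddings per topic.
seeds = {
    "politics":  [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "elections": [[0.8, 0.6, 0.0]],
    "economy":   [[0.0, 0.1, 1.0]],
    "inflation": [[0.1, 0.0, 0.9]],
}

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Topic centers: mean of normalized seed embeddings (a crude
# stand-in for the density estimate the abstract mentions).
centers = {t: normalize(np.mean([normalize(s) for s in vs], axis=0))
           for t, vs in seeds.items()}

def assign(doc, alpha=0.7):
    """Score each topic by similarity to its own center, blended
    with similarity to its parent's center (hierarchy-aware term)."""
    d = normalize(doc)
    best, best_score = None, -np.inf
    for t, c in centers.items():
        score = alpha * float(d @ c)
        p = parent[t]
        if p is not None:
            score += (1 - alpha) * float(d @ centers[p])
        if score > best_score:
            best, best_score = t, score
    return best

print(assign([0.85, 0.55, 0.05]))  # a document near the "elections" seeds
```

A child topic is rewarded both for matching its own seeds and for being consistent with its parent, so assignments respect the hierarchy rather than treating topics as a flat label set.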
Related papers
- HiLight: A Hierarchy-aware Light Global Model with Hierarchical Local Contrastive Learning [3.889612454093451]
Hierarchical text classification (HTC) is a sub-task of multi-label classification (MLC)
We propose a new learning task to introduce the hierarchical information, called Hierarchical Local Contrastive Learning (HiLCL)
arXiv Detail & Related papers (2024-08-11T14:26:58Z)
- TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision [41.05874642535256]
Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy.
Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data.
We work on hierarchical text classification with the minimal amount of supervision: using the sole class name of each node as the only supervision.
arXiv Detail & Related papers (2024-02-29T22:26:07Z)
- Adopting the Multi-answer Questioning Task with an Auxiliary Metric for Extreme Multi-label Text Classification Utilizing the Label Hierarchy [10.87653109398961]
This paper adopts the multi-answer questioning task for extreme multi-label classification.
This study adopts the proposed method and the evaluation metric to the legal domain.
arXiv Detail & Related papers (2023-03-02T08:40:31Z)
- Hierarchical Multi-Label Classification of Scientific Documents [47.293189105900524]
We introduce a new dataset for hierarchical multi-label text classification of scientific papers called SciHTC.
This dataset contains 186,160 papers and 1,233 categories from the ACM CCS tree.
Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities.
arXiv Detail & Related papers (2022-11-05T04:12:57Z)
- TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters [57.59286394188025]
We propose a novel framework for topic taxonomy completion, named TaxoCom.
TaxoCom discovers novel sub-topic clusters of terms and documents.
Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom generates a high-quality topic taxonomy in terms of both term coherency and topic coverage.
arXiv Detail & Related papers (2022-01-18T07:07:38Z)
- Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation [58.64720318755764]
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data.
Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples.
We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data.
arXiv Detail & Related papers (2021-06-11T05:01:24Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- GPU-based Self-Organizing Maps for Post-Labeled Few-Shot Unsupervised Learning [2.922007656878633]
Few-shot classification is a challenge in machine learning where the goal is to train a classifier using a very limited number of labeled examples.
We consider the problem of post-labeled few-shot unsupervised learning, a classification task where representations are learned in an unsupervised fashion, to be later labeled using very few annotated examples.
arXiv Detail & Related papers (2020-09-04T13:22:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.