Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders
- URL: http://arxiv.org/abs/2402.01963v1
- Date: Sat, 3 Feb 2024 00:11:29 GMT
- Title: Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders
- Authors: Francisco J. Ribadas-Pena, Shuyuan Cao, Víctor M. Darriba Bilbao
- Abstract summary: We introduce a multi-label lazy learning approach to deal with automatic semantic indexing in large document collections.
The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm.
We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a multi-label lazy learning approach to deal with
automatic semantic indexing in large document collections in the presence of
complex and structured label vocabularies with high inter-label correlation.
The proposed method is an evolution of the traditional k-Nearest Neighbors
algorithm, which uses a large autoencoder trained to map the large label space
to a reduced-size latent space and to regenerate the predicted labels from this
latent space. We have evaluated our proposal on a large portion of the MEDLINE
biomedical document collection which uses the Medical Subject Headings (MeSH)
thesaurus as a controlled vocabulary. In our experiments we propose and
evaluate several document representation approaches and different label
autoencoder configurations.
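The pipeline described in the abstract (retrieve the k nearest training documents, aggregate their label assignments, then compress and regenerate that aggregate through a label autoencoder) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a truncated-SVD linear map stands in for the trained neural autoencoder, random arrays stand in for document features and MeSH label vectors, and all names (`encode`, `decode`, `predict_labels`) and the decision threshold are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: document feature vectors and multi-hot label
# vectors for a training collection (~10% of labels active per document).
n_docs, n_feats, n_labels, latent_dim, k = 200, 50, 30, 8, 10
X_train = rng.random((n_docs, n_feats))
Y_train = (rng.random((n_docs, n_labels)) < 0.1).astype(float)

# Linear "autoencoder" over the label space via truncated SVD: a simple
# stand-in for the paper's trained neural autoencoder.
_, _, Vt = np.linalg.svd(Y_train, full_matrices=False)
V_k = Vt[:latent_dim].T          # (n_labels, latent_dim)

def encode(y):
    """Project a label vector into the reduced latent space."""
    return y @ V_k

def decode(z):
    """Regenerate label scores from the latent representation."""
    return z @ V_k.T

def predict_labels(x_query, threshold=0.05):
    """k-NN label prediction regenerated through the label latent space."""
    # Cosine similarity between the query and every training document.
    sims = (X_train @ x_query) / (
        np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_query) + 1e-12)
    nn = np.argsort(sims)[-k:]                 # indices of the k nearest docs
    # Similarity-weighted aggregation of the neighbours' label vectors.
    agg = sims[nn] @ Y_train[nn] / sims[nn].sum()
    # Compress and regenerate: correlated labels reinforce each other here.
    scores = decode(encode(agg))
    return (scores >= threshold).astype(int)

pred = predict_labels(rng.random(n_feats))
```

The encode/decode round trip is what distinguishes this from plain kNN voting: regeneration through the latent space can promote labels that are strongly correlated with the neighbours' labels even when they were rare among the neighbours themselves.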
Related papers
- Prototypical Extreme Multi-label Classification with a Dynamic Margin Loss [6.244642999033755]
Extreme Multi-label Classification (XMC) methods predict relevant labels for a given query in an extremely large label space.
Recent works in XMC address this problem using deep encoders that project text descriptions to an embedding space suitable for recovering the closest labels.
We propose PRIME, an XMC method that employs a novel prototypical contrastive learning technique to reconcile efficiency and performance, surpassing brute-force approaches.
arXiv Detail & Related papers (2024-10-27T10:24:23Z)
- Data-driven Coreference-based Ontology Building [48.995395445597225]
Coreference resolution is traditionally used as a component in individual document understanding.
We take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations.
We release the resulting coreference chains under a Creative Commons license, along with the code.
arXiv Detail & Related papers (2024-10-22T14:30:40Z)
- Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification [9.391704905671476]
This paper aims to rethink the data challenges in medical texts and present a novel framework-agnostic algorithm called Text2Tree.
We embed the ICD code tree structure of labels into cascade attention modules for learning hierarchy-aware label representations.
Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are devised to boost text classification by reusing and distinguishing samples of other labels.
arXiv Detail & Related papers (2023-11-28T10:02:08Z)
- Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training [24.2734548438594]
We propose a weakly-supervised approach for scientific document classification using label names only.
In scientific domains, label names often include domain-specific concepts that may not appear in the document corpus.
We show that WANDER outperforms the best baseline by 11.9% on average.
arXiv Detail & Related papers (2023-06-12T15:50:13Z)
- Retrieval-augmented Multi-label Text Classification [20.100081284294973]
Multi-label text classification is a challenging task in settings with large label sets.
Retrieval augmentation aims to improve the sample efficiency of classification models.
We evaluate this approach on four datasets from the legal and biomedical domains.
arXiv Detail & Related papers (2023-05-22T14:16:23Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few-shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Label-Wise Document Pre-Training for Multi-Label Text Classification [14.439051753832032]
This paper develops Label-Wise Pre-Training (LW-PT) method to get a document representation with label-aware information.
The basic idea is that a multi-label document can be represented as a combination of multiple label-wise representations, and that correlated labels always co-occur in the same or similar documents.
arXiv Detail & Related papers (2020-08-15T10:34:27Z)
- Interaction Matching for Long-Tail Multi-Label Classification [57.262792333593644]
We present an elegant and effective approach for addressing limitations in existing multi-label classification models.
By performing soft n-gram interaction matching, we match labels with natural language descriptions.
arXiv Detail & Related papers (2020-05-18T15:27:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.