Related papers: Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)

URL: http://arxiv.org/abs/2507.03761v1
Date: Fri, 04 Jul 2025 18:17:52 GMT
Title: Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)
Authors: Celso França, Gestefane Rabbi, Thiago Salles, Washington Cunha, Leonardo Rocha, Marcos André Gonçalves
Abstract summary: The long-tail distribution of labels is a significant challenge in Extreme Multi-label Text Classification (XMTC). Labels can be broadly categorized into frequent, high-coverage head labels and infrequent, low-coverage tail labels. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers.
Score: 7.817991268974576
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage head labels and infrequent, low-coverage tail labels, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of sparse and dense retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers, thereby producing a final ranking that improves the effectiveness across both head and tail labels.
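
To make the idea of rank-based fusion concrete, here is a minimal sketch using Reciprocal Rank Fusion (RRF), a widely used rank-fusion scheme. The abstract does not say which fusion algorithms the paper evaluates, so RRF is an assumption chosen for illustration, as are the function name `reciprocal_rank_fusion`, the constant `k=60`, and the toy label rankings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked label lists into a single ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (60 is a commonly used default, assumed here).
    top_n: optional cut-off for the fused list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a label missing from a list contributes nothing.
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by label name for determinism.
    fused = sorted(scores, key=lambda label: (-scores[label], label))
    return fused if top_n is None else fused[:top_n]


# Hypothetical label rankings for one document (illustrative only).
sparse_ranking = ["rare_disease_x", "oncology", "cardiology"]  # e.g. from BM25
dense_ranking = ["oncology", "medicine", "rare_disease_x"]     # e.g. from a BERT retriever

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and embedding similarities, which live on different scales. Labels ranked well by both retrievers rise to the top, while a label found by only one retriever still survives in the fused list.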

Related papers

LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking [10.418399727644859]
Long tail challenges have persistently posed difficulties in accurately classifying less frequent labels. This paper introduces LabelCoRank, a novel approach inspired by ranking principles. LabelCoRank effectively mitigates long tail issues in multi-label text classification.
arXiv Detail & Related papers (2025-03-11T01:52:39Z)
MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification [13.799733640048672]
eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels from a large-scale label set to a text sample. We propose MatchXML, an efficient text-label matching framework for XMC. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets.
arXiv Detail & Related papers (2023-08-25T02:32:36Z)
Label Dependencies-aware Set Prediction Networks for Multi-label Text Classification [0.0]
We leverage Graph Convolutional Networks and construct an adjacency matrix based on the statistical relations between labels. We enhance recall ability by applying the Bhattacharyya distance to the output distributions of the set prediction networks.
arXiv Detail & Related papers (2023-04-14T09:31:17Z)
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in vision-language models, i.e., CLIP, to compensate for insufficient annotations. We advocate remedying the deficiency of label supervision for MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
Empowering Sentence Encoders with Prompting and Label Retrieval for Zero-shot Text Classification [5.484132137132862]
Our framework, RaLP, encodes prompted label candidates with a sentence encoder, then assigns the label whose prompt embedding has the highest similarity with the input text embedding. RaLP achieves competitive or stronger performance than much larger baselines on various closed-set classification and multiple-choice QA datasets.
arXiv Detail & Related papers (2022-12-20T16:18:03Z)
Pairwise Instance Relation Augmentation for Long-tailed Multi-label Text Classification [38.66674700075432]
We propose a Pairwise Instance Relation Augmentation Network (PIRAN) to augment tailed-label documents for balancing tail labels and head labels. PIRAN consistently outperforms the SOTA methods, and dramatically improves the performance of tail labels.
arXiv Detail & Related papers (2022-11-19T12:45:54Z)
Text Summarization with Oracle Expectation [88.39032981994535]
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document. Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy. We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
Multi-label Classification with High-rank and High-order Label Correlations [62.39748565407201]
Previous methods capture the high-order label correlations mainly by transforming the label matrix to a latent label space with low-rank matrix factorization. We propose a simple yet effective method to depict the high-order label correlations explicitly, and at the same time maintain the high-rank of the label matrix. Comparative studies over twelve benchmark data sets validate the effectiveness of the proposed algorithm in multi-label classification.
arXiv Detail & Related papers (2022-07-09T05:15:31Z)
Long-tailed Extreme Multi-label Text Classification with Generated Pseudo Label Descriptions [28.416742933744942]
This paper addresses the challenge of tail label prediction by proposing a novel approach. It draws on the effectiveness of a trained bag-of-words (BoW) classifier in generating informative label descriptions under severely data-scarce conditions. The proposed approach achieves state-of-the-art performance on XMTC benchmark datasets and significantly outperforms the best methods so far in tail label prediction.
arXiv Detail & Related papers (2022-04-02T23:42:32Z)
Rank-Consistency Deep Hashing for Scalable Multi-Label Image Search [90.30623718137244]
We propose a novel deep hashing method for scalable multi-label image search. A new rank-consistency objective is applied to align the similarity orders from two spaces. A powerful loss function is designed to penalize the samples whose semantic similarity and hamming distance are mismatched.
arXiv Detail & Related papers (2021-02-02T13:46:58Z)
PseudoSeg: Designing Pseudo Labels for Semantic Segmentation [78.35515004654553]
We present a re-design of pseudo-labeling to generate structured pseudo labels for training with unlabeled or weakly-labeled data. We demonstrate the effectiveness of the proposed pseudo-labeling strategy in both low-data and high-data regimes.
arXiv Detail & Related papers (2020-10-19T17:59:30Z)
Interaction Matching for Long-Tail Multi-Label Classification [57.262792333593644]
We present an elegant and effective approach for addressing limitations in existing multi-label classification models. By performing soft n-gram interaction matching, we match labels with natural language descriptions.
arXiv Detail & Related papers (2020-05-18T15:27:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.