End-to-End Learning to Index and Search in Large Output Spaces
- URL: http://arxiv.org/abs/2210.08410v1
- Date: Sun, 16 Oct 2022 01:34:17 GMT
- Title: End-to-End Learning to Index and Search in Large Output Spaces
- Authors: Nilesh Gupta, Patrick H. Chen, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon
- Abstract summary: Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extreme multi-label classification (XMC) is a popular framework for solving
many real-world problems that require accurate prediction from a very large
number of potential output choices. A popular approach for dealing with the
large label space is to arrange the labels into a shallow tree-based index and
then learn an ML model to efficiently search this index via beam search.
Existing methods initialize the tree index by clustering the label space into a
few mutually exclusive clusters based on pre-defined features and keep it fixed
throughout the training procedure. This approach results in a sub-optimal
indexing structure over the label space and limits the search performance to
the quality of choices made during the initialization of the index. In this
paper, we propose a novel method ELIAS which relaxes the tree-based index to a
specialized weighted graph-based index which is learned end-to-end with the
final task objective. More specifically, ELIAS models the discrete
cluster-to-label assignments in the existing tree-based index as soft learnable
parameters that are learned jointly with the rest of the ML model. ELIAS
achieves state-of-the-art performance on several large-scale extreme
classification benchmarks with millions of labels. In particular, ELIAS can be
up to 2.5% better at precision@1 and up to 4% better at recall@100 than
existing XMC methods. A PyTorch implementation of ELIAS along with other
resources is available at https://github.com/nilesh2797/ELIAS.
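The core idea of ELIAS — replacing hard cluster-to-label assignments with soft, learnable ones so the index can be trained end-to-end — can be sketched as follows. This is a toy illustration with invented parameter names and a stand-in linear scorer, not the authors' PyTorch implementation: the adjacency logits `A_logits` are an ordinary parameter, so gradients from the final task loss could flow into the index structure itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, C, L = 16, 4, 20          # feature dim, clusters, labels

W_cluster = rng.normal(size=(C, d))   # cluster scorer (learned)
W_label   = rng.normal(size=(L, d))   # per-label classifier (learned)
A_logits  = rng.normal(size=(C, L))   # soft cluster-to-label assignments (learned)

def score_labels(x, beam=2):
    # stage 1: score clusters and keep a beam of the best ones
    p_cluster = softmax(W_cluster @ x)
    top = np.argsort(-p_cluster)[:beam]
    # stage 2: the soft adjacency routes cluster mass to labels,
    # combined with each label's own classifier score
    A = softmax(A_logits, axis=1)         # each cluster's row sums to 1
    sig = 1.0 / (1.0 + np.exp(-(W_label @ x)))
    p_label = np.zeros(L)
    for c in top:
        p_label += p_cluster[c] * A[c] * sig
    return p_label

x = rng.normal(size=d)
scores = score_labels(x)
print(np.argsort(-scores)[:5])   # top-5 predicted labels
```

In a fixed tree index, each row of the adjacency would be a one-hot vector chosen at initialization; relaxing it to a softmax over logits is what lets the assignment be corrected during training.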
Related papers
- LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries
Top-k kNN spatial keyword queries (TkQs) return a list of objects ranked by a function that considers both spatial and textual relevance.
Building an effective and efficient index faces two key challenges: the absence of high-quality labels and unbalanced results.
We develop a novel pseudo-label generation technique to address both challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
- Decision Making for Hierarchical Multi-label Classification with Multidimensional Local Precision Rate
We introduce a new statistic called the multidimensional local precision rate (mLPR) for each object in each class.
We show that classification decisions made by simply sorting objects across classes in descending order of their mLPRs can, in theory, ensure the class hierarchy.
In response, we introduce HierRank, a new algorithm that maximizes an empirical version of CATCH using estimated mLPRs while respecting the hierarchy.
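What "respecting the hierarchy" means for a ranking can be sketched with a toy rule: a node's effective score is capped by its parent's, so no class ever outranks its ancestors. The node names and scores are invented, the scores stand in for estimated mLPRs (whose definition is omitted here), and the cap-by-parent rule is a simplification for illustration, not the HierRank algorithm itself.

```python
# toy hierarchy: child -> parent; scores stand in for estimated mLPRs
parent = {"b": "a", "c": "a", "d": "b"}
score = {"a": 0.4, "b": 0.9, "c": 0.2, "d": 0.8}

def capped(node):
    # a node must not outrank its ancestors: cap its score by the parent's
    s = score[node]
    if node in parent:
        s = min(s, capped(parent[node]))
    return s

# stable sort keeps parents ahead of equally-capped descendants
ranking = sorted(score, key=lambda n: -capped(n))
print(ranking)  # → ['a', 'b', 'd', 'c']
```

Note that "b" has the highest raw score but is ranked after its parent "a", which is exactly the constraint a hierarchy-respecting ranking enforces.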
arXiv Detail & Related papers (2022-05-16T17:43:35Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
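The "every n-gram is an identifier" idea can be sketched with a plain inverted map from n-grams to the passages that contain them; the passage texts and IDs below are invented, and this toy dictionary stands in for the compressed substring index a real autoregressive retriever would decode against.

```python
from collections import defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# build: every n-gram of a passage is a valid identifier for it
passages = {
    "p1": "learned indexes replace search trees".split(),
    "p2": "beam search over a label tree".split(),
}
index = defaultdict(set)
for pid, toks in passages.items():
    for n in (1, 2, 3):
        for g in ngrams(toks, n):
            index[g].add(pid)

# "decoding" an n-gram retrieves every passage containing it
print(sorted(index[("search",)]))        # → ['p1', 'p2']
print(sorted(index[("label", "tree")]))  # → ['p2']
```

Longer generated n-grams are more specific, so the model can trade ambiguity for precision by decoding a longer substring.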
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A Learned Index for Exact Similarity Search in Metric Spaces
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
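The two ingredients named above — a pivot-based transformation to a one-dimensional key, and a model that approximates each record's position — can be sketched as follows. Everything here (the uniform data, the single pivot, the linear position model) is an invented simplification, not LIMS itself.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(size=(200, 2))
pivot = np.array([0.0, 0.0])

# pivot-based transform: one-dimensional key = distance to the pivot
keys = np.linalg.norm(points - pivot, axis=1)
sorted_keys = np.sort(keys)
n = len(sorted_keys)

# "model": fit position ≈ a*key + b by least squares
a, b = np.polyfit(sorted_keys, np.arange(n), 1)

def lookup(q):
    # predicted position, corrected by a short local scan to the exact slot
    pos = int(np.clip(round(a * q + b), 0, n - 1))
    while pos > 0 and sorted_keys[pos - 1] >= q:
        pos -= 1
    while pos < n - 1 and sorted_keys[pos] < q:
        pos += 1
    return pos

q = sorted_keys[120]
print(lookup(q) == 120)  # → True: exact position recovered after the scan
```

The local scan is what makes the search exact despite an approximate model; the better the model, the shorter the scan.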
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Label Disentanglement in Partition-based Extreme Multilabel Classification
We show that the label assignment problem in partition-based XMC can be formulated as an optimization problem.
We show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
arXiv Detail & Related papers (2021-06-24T03:24:18Z)
- IRLI: Iterative Re-partitioning for Learning to Index
Methods must trade off high accuracy against load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
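The item re-partitioning step — moving each item to the bucket that receives most of the queries for which the item is relevant — can be sketched as below. The queries, items, and the fixed query-to-bucket routing are all invented; in IRLI that routing is a learned classifier and the two steps alternate, whereas this toy shows a single re-partitioning pass.

```python
from collections import Counter, defaultdict

# toy query-item relevance data: which items each query should retrieve
relevance = {
    "q_fruit_a": ["apple", "pear"],
    "q_fruit_b": ["pear", "plum"],
    "q_tool_a":  ["drill", "saw"],
    "q_tool_b":  ["saw"],
}

# fixed routing of queries to buckets (learned in the real method)
query_bucket = {"q_fruit_a": 0, "q_fruit_b": 0, "q_tool_a": 1, "q_tool_b": 1}

def repartition():
    # each item moves to the bucket that most of its relevant queries hit
    votes = defaultdict(Counter)
    for q, items in relevance.items():
        for item in items:
            votes[item][query_bucket[q]] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}

assign = repartition()
print(assign["pear"], assign["saw"])  # → 0 1
```

Alternating this reassignment with retraining the query router is what lets the partition and the classifier improve each other.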
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
- Probabilistic Label Trees for Extreme Multi-label Classification
Problems of extreme multi-label classification (XMLC) are efficiently handled by organizing labels as a tree.
PLTs can be treated as a generalization of hierarchical softmax for multi-label problems.
We introduce the model and discuss training and inference procedures and their computational costs.
We prove a specific equivalence between the fully online algorithm and an algorithm with a tree structure given in advance.
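The PLT factorization itself is compact: a label's probability is the product of per-node conditional probabilities along the root-to-leaf path, which is how the hierarchical-softmax generalization works. The tree and the conditional values below are invented for illustration; in a trained PLT each conditional would come from a per-node binary classifier evaluated on the input.

```python
# toy tree: node -> parent; leaves l0..l3 are labels
parent = {"l0": "n0", "l1": "n0", "l2": "n1", "l3": "n1",
          "n0": "root", "n1": "root"}

# per-node conditionals P(node relevant | parent relevant, x)
cond = {"n0": 0.9, "n1": 0.2, "l0": 0.8, "l1": 0.3, "l2": 0.5, "l3": 0.5}

def label_prob(leaf):
    # P(label | x) = product of conditionals along the root-to-leaf path
    p, node = 1.0, leaf
    while node != "root":
        p *= cond[node]
        node = parent[node]
    return p

print(round(label_prob("l0"), 3))  # → 0.72  (0.9 * 0.8)
```

Because each node only conditions on its parent being relevant, prediction can prune whole subtrees whose internal-node probability is already low.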
arXiv Detail & Related papers (2020-09-23T15:30:00Z)
- The Case for Learned Spatial Indexes
We use techniques proposed for a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension, and (ii) refinement using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- Tree Index: A New Cluster Evaluation Technique
We introduce a cluster evaluation technique called Tree Index.
Our Tree Index finds margins among clusters for easy learning, without the complications of Minimum Description Length.
We show that, on clustering results obtained by various techniques on a brain dataset, Tree Index discriminates between reasonable and non-sensible clusters.
arXiv Detail & Related papers (2020-03-24T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.