End-to-End Learning to Index and Search in Large Output Spaces
- URL: http://arxiv.org/abs/2210.08410v1
- Date: Sun, 16 Oct 2022 01:34:17 GMT
- Title: End-to-End Learning to Index and Search in Large Output Spaces
- Authors: Nilesh Gupta, Patrick H. Chen, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon
- Abstract summary: Extreme multi-label classification (XMC) is a popular framework for solving real-world problems.
In this paper, we propose a novel method which relaxes the tree-based index to a specialized weighted graph-based index.
ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extreme multi-label classification (XMC) is a popular framework for solving
many real-world problems that require accurate prediction from a very large
number of potential output choices. A popular approach for dealing with the
large label space is to arrange the labels into a shallow tree-based index and
then learn an ML model to efficiently search this index via beam search.
Existing methods initialize the tree index by clustering the label space into a
few mutually exclusive clusters based on pre-defined features and keep it fixed
throughout the training procedure. This approach results in a sub-optimal
indexing structure over the label space and limits the search performance to
the quality of choices made during the initialization of the index. In this
paper, we propose a novel method ELIAS which relaxes the tree-based index to a
specialized weighted graph-based index which is learned end-to-end with the
final task objective. More specifically, ELIAS models the discrete
cluster-to-label assignments in the existing tree-based index as soft learnable
parameters that are learned jointly with the rest of the ML model. ELIAS
achieves state-of-the-art performance on several large-scale extreme
classification benchmarks with millions of labels. In particular, ELIAS can be
up to 2.5% better at precision@1 and up to 4% better at recall@100 than
existing XMC methods. A PyTorch implementation of ELIAS along with other
resources is available at https://github.com/nilesh2797/ELIAS.
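The core idea of ELIAS — replacing hard cluster-to-label assignments with soft, learnable ones so the index can be trained end-to-end — can be sketched as follows. This is a toy illustration with invented parameter names and a stand-in linear scorer, not the authors' PyTorch implementation: the adjacency logits `A_logits` are an ordinary parameter, so gradients from the final task loss could flow into the index structure itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, C, L = 16, 4, 20          # feature dim, clusters, labels

W_cluster = rng.normal(size=(C, d))   # cluster scorer (learned)
W_label   = rng.normal(size=(L, d))   # per-label classifier (learned)
A_logits  = rng.normal(size=(C, L))   # soft cluster-to-label assignments (learned)

def score_labels(x, beam=2):
    # stage 1: score clusters and keep a beam of the best ones
    p_cluster = softmax(W_cluster @ x)
    top = np.argsort(-p_cluster)[:beam]
    # stage 2: the soft adjacency routes cluster mass to labels,
    # combined with each label's own classifier score
    A = softmax(A_logits, axis=1)         # each cluster's row sums to 1
    sig = 1.0 / (1.0 + np.exp(-(W_label @ x)))
    p_label = np.zeros(L)
    for c in top:
        p_label += p_cluster[c] * A[c] * sig
    return p_label

x = rng.normal(size=d)
scores = score_labels(x)
print(np.argsort(-scores)[:5])   # top-5 predicted labels
```

In a fixed tree index, each row of the adjacency would be a one-hot vector chosen at initialization; relaxing it to a softmax over logits is what lets the assignment be corrected during training.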
Related papers
- LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries
Top-k kNN spatial keyword queries (TkQs) return a list of objects ranked by a function that considers both spatial and textual relevance.
Building an effective and efficient index faces two key challenges: the absence of high-quality labels and unbalanced results.
We develop a novel pseudo-label generation technique to address both challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
- Decision Making for Hierarchical Multi-label Classification with Multidimensional Local Precision Rate
We introduce a new statistic called the multidimensional local precision rate (mLPR) for each object in each class.
We show that classification decisions made by simply sorting objects across classes in descending order of their mLPRs can, in theory, ensure the class hierarchy.
In response, we introduce HierRank, a new algorithm that maximizes an empirical version of CATCH using estimated mLPRs while respecting the hierarchy.
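What "respecting the hierarchy" means for a ranking can be sketched with a toy rule: a node's effective score is capped by its parent's, so no class ever outranks its ancestors. The node names and scores are invented, the scores stand in for estimated mLPRs (whose definition is omitted here), and the cap-by-parent rule is a simplification for illustration, not the HierRank algorithm itself.

```python
# toy hierarchy: child -> parent; scores stand in for estimated mLPRs
parent = {"b": "a", "c": "a", "d": "b"}
score = {"a": 0.4, "b": 0.9, "c": 0.2, "d": 0.8}

def capped(node):
    # a node must not outrank its ancestors: cap its score by the parent's
    s = score[node]
    if node in parent:
        s = min(s, capped(parent[node]))
    return s

# stable sort keeps parents ahead of equally-capped descendants
ranking = sorted(score, key=lambda n: -capped(n))
print(ranking)  # → ['a', 'b', 'd', 'c']
```

Note that "b" has the highest raw score but is ranked after its parent "a", which is exactly the constraint a hierarchy-respecting ranking enforces.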
arXiv Detail & Related papers (2022-05-16T17:43:35Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that doesn't force any structure in the search space: using all n-grams in a passage as its possible identifiers.
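The "every n-gram is an identifier" idea can be sketched with a plain inverted map from n-grams to the passages that contain them; the passage texts and IDs below are invented, and this toy dictionary stands in for the compressed substring index a real autoregressive retriever would decode against.

```python
from collections import defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# build: every n-gram of a passage is a valid identifier for it
passages = {
    "p1": "learned indexes replace search trees".split(),
    "p2": "beam search over a label tree".split(),
}
index = defaultdict(set)
for pid, toks in passages.items():
    for n in (1, 2, 3):
        for g in ngrams(toks, n):
            index[g].add(pid)

# "decoding" an n-gram retrieves every passage containing it
print(sorted(index[("search",)]))        # → ['p1', 'p2']
print(sorted(index[("label", "tree")]))  # → ['p2']
```

Longer generated n-grams are more specific, so the model can trade ambiguity for precision by decoding a longer substring.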
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- A Learned Index for Exact Similarity Search in Metric Spaces
LIMS uses data clustering and pivot-based data transformation techniques to build learned indexes.
Machine learning models are developed to approximate the position of each data record on the disk.
Extensive experiments on real-world and synthetic datasets demonstrate the superiority of LIMS compared with traditional indexes.
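The two ingredients named above — a pivot-based transformation to a one-dimensional key, and a model that approximates each record's position — can be sketched as follows. Everything here (the uniform data, the single pivot, the linear position model) is an invented simplification, not LIMS itself.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(size=(200, 2))
pivot = np.array([0.0, 0.0])

# pivot-based transform: one-dimensional key = distance to the pivot
keys = np.linalg.norm(points - pivot, axis=1)
sorted_keys = np.sort(keys)
n = len(sorted_keys)

# "model": fit position ≈ a*key + b by least squares
a, b = np.polyfit(sorted_keys, np.arange(n), 1)

def lookup(q):
    # predicted position, corrected by a short local scan to the exact slot
    pos = int(np.clip(round(a * q + b), 0, n - 1))
    while pos > 0 and sorted_keys[pos - 1] >= q:
        pos -= 1
    while pos < n - 1 and sorted_keys[pos] < q:
        pos += 1
    return pos

q = sorted_keys[120]
print(lookup(q) == 120)  # → True: exact position recovered after the scan
```

The local scan is what makes the search exact despite an approximate model; the better the model, the shorter the scan.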
arXiv Detail & Related papers (2022-04-21T11:24:55Z)
- Label Disentanglement in Partition-based Extreme Multilabel Classification
We show that the label assignment problem in partition-based XMC can be formulated as an optimization problem.
We show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
arXiv Detail & Related papers (2021-06-24T03:24:18Z)
- IRLI: Iterative Re-partitioning for Learning to Index
Methods must trade off high accuracy against load balance and scalability in distributed settings.
We propose a novel approach called IRLI, which iteratively partitions the items by learning the relevant buckets directly from the query-item relevance data.
We mathematically show that IRLI retrieves the correct item with high probability under very natural assumptions and provides superior load balancing.
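The item re-partitioning step — moving each item to the bucket that receives most of the queries for which the item is relevant — can be sketched as below. The queries, items, and the fixed query-to-bucket routing are all invented; in IRLI that routing is a learned classifier and the two steps alternate, whereas this toy shows a single re-partitioning pass.

```python
from collections import Counter, defaultdict

# toy query-item relevance data: which items each query should retrieve
relevance = {
    "q_fruit_a": ["apple", "pear"],
    "q_fruit_b": ["pear", "plum"],
    "q_tool_a":  ["drill", "saw"],
    "q_tool_b":  ["saw"],
}

# fixed routing of queries to buckets (learned in the real method)
query_bucket = {"q_fruit_a": 0, "q_fruit_b": 0, "q_tool_a": 1, "q_tool_b": 1}

def repartition():
    # each item moves to the bucket that most of its relevant queries hit
    votes = defaultdict(Counter)
    for q, items in relevance.items():
        for item in items:
            votes[item][query_bucket[q]] += 1
    return {item: c.most_common(1)[0][0] for item, c in votes.items()}

assign = repartition()
print(assign["pear"], assign["saw"])  # → 0 1
```

Alternating this reassignment with retraining the query router is what lets the partition and the classifier improve each other.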
arXiv Detail & Related papers (2021-03-17T23:13:25Z)
- Probabilistic Label Trees for Extreme Multi-label Classification
Problems of extreme multi-label classification (XMLC) are efficiently handled by organizing labels as a tree.
PLTs can be treated as a generalization of hierarchical softmax for multi-label problems.
We introduce the model and discuss training and inference procedures and their computational costs.
We prove a specific equivalence between the fully online algorithm and an algorithm with a tree structure given in advance.
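The PLT factorization itself is compact: a label's probability is the product of per-node conditional probabilities along the root-to-leaf path, which is how the hierarchical-softmax generalization works. The tree and the conditional values below are invented for illustration; in a trained PLT each conditional would come from a per-node binary classifier evaluated on the input.

```python
# toy tree: node -> parent; leaves l0..l3 are labels
parent = {"l0": "n0", "l1": "n0", "l2": "n1", "l3": "n1",
          "n0": "root", "n1": "root"}

# per-node conditionals P(node relevant | parent relevant, x)
cond = {"n0": 0.9, "n1": 0.2, "l0": 0.8, "l1": 0.3, "l2": 0.5, "l3": 0.5}

def label_prob(leaf):
    # P(label | x) = product of conditionals along the root-to-leaf path
    p, node = 1.0, leaf
    while node != "root":
        p *= cond[node]
        node = parent[node]
    return p

print(round(label_prob("l0"), 3))  # → 0.72  (0.9 * 0.8)
```

Because each node only conditions on its parent being relevant, prediction can prune whole subtrees whose internal-node probability is already low.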
arXiv Detail & Related papers (2020-09-23T15:30:00Z)
- The Case for Learned Spatial Indexes
We use techniques proposed for a state-of-the-art learned multi-dimensional index structure (namely, Flood) to answer spatial range queries.
We show that (i) machine-learned search within a partition is 11.79% to 39.51% faster than binary search when filtering on one dimension, and (ii) refinement using machine-learned indexes is 1.23x to 1.83x faster than the closest competitor, which filters on two dimensions.
arXiv Detail & Related papers (2020-08-24T12:09:55Z)
- Tree Index: A New Cluster Evaluation Technique
We introduce a cluster evaluation technique called Tree Index.
Our Tree Index finds margins among clusters for easy learning, without the complications of Minimum Description Length.
We show that, on clustering results obtained by various techniques on a brain dataset, Tree Index discriminates between reasonable and non-sensible clusters.
arXiv Detail & Related papers (2020-03-24T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.