On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification
- URL: http://arxiv.org/abs/2408.09585v1
- Date: Sun, 18 Aug 2024 20:08:42 GMT
- Title: On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification
- Authors: Jatin Prakash, Anirudh Buvanesh, Bishal Santra, Deepak Saini, Sachin Yadav, Jian Jiao, Yashoteja Prabhu, Amit Sharma, Manik Varma,
- Abstract summary: Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set.
We observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents.
We propose SKIM, an algorithm that leverages a combination of small LM and abundant unstructured meta-data to effectively mitigate the missing label problem.
- Score: 17.309987565818577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods such as propensity weighting and data imputation strategies that solely rely on the training dataset. While LLMs provide an attractive solution to augment the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose SKIM (Scalable Knowledge Infusion for Missing Labels), an algorithm that leverages a combination of small LM and abundant unstructured meta-data to effectively mitigate the missing label problem. We show the efficacy of our method on large-scale public datasets through exhaustive unbiased evaluation ranging from human annotations to simulations inspired from industrial settings. SKIM outperforms existing methods on Recall@100 by more than 10 absolute points. Additionally, SKIM scales to proprietary query-ad retrieval datasets containing 10 million documents, outperforming contemporary methods by 12% in offline evaluation and increased ad click-yield by 1.23% in an online A/B test conducted on a popular search engine. We release our code, prompts, trained XC models and finetuned SLMs at: https://github.com/bicycleman15/skim
Related papers
- IDALC: A Semi-Supervised Framework for Intent Detection and Active Learning based Correction [29.961460339925424]
IDALC is a semi-supervised framework designed to detect user intents and rectify system-rejected utterances.<n>We maintain the overall annotation cost at just 6-10% of the unlabelled data available to the system.
arXiv Detail & Related papers (2025-11-08T08:32:59Z) - Maximally-Informative Retrieval for State Space Model Generation [59.954191072042526]
We introduce Retrieval In-Context Optimization (RICO) to minimize model uncertainty for a particular query at test-time.<n>Unlike traditional retrieval-augmented generation (RAG), which relies on externals for document retrieval, our approach leverages direct feedback from the model.<n>We show that standard top-$k$ retrieval with model gradients can approximate our optimization procedure, and provide connections to the leave-one-out loss.
arXiv Detail & Related papers (2025-06-13T18:08:54Z) - DocMMIR: A Framework for Document Multi-modal Information Retrieval [21.919132888183622]
We introduce DocMMIR, a novel multi-modal document retrieval framework.<n>We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples.<n>Results show a +31% improvement in MRR@10 compared to the zero-shot baseline.
arXiv Detail & Related papers (2025-05-25T20:58:58Z) - Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task.<n>We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z) - Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning [76.50690734636477]
We introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task.
Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries.
arXiv Detail & Related papers (2025-03-08T03:14:26Z) - Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets [8.1990111961557]
We investigate the behavior of state-of-the-art retrieval algorithms on massive datasets.
We compare and contrast the recently-proposed Seismic and graph-based solutions adapted from dense retrieval.
We extensively evaluate Splade embeddings of 138M passages from MsMarco-v2 and report indexing time and other efficiency and effectiveness metrics.
arXiv Detail & Related papers (2025-01-20T17:59:21Z) - Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets.
LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student.
Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
Data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets.<n>With the surging ability of large language models (LLM), it's promising to streamline the discovery of hidden dataset issues with LLM agents.<n>In this work, we establish a benchmark to measure LLM agent's ability to tackle this challenge.
arXiv Detail & Related papers (2024-06-11T14:02:23Z) - Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery [6.037276428689637]
This paper introduces DISCOvery Graph (DISCOG), a hybrid approach that combines the strengths of two worlds: a graph-based method for accurate document relevance prediction.
Our approach drastically reduces document review costs by 99.9% compared to manual processes and by 95% compared to LLM-based classification methods.
arXiv Detail & Related papers (2024-05-29T15:08:55Z) - LORD: Leveraging Open-Set Recognition with Unknown Data [10.200937444995944]
LORD is a framework to Leverage Open-set Recognition by exploiting unknown data.
We identify three model-agnostic training strategies that exploit background data and applied them to well-established classifiers.
arXiv Detail & Related papers (2023-08-24T06:12:41Z) - CELDA: Leveraging Black-box Language Model as Enhanced Classifier
without Labels [14.285609493077965]
Clustering-enhanced Linear Discriminative Analysis, a novel approach that improves the text classification accuracy with a very weak-supervision signal.
Our framework draws a precise decision boundary without accessing weights or gradients of the LM model or data labels.
arXiv Detail & Related papers (2023-06-05T08:35:31Z) - Is margin all you need? An extensive empirical study of active learning
on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art.
arXiv Detail & Related papers (2022-10-07T21:18:24Z) - Prompt-driven efficient Open-set Semi-supervised Learning [52.30303262499391]
Open-set semi-supervised learning (OSSL) has attracted growing interest, which investigates a more practical scenario where out-of-distribution (OOD) samples are only contained in unlabeled data.
We propose a prompt-driven efficient OSSL framework, called OpenPrompt, which can propagate class information from labeled to unlabeled data with only a small number of trainable parameters.
arXiv Detail & Related papers (2022-09-28T16:25:08Z) - Spacing Loss for Discovering Novel Categories [72.52222295216062]
Novel Class Discovery (NCD) is a learning paradigm, where a machine learning model is tasked to semantically group instances from unlabeled data.
We first characterize existing NCD approaches into single-stage and two-stage methods based on whether they require access to labeled and unlabeled data together.
We devise a simple yet powerful loss function that enforces separability in the latent space using cues from multi-dimensional scaling.
arXiv Detail & Related papers (2022-04-22T09:37:11Z) - Towards Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k image subset of the ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z) - End-to-End Learning from Noisy Crowd to Supervised Machine Learning
Models [6.278267504352446]
We advocate using hybrid intelligence, i.e., combining deep models and human experts, to design an end-to-end learning framework from noisy crowd-sourced data.
We show how label aggregation can benefit from estimating the annotators' confusion matrix to improve the learning process.
We demonstrate the effectiveness of our strategies on several image datasets, using SVM and deep neural networks.
arXiv Detail & Related papers (2020-11-13T09:48:30Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.