Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
- URL: http://arxiv.org/abs/2505.18366v1
- Date: Fri, 23 May 2025 20:51:20 GMT
- Title: Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
- Authors: Hansa Meghwani, Amit Agarwal, Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Srikant Panda,
- Abstract summary: We propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data.<n>Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models.
- Score: 2.4830284216463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models. Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15\% in MRR@3 and 19\% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability and readiness for real-world applications.
Related papers
- MetaGen Blended RAG: Higher Accuracy for Domain-Specific Q&A Without Fine-Tuning [0.0]
We propose an approach for Enterprise Search that focuses on enhancing the retriever for a domain-specific corpus through hybrid query indexes and metadata enrichment.<n>This 'MetaGen Blended RAG' method constructs a metadata generation pipeline using key concepts, topics, and acronyms, and then creates a metadata-enriched hybrid index with boosted search queries.<n>On the PubMedQA benchmark for the biomedical domain, the proposed method achieves 82% retrieval accuracy and 77% RAG accuracy, surpassing all previous RAG accuracy results without fine-tuning and sets a new benchmark for zero-shot results while outperforming much larger models like GPT3.5.
arXiv Detail & Related papers (2025-05-23T17:18:45Z) - Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration [0.0]
Archias is an expert model adept at distinguishing between in-domain and out-of-domain communications.<n>Archias classifies user inquiries into several categories: in-domain (specifically for the automotive industry), malicious questions, price injections, prompt injections, and out-of-domain examples.<n>Archias can be adjusted, fine-tuned, and used for many different purposes due to its small size.
arXiv Detail & Related papers (2025-05-18T16:13:07Z) - TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text [11.417612899344697]
Accurately identifying adversarial techniques in security texts is critical for effective cyber defense.<n>Existing methods face a fundamental trade-off: they either rely on generic models with limited domain precision or require resource-intensive pipelines.<n>We propose TechniqueRAG, a domain-specific retrieval-augmented generation (RAG) framework that bridges this gap by integrating off-the-shelf retrievers, instruction-tuned LLMs, and minimal text-technique pairs.
arXiv Detail & Related papers (2025-05-17T12:46:10Z) - Exploiting Aggregation and Segregation of Representations for Domain Adaptive Human Pose Estimation [50.31351006532924]
Human pose estimation (HPE) has received increasing attention recently due to its wide application in motion analysis, virtual reality, healthcare, etc.<n>It suffers from the lack of labeled diverse real-world datasets due to the time- and labor-intensive annotation.<n>We introduce a novel framework that capitalizes on both representation aggregation and segregation for domain adaptive human pose estimation.
arXiv Detail & Related papers (2024-12-29T17:59:45Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.<n>We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization [85.18995948334592]
Single domain generalization (single DG) aims at learning a robust model generalizable to unseen domains from only one training domain.
State-of-the-art approaches have mostly relied on data augmentations, such as adversarial perturbation and style enhancement, to synthesize new data.
We propose emphStyDeSty, which explicitly accounts for the alignment of the source and pseudo domains in the process of data augmentation.
arXiv Detail & Related papers (2024-06-01T02:41:34Z) - Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization [1.1534313664323637]
Domain shift is a formidable issue in Machine Learning that causes a model to suffer from performance degradation when tested on unseen domains.
FedDG attempts to train a global model using collaborative clients in a privacy-preserving manner that can generalize well to unseen clients possibly with domain shift.
Here, we introduce a novel architectural method for FedDG, namely gPerXAN, which relies on a normalization scheme working with a guiding regularizer.
arXiv Detail & Related papers (2024-03-22T20:22:08Z) - Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation [84.82153655786183]
We propose a novel framework called Informative Data Mining (IDM) to enable efficient one-shot domain adaptation for semantic segmentation.
IDM provides an uncertainty-based selection criterion to identify the most informative samples, which facilitates quick adaptation and reduces redundant training.
Our approach outperforms existing methods and achieves a new state-of-the-art one-shot performance of 56.7%/55.4% on the GTA5/SYNTHIA to Cityscapes adaptation tasks.
arXiv Detail & Related papers (2023-09-25T15:56:01Z) - Combining Data Generation and Active Learning for Low-Resource Question Answering [23.755283239897132]
We propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings.
Our findings show that our novel approach, where humans are incorporated in a data generation approach, boosts performance in the low-resource, domain-specific setting.
arXiv Detail & Related papers (2022-11-27T16:31:33Z) - META: Mimicking Embedding via oThers' Aggregation for Generalizable
Person Re-identification [68.39849081353704]
Domain generalizable (DG) person re-identification (ReID) aims to test across unseen domains without access to the target domain data at training time.
This paper presents a new approach called Mimicking Embedding via oThers' Aggregation (META) for DG ReID.
arXiv Detail & Related papers (2021-12-16T08:06:50Z) - Style Normalization and Restitution for DomainGeneralization and
Adaptation [88.86865069583149]
An effective domain generalizable model is expected to learn feature representations that are both generalizable and discriminative.
In this paper, we design a novel Style Normalization and Restitution module (SNR) to ensure both high generalization and discrimination capability of the networks.
arXiv Detail & Related papers (2021-01-03T09:01:39Z) - Learning to Cluster under Domain Shift [20.00056591000625]
In this work we address the problem of transferring knowledge from a source to a target domain when both source and target data have no annotations.
Inspired by recent works on deep clustering, our approach leverages information from data gathered from multiple source domains.
We show that our method is able to automatically discover relevant semantic information even in presence of few target samples.
arXiv Detail & Related papers (2020-08-11T12:03:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.