ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems
- URL: http://arxiv.org/abs/2502.16077v1
- Date: Sat, 22 Feb 2025 04:43:20 GMT
- Title: ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems
- Authors: Haibo Xing, Kanefumi Matsuyama, Hao Deng, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
- Abstract summary: In the retrieval stage, classic embedding-based retrieval methods depend on effective negative sampling techniques to enhance both performance and efficiency. We propose Effective and Semantic-Aware Negative Sampling (ESANS), which integrates two key components: Effective Dense Interpolation Strategy (EDIS) and Multimodal Semantic-Aware Clustering (MSAC).
- Score: 7.897183317096681
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Industrial recommendation systems typically involve a two-stage process: retrieval and ranking, which aims to match users with millions of items. In the retrieval stage, classic embedding-based retrieval (EBR) methods depend on effective negative sampling techniques to enhance both performance and efficiency. However, existing techniques often suffer from false negatives, high cost for ensuring sampling quality and semantic information deficiency. To address these limitations, we propose Effective and Semantic-Aware Negative Sampling (ESANS), which integrates two key components: Effective Dense Interpolation Strategy (EDIS) and Multimodal Semantic-Aware Clustering (MSAC). EDIS generates virtual samples within the low-dimensional embedding space to improve the diversity and density of the sampling distribution while minimizing computational costs. MSAC refines the negative sampling distribution by hierarchically clustering item representations based on multimodal information (visual, textual, behavioral), ensuring semantic consistency and reducing false negatives. Extensive offline and online experiments demonstrate the superior efficiency and performance of ESANS.
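The abstract describes EDIS only at a high level: virtual samples are generated inside the embedding space to make the negative distribution denser and more diverse. A minimal sketch of that general idea, assuming a mixup-style convex combination of two real negative embeddings, might look as follows. The function names, the uniform mixing weight, and the toy vectors are illustrative assumptions, not the paper's actual procedure.

```python
import random


def interpolate_negatives(negatives, num_virtual, seed=0):
    """Build virtual negatives as convex combinations of pairs of real negatives.

    Assumption: a plain linear (mixup-style) interpolation with a uniformly
    drawn weight. The paper's EDIS may choose pairs and weights differently.
    """
    rng = random.Random(seed)
    virtual = []
    for _ in range(num_virtual):
        a, b = rng.sample(negatives, 2)   # two distinct real negatives
        lam = rng.uniform(0.0, 1.0)       # mixing weight in [0, 1]
        virtual.append([lam * x + (1.0 - lam) * y for x, y in zip(a, b)])
    return virtual


def dot(u, v):
    """Inner-product score, as commonly used in embedding-based retrieval."""
    return sum(x * y for x, y in zip(u, v))


# Toy data (hypothetical): one user embedding and three sampled item negatives.
user = [1.0, 0.0]
negatives = [[0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

virtual = interpolate_negatives(negatives, num_virtual=4)
# Real and virtual negatives are scored together, e.g. for a softmax-style loss.
scores = [dot(user, item) for item in negatives + virtual]
```

Because the interpolation operates on embeddings that have already been computed, the extra negatives cost only a few vector operations each, which is in line with the abstract's emphasis on low computational cost.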
Related papers
- Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials [9.668242919588199]
Hard negative samples can accelerate model convergence and optimize decision boundaries.
This paper introduces the concept of Semantic Negative Sampling.
We propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals.
arXiv Detail & Related papers (2025-04-07T04:39:45Z)
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation [17.18176550968383]
We propose a novel approach named Semantic Retrieval Augmented Contrastive Learning (SRA-CL), which leverages semantic information to improve the reliability of contrastive samples.
SRA-CL comprises two main components: (1) Cross-Sequence Contrastive Learning via User Semantic Retrieval, which utilizes large language models (LLMs) to understand diverse user preferences and retrieve semantically similar users to form reliable positive samples through a learnable sample method; and (2) Intra-Sequence Contrastive Learning via Item Semantic Retrieval, which employs LLMs to comprehend items and retrieve similar items to perform semantic-based item substitution.
arXiv Detail & Related papers (2025-03-06T07:25:19Z)
- Diversified Sampling Improves Scaling LLM inference [31.18762591875725]
DivSampling is a novel and versatile sampling technique designed to enhance the diversity of candidate solutions.
Our theoretical analysis demonstrates that, under mild assumptions, the error rates of responses generated from diverse prompts are significantly lower compared to those produced by stationary prompts.
arXiv Detail & Related papers (2025-02-16T07:37:58Z)
- SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval [45.971786380884126]
The performance of dense retrieval (DR) is significantly influenced by the quality of negative sampling.
Recent advancements in large language models (LLMs) offer an innovative solution by generating contextually rich and diverse negative samples.
In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples.
arXiv Detail & Related papers (2024-12-23T03:49:00Z)
- Scalable and Effective Negative Sample Generation for Hyperedge Prediction [55.9298019975967]
Hyperedge prediction is crucial for understanding complex multi-entity interactions in web-based applications.
Traditional methods often face difficulties in generating high-quality negative samples due to imbalance between positive and negative instances.
We present the scalable and effective negative sample generation for Hyperedge Prediction (SEHP) framework, which utilizes diffusion models to tackle these challenges.
arXiv Detail & Related papers (2024-11-19T09:16:25Z)
- A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems [67.52782366565658]
State-of-the-art recommender systems (RSs) depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables.
Despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols.
This study investigates the performance, efficiency, and cross-task transferability of various LERSs via a thorough benchmarking process.
arXiv Detail & Related papers (2024-06-25T07:45:00Z)
- Multi-Margin Cosine Loss: Proposal and Application in Recommender Systems [0.0]
Collaborative filtering-based deep learning techniques have regained popularity due to their straightforward nature.
These systems consist of three main components: an interaction module, a loss function, and a negative sampling strategy.
The proposed Multi-Margin Cosine Loss (MMCL) addresses these challenges by introducing multiple margins and varying weights for negative samples.
arXiv Detail & Related papers (2024-05-07T18:58:32Z)
- Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in Musculoskeletal Segmentation of Lower Extremities [0.9287179270753105]
This study introduces a hybrid representation-enhanced sampling strategy that integrates both density and diversity criteria.
Experiments are performed on two lower extremity (LE) datasets of MRI and CT images.
arXiv Detail & Related papers (2023-07-26T06:52:29Z)
- Test-Time Distribution Normalization for Contrastively Learned Vision-language Models [39.66329310098645]
CLIP, one of the most representative approaches proposed recently, has garnered widespread adoption due to its effectiveness.
This paper reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information during test-time.
We propose Distribution Normalization (DN), where we approximate the mean representation of a batch of test samples and use such a mean to represent what would be analogous to negative samples in the InfoNCE loss.
arXiv Detail & Related papers (2023-02-22T01:14:30Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Towards Overcoming False Positives in Visual Relationship Detection [95.15011997876606]
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD).
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z)
- Multi-Scale Positive Sample Refinement for Few-Shot Object Detection [61.60255654558682]
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances.
We propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD.
MPSR generates multi-scale positive samples as object pyramids and refines the prediction at various scales.
arXiv Detail & Related papers (2020-07-18T09:48:29Z)