Enhancing and Scaling Search Query Datasets for Recommendation Systems
- URL: http://arxiv.org/abs/2505.11176v2
- Date: Fri, 22 Aug 2025 16:07:58 GMT
- Title: Enhancing and Scaling Search Query Datasets for Recommendation Systems
- Authors: Aaron Rodrigues, Mahmood Hegazy, Azzam Naeem,
- Abstract summary: This paper presents a deployed, production-grade system to enhance and scale search query datasets for intent-based recommendation systems in digital banking.<n>The proposed system integrates three core modules: Synthetic Query Generation, Intent Disambiguation, and Intent Gap Analysis.<n>This work underscores the role of high-quality, scalable data in modern AI-driven applications and advocates a proactive approach to data enhancement as a key driver of value.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a deployed, production-grade system designed to enhance and scale search query datasets for intent-based recommendation systems in digital banking. In real-world environments, the growing volume and complexity of user intents create substantial challenges for data management, resulting in suboptimal recommendations and delayed product onboarding. To overcome these challenges, our approach shifts the focus from model-centric enhancements to automated, data-centric strategies. The proposed system integrates three core modules: Synthetic Query Generation, Intent Disambiguation, and Intent Gap Analysis. Synthetic Query Generation produces diverse and realistic user queries. Our experiments reveal no statistically significant difference when using synthetic data for Clinc150, while Banking77 and a proprietary dataset show significant differences. We dig into the underlying factors driving these variations, demonstrating that our approach effectively alleviates the cold start problem (i.e. the challenge of recommending new products with limited historical data). Intent Disambiguation refines broad and overlapping intent categories into precise subintents, achieving an F1 score of 0.863 $\pm$ 0.127 against expert reannotations and leading to clearer differentiation and more precise recommendation mapping. Meanwhile, Intent Gap Analysis identifies latent customer needs by extracting novel intents from unlabeled queries; recovery rates reach up to 71\% in controlled evaluations. Deployed in a live banking environment, our system demonstrates significant improvements in recommendation precision and operation agility, ultimately delivering enhanced user experiences and strategic business benefits. This work underscores the role of high-quality, scalable data in modern AI-driven applications and advocates a proactive approach to data enhancement as a key driver of value.
Related papers
- Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control [82.30868101940068]
We propose a paradigm in which a model bootstraps its own performance without reliance on external data or teacher models.<n>Our theoretical analysis shows that RSIR acts as a data-driven implicit regularizer, smoothing the optimization landscape.<n>We show that even smaller models benefit, and weak models can generate effective training curricula for stronger ones.
arXiv Detail & Related papers (2026-02-17T15:31:32Z) - Robust Assortment Optimization from Observational Data [32.720761309403436]
We propose a framework for data-driven assortment optimization that accounts for potential distributional shifts in customer choice behavior.<n>Our approach models potential preference shift from a nominal choice model that generates data and seeks to maximize worst-case expected revenue.
arXiv Detail & Related papers (2026-02-11T09:57:16Z) - Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward [24.738836592075927]
We introduce a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward.<n>Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines.<n>Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.
arXiv Detail & Related papers (2026-01-31T18:15:50Z) - A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems [0.0]
This research presents a systematic framework for metadata enrichment using large language models (LLMs) to enhance document retrieval in Retrieval-Augmented Generation (RAG) systems.<n>Our approach employs a comprehensive, structured pipeline that dynamically generates meaningful metadata for document segments.
arXiv Detail & Related papers (2025-12-05T04:05:06Z) - Intent-Aware Neural Query Reformulation for Behavior-Aligned Product Search [0.0]
This work introduces a robust data pipeline designed to mine and analyze large-scale buyer query logs.<n>The pipeline systematically captures patterns indicative of latent purchase intent, enabling the construction of a high-fidelity, intent-rich dataset.<n>Our findings highlight the value of intent-centric modeling in bridging the gap between sparse user inputs and complex product discovery goals.
arXiv Detail & Related papers (2025-07-29T20:20:07Z) - Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners.<n>In this work, we introduce a new task paradigm: proactive information gathering.<n>We design a scalable framework that generates partially specified, real-world tasks, masking key information.<n>Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z) - Privacy-Preserving Synthetic Review Generation with Diverse Writing Styles Using LLMs [6.719863580831653]
Synthetic data generated by Large Language Models (LLMs) provides cost-effective, scalable alternative to real-world data to facilitate model training.<n>We quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) of synthetic datasets generated by several state-of-the-art LLMs.<n> Guided by the evaluation results, a prompt-based approach is proposed to enhance the diversity of synthetic reviews while preserving reviewer privacy.
arXiv Detail & Related papers (2025-07-24T03:12:16Z) - InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments.<n>We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity.<n>We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z) - RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls.<n>Existing approaches to synthetic data generation fail to replicate real-world data distributions.<n>We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z) - From Reviews to Dialogues: Active Synthesis for Zero-Shot LLM-based Conversational Recommender System [49.57258257916805]
Large Language Models (LLMs) demonstrate strong zero-shot recommendation capabilities.<n>Practical applications often favor smaller, internally managed recommender models due to scalability, interpretability, and data privacy constraints.<n>We propose an active data augmentation framework that synthesizes conversational training data by leveraging black-box LLMs guided by active learning techniques.
arXiv Detail & Related papers (2025-04-21T23:05:47Z) - Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval [49.669503570350166]
Generative information retrieval (GenIR) is a promising neural retrieval paradigm that formulates document retrieval as a document identifier (docid) generation task.<n>Existing GenIR models suffer from token-level misalignment, where models trained to predict the next token often fail to capture document-level relevance effectively.<n>We propose direct document relevance optimization (DDRO), which aligns token-level docid generation with document-level relevance estimation through direct optimization via pairwise ranking.
arXiv Detail & Related papers (2025-04-07T15:27:37Z) - Synthetic Data Generation Using Large Language Models: Advances in Text and Code [0.0]
Large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains.<n>We highlight key techniques such as prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement.<n>We discuss the accompanying challenges, including factual inaccuracies in generated text, insufficient stylistic or distributional realism, and risks of bias amplification.
arXiv Detail & Related papers (2025-03-18T08:34:03Z) - A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration.<n>These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings, such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance.<n>This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; Integration and Adaptation, which
arXiv Detail & Related papers (2025-03-08T05:41:42Z) - Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024.<n>We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.<n>We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.<n>Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - Learning Recommender Systems with Soft Target: A Decoupled Perspective [49.83787742587449]
We propose a novel decoupled soft label optimization framework to consider the objectives as two aspects by leveraging soft labels.
We present a sensible soft-label generation algorithm that models a label propagation algorithm to explore users' latent interests in unobserved feedback via neighbors.
arXiv Detail & Related papers (2024-10-09T04:20:15Z) - Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting [23.61061000692023]
This study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions.<n>We propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with Large Language Models.
arXiv Detail & Related papers (2024-08-18T11:07:38Z) - Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions [17.96479268328824]
We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content.
We propose a multi-step generation process, predicated on the idea of creating compact representations of discussion threads.
arXiv Detail & Related papers (2024-08-15T18:43:50Z) - Retrieval Augmentation via User Interest Clustering [57.63883506013693]
Industrial recommender systems are sensitive to the patterns of user-item engagement.
We propose a novel approach that efficiently constructs user interest and facilitates low computational cost inference.
Our approach has been deployed in multiple products at Meta, facilitating short-form video related recommendation.
arXiv Detail & Related papers (2024-08-07T16:35:10Z) - Exploring Augmentation and Cognitive Strategies for AI based Synthetic Personae [1.0742675209112622]
This position paper advocates for using large language models (LLMs) as data augmentation systems rather than zero-shot generators.
We propose the development of robust cognitive and memory frameworks to guide LLM responses.
Initial explorations suggest that data enrichment, episodic memory, and self-reflection techniques can improve the reliability of synthetic personae.
arXiv Detail & Related papers (2024-04-16T20:22:12Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z) - An explainable machine learning-based approach for analyzing customers'
online data to identify the importance of product attributes [0.6437284704257459]
We propose a game theory machine learning (ML) method that extracts comprehensive design implications for product development.
We apply our method to a real-world dataset of laptops from Kaggle, and derive design implications based on the results.
arXiv Detail & Related papers (2024-02-03T20:50:48Z) - Mining Implicit Entity Preference from User-Item Interaction Data for
Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task.
Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator.
To discover implicit entity preference of users, we design an elaborate collaborative learning algorithms based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.