Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection
- URL: http://arxiv.org/abs/2601.10067v1
- Date: Thu, 15 Jan 2026 04:46:28 GMT
- Title: Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection
- Authors: Hung Vinh Tran, Tong Chen, Hechuan Wen, Quoc Viet Hung Nguyen, Bin Cui, Hongzhi Yin
- Abstract summary: Noise-aware Coreset Selection (NaCS) is a specialized framework for content-based recommendation systems.
NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels.
We show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques.
- Score: 43.57971566335706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Content-based recommendation systems (CRSs) utilize content features to predict user-item interactions, serving as essential tools for helping users navigate information-rich web services. However, ensuring the effectiveness of CRSs requires large-scale and even continuous model training to accommodate diverse user preferences, resulting in significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet, the selected coreset is vulnerable to the pervasive noise in user-item interactions, particularly when it is minimally sized. To address this, we propose Noise-aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. Meanwhile, we refine the selected coreset by filtering out low-confidence samples through uncertainty quantification, thereby avoiding training with unreliable interactions. Through extensive experiments, we show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93-95% of full-dataset training performance using merely 1% of the training data. The source code is available at https://github.com/chenxing1999/nacs.
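To make the abstract's three ingredients concrete, below is a minimal, self-contained sketch of such a pipeline: a last-layer gradient proxy, greedy facility-location selection (a classic submodular objective), and confidence-based label correction with entropy filtering. Everything here (the function names, the choice of facility location, and both thresholds) is our own illustrative assumption, not the authors' implementation.

```python
import numpy as np

def last_layer_grad_embeddings(probs: np.ndarray, labels: np.ndarray,
                               feats: np.ndarray) -> np.ndarray:
    """Per-sample gradient proxy for a binary recommender: the logit
    gradient of the cross-entropy loss is (p - y), scaled onto the
    penultimate-layer features."""
    return (probs - labels)[:, None] * feats             # (n, d)

def facility_location_greedy(emb: np.ndarray, k: int) -> list[int]:
    """Greedily pick k samples maximizing sum_i max_{j in S} sim(i, j)
    over cosine similarities of the gradient embeddings."""
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                                   # (n, n)
    selected: list[int] = []
    cover = np.zeros(sim.shape[0])      # current best match per sample
    for _ in range(k):
        gains = np.maximum(sim, cover[:, None]).sum(axis=0) - cover.sum()
        gains[selected] = -np.inf                         # forbid repeats
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return selected

def correct_and_filter(probs, labels, flip_thr=0.9, entropy_thr=0.6):
    """Flip labels the model confidently disagrees with, then drop
    samples whose predictive entropy marks them as unreliable."""
    labels = labels.copy()
    flip = np.abs(probs - labels) > flip_thr   # confident disagreement
    labels[flip] = (probs[flip] > 0.5).astype(labels.dtype)
    p = np.clip(probs, 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return labels, entropy < entropy_thr       # corrected labels, keep-mask
```

A typical run would train a model for a few epochs, compute `last_layer_grad_embeddings` from its predictions, select a budget-sized coreset with `facility_location_greedy`, and pass the chosen samples through `correct_and_filter` before continuing training.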
Related papers
- Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection [8.347306013377041]
We formalize Structure-aware Contrastive Learning for Core-set Selection (SCLCS) as a ranking-based subset selection problem.
SCLCS identifies stable samples via a top-k ranking, a Structural Perturbation Score, and a density-balanced sampling strategy.
On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) core-set selection methods by up to 23.2% in ranking consistency (nDCG@k).
arXiv Detail & Related papers (2026-02-05T13:50:39Z)
- Improving Model Classification by Optimizing the Training Dataset [3.987352341101438]
Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets.
We present a systematic framework for tuning the coreset generation process to enhance downstream classification quality.
arXiv Detail & Related papers (2025-07-22T16:10:11Z)
- Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization [45.48642232138223]
In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates.
We propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset.
KeCO effectively enhances ICL performance for the image classification task, achieving an average improvement of more than 20%.
arXiv Detail & Related papers (2025-04-19T06:26:23Z)
- Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning [19.152700266277247]
Non-Uniform Class-Wise Coreset Selection (NUCS) is a novel framework that integrates both class-level and instance-level criteria.
Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
arXiv Detail & Related papers (2025-04-17T15:40:51Z)
- Gradient Coreset for Federated Learning [27.04322811181904]
Federated Learning (FL) is used to learn machine learning models with data partitioned across multiple clients.
We propose an algorithm that selects a coreset at each client only every $K$ communication rounds.
We demonstrate that our coreset selection technique is highly effective in accounting for noise in clients' data; a toy version of this selection schedule is sketched after this entry.
arXiv Detail & Related papers (2024-01-13T06:17:17Z)
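As a toy illustration of the schedule described in the entry above, the sketch below re-selects each client's coreset only when `t % K == 0`; the gradient-norm selection rule, the linear model, and all constants are our own stand-ins rather than the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_coreset(X, y, w, k):
    """Stand-in for the paper's selection step: keep the k samples with
    the largest per-sample gradient norm under the current global model."""
    residual = X @ w - y
    grad_norm = np.abs(residual) * np.linalg.norm(X, axis=1)
    return np.argsort(-grad_norm)[:k]

def local_step(X, y, w, idx, lr=0.1):
    """One least-squares gradient step on the client's coreset."""
    Xc, yc = X[idx], y[idx]
    return w - lr * Xc.T @ (Xc @ w - yc) / len(idx)

# Toy federation: 3 clients with noisy linear data.
d, K, n_rounds = 5, 4, 20
w_true = rng.normal(size=d)
clients = []
for _ in range(3):
    X = rng.normal(size=(200, d))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=200)))

w = np.zeros(d)
coresets = [None] * len(clients)
for t in range(n_rounds):
    local_models = []
    for i, (X, y) in enumerate(clients):
        if t % K == 0:                 # re-select only every K rounds
            coresets[i] = select_coreset(X, y, w, k=20)
        local_models.append(local_step(X, y, w, coresets[i]))
    w = np.mean(local_models, axis=0)  # FedAvg-style aggregation
```

Amortizing selection over $K$ rounds is what keeps per-round overhead small: the relatively expensive selection runs n_rounds / K times instead of on every round.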
- Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints [69.27190330994635]
Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms.
We propose an innovative method that maintains an optimization priority order over model performance and coreset size.
Extensive experiments confirm its superiority, often yielding better model performance with smaller coreset sizes.
arXiv Detail & Related papers (2023-11-15T03:43:04Z)
- Contextual Squeeze-and-Excitation for Efficient Few-Shot Image Classification [57.36281142038042]
We present a new adaptive block called Contextual Squeeze-and-Excitation (CaSE) that adjusts a pretrained neural network on a new task to significantly improve performance.
We also present a new training protocol based on Coordinate-Descent called UpperCaSE that exploits meta-trained CaSE blocks and fine-tuning routines for efficient adaptation; a rough sketch of the underlying squeeze-and-excitation gating follows this entry.
arXiv Detail & Related papers (2022-06-20T15:25:08Z)
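For orientation, here is a rough PyTorch sketch of the squeeze-and-excitation gating that a CaSE-style block builds on: channel scales are computed from pooled task-context activations and applied to query feature maps. The layer sizes, pooling choice, and conditioning scheme are our assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class ContextSEBlock(nn.Module):
    """Squeeze-and-excitation gate conditioned on task context."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Squeeze: average the *context* activations over batch and space,
        # so the same task-conditioned scale gates every query sample.
        s = context.mean(dim=(0, 2, 3))        # (C,)
        scale = self.gate(s)                   # (C,) channel-wise gates
        return x * scale.view(1, -1, 1, 1)     # excite query features

# Usage: gate query feature maps with scales derived from the support set.
block = ContextSEBlock(channels=64)
support = torch.randn(5, 64, 8, 8)   # task context (support) activations
query = torch.randn(32, 64, 8, 8)
out = block(query, support)          # same shape as `query`
```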
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates coreset selection as a cardinality-constrained bilevel optimization problem; a generic form of this program is written out after this entry.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
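For reference, the cardinality-constrained bilevel program that the entry above (and the bilevel coreset paper at the end of this list) builds on can be written generically as follows; the notation is ours, with $w$ a binary selection vector, $k$ the coreset budget, and $\theta$ the model parameters:

```latex
\begin{aligned}
\min_{w \in \{0,1\}^n,\ \|w\|_0 \le k} \quad
  & \sum_{i=1}^{n} \ell\left(x_i, y_i;\ \theta^{*}(w)\right) \\
\text{s.t.} \quad
  & \theta^{*}(w) \in \arg\min_{\theta} \sum_{i=1}^{n} w_i\,
    \ell\left(x_i, y_i;\ \theta\right)
\end{aligned}
```

The inner problem trains the model on the selected (weighted) subset, while the outer problem chooses at most $k$ samples so that the resulting model still fits the full dataset well.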
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting samples with high affinity to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Coresets via Bilevel Optimization for Continual Learning and Streaming [86.67190358712064]
We propose a novel coreset construction via cardinality-constrained bilevel optimization (the same generic program sketched earlier in this list).
We show how our framework can efficiently generate coresets for deep neural networks, and demonstrate its empirical benefits in continual learning and in streaming settings.
arXiv Detail & Related papers (2020-06-06T14:20:25Z)