FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits
- URL: http://arxiv.org/abs/2509.24701v1
- Date: Mon, 29 Sep 2025 12:32:21 GMT
- Title: FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits
- Authors: Pingchen Lu, Zhi Hong, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Yao Shu, Min Zhang, Shuang Qiu, Zhongxiang Dai
- Abstract summary: We introduce a novel framework for sample-efficient federated prompt optimization based on multi-armed bandits (MABs). The MAB framework is uniquely suited for this problem as it is (1) inherently a black-box optimization method, (2) practically sample-efficient, and (3) enables collaborative learning with theoretically guaranteed benefit from more participating agents.
- Score: 44.444223633730154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of large language models (LLMs) is highly sensitive to the input prompt, making prompt optimization a critical task. However, real-world application is hindered by three major challenges: (1) the black-box nature of powerful proprietary LLMs, (2) the need for high sample efficiency due to query costs, and (3) the desire for privacy-preserving collaboration among multiple users. To address these challenges simultaneously, we introduce a novel framework for sample-efficient federated prompt optimization based on multi-armed bandits (MABs). The MAB framework is uniquely suited for this problem as it is (1) inherently a black-box optimization method, (2) practically sample-efficient, and (3) enables collaborative learning with theoretically guaranteed benefit from more participating agents. We first propose the Federated Prompt Optimization via Bandits (FedPOB) algorithm, a federated variant of the Linear UCB algorithm, where agents collaborate by sharing model parameters instead of raw data. We then extend our approach to the practical setting of comparative user feedback by introducing FedPOB with Preference Feedback (FedPOB-Pref), an efficient algorithm based on federated dueling bandits. Extensive experiments demonstrate that both FedPOB and FedPOB-Pref significantly outperform existing baselines and that their performance consistently improves as more agents participate in the collaboration, validating the effectiveness of our federated approach.
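The abstract describes FedPOB as a federated variant of LinUCB in which agents collaborate by sharing model parameters rather than raw data. Below is a minimal sketch of that idea, assuming each candidate prompt is represented by a fixed feature vector and that agents periodically synchronize their sufficient statistics through a server; the class and function names are illustrative, not the authors' implementation:

```python
import numpy as np

class LinUCBAgent:
    """One agent running LinUCB over a shared pool of candidate prompts.

    Each prompt is an 'arm' with a fixed feature vector x; the agent keeps
    the usual LinUCB statistics A = lam*I + sum(x x^T) and b = sum(r * x).
    """

    def __init__(self, prompt_features, alpha=1.0, lam=1.0):
        self.X = np.asarray(prompt_features, dtype=float)  # (n_prompts, d)
        d = self.X.shape[1]
        self.alpha = alpha            # exploration strength
        self.lam = lam                # ridge regularizer
        self.A = lam * np.eye(d)      # regularized Gram matrix
        self.b = np.zeros(d)          # reward-weighted feature sum

    def select_prompt(self):
        """Pick the prompt with the highest upper confidence bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b        # ridge estimate of the reward weights
        ucb = self.X @ theta + self.alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", self.X, A_inv, self.X))
        return int(np.argmax(ucb))

    def update(self, arm, reward):
        x = self.X[arm]
        self.A += np.outer(x, x)
        self.b += reward * x

def federated_round(agents):
    """Share model parameters, not raw data: the server aggregates each
    agent's statistics and broadcasts the result back. (A full version
    would communicate only the increments accumulated since the last
    synchronization, to avoid double counting across rounds.)"""
    d = agents[0].b.shape[0]
    lam = agents[0].lam
    A_sum = sum(a.A for a in agents) - (len(agents) - 1) * lam * np.eye(d)
    b_sum = sum(a.b for a in agents)
    for a in agents:
        a.A = A_sum.copy()
        a.b = b_sum.copy()
```

In a prompt-optimization loop, `reward` would be a task score obtained by querying the black-box LLM with the selected prompt; FedPOB-Pref replaces this scalar reward with pairwise preference feedback handled by a federated dueling-bandit update.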
Related papers
- MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks [21.211097851224487]
We introduce MASPOB (Multi-Agent System Prompt Optimization via Bandits), a novel sample-efficient framework based on bandits. To handle topology-induced coupling, MASPOB integrates Graph Neural Networks (GNNs) to capture structural priors, learning topology-aware representations of prompt semantics.
arXiv Detail & Related papers (2026-03-03T05:59:05Z) - Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference [0.29057513016551245]
We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning.
arXiv Detail & Related papers (2025-11-06T11:27:38Z) - APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport [37.21695864040979]
The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning. This paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism.
arXiv Detail & Related papers (2025-10-13T03:13:28Z) - M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following [4.119014132092875]
Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following. M3PO is a novel and data-efficient method designed to enhance LVLMs' capabilities in visual instruction following. M3PO intelligently selects the most "learning-valuable" preference sample pairs from a diverse pool of LVLM-generated candidates.
arXiv Detail & Related papers (2025-08-17T18:07:55Z) - Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits [15.700062892888084]
We introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness.
arXiv Detail & Related papers (2025-06-17T21:43:21Z) - Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB), which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB), which uses a neural network to model complex, non-linear user reward functions.
arXiv Detail & Related papers (2025-02-04T07:55:41Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - Efficient and Robust Regularized Federated Recommendation [52.24782464815489]
The federated recommender system (FedRS) addresses both user preference and privacy concerns.
We propose a novel method, RFRecF, that incorporates non-uniform gradient descent to improve communication efficiency.
Experiments demonstrate RFRecF's superior robustness compared to diverse baselines.
arXiv Detail & Related papers (2024-11-03T12:10:20Z) - TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z) - Efficient Prompt Optimization Through the Lens of Best Arm Identification [50.56113809171805]
This work provides a principled framework, TRIPLE, to efficiently perform prompt selection under an explicit budget constraint.
It is built on a novel connection established between prompt optimization and fixed-budget best arm identification (BAI-FB) in multi-armed bandits (MABs).
arXiv Detail & Related papers (2024-02-15T05:31:13Z)
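For context on the TRIPLE entry above, the connection between prompt selection and fixed-budget best arm identification can be illustrated with sequential halving, a standard BAI-FB strategy. Whether TRIPLE uses exactly this variant is not specified in the summary, and the `evaluate` callback here is a hypothetical stand-in for scoring a prompt with the target LLM:

```python
import math
import random

def sequential_halving(prompts, evaluate, budget):
    """Fixed-budget best-arm identification by sequential halving.

    Spends an equal share of the query budget on the surviving prompts in
    each round, then discards the worse half, until one prompt remains.
    `evaluate(prompt)` should return a (possibly noisy) score, e.g. task
    accuracy of the LLM's output on a sampled validation instance.
    """
    survivors = list(prompts)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        pulls = max(1, budget // (rounds * len(survivors)))
        means = [sum(evaluate(p) for _ in range(pulls)) / pulls
                 for p in survivors]
        order = sorted(range(len(survivors)), key=lambda i: means[i],
                       reverse=True)
        survivors = [survivors[i] for i in order[: max(1, len(survivors) // 2)]]
    return survivors[0]

# Toy usage with a noisy stand-in scorer (a real run would query an LLM
# and score its output on validation examples).
candidates = ["Answer step by step:", "Be concise:", "Think, then answer:"]
best = sequential_halving(candidates,
                          evaluate=lambda p: random.random(),
                          budget=60)
print(best)
```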