Prompt-to-Leaderboard
- URL: http://arxiv.org/abs/2502.14855v2
- Date: Mon, 10 Mar 2025 06:44:48 GMT
- Title: Prompt-to-Leaderboard
- Authors: Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
- Abstract summary: We propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. Our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves.
- Score: 20.299021582134202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM that takes natural language prompts as input and outputs a vector of Bradley-Terry coefficients, which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L captures the nuanced landscape of language model performance better than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub at https://github.com/lmarena/p2l.
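The Bradley-Terry mechanics behind P2L are compact enough to sketch. Below is a minimal illustration, not the paper's implementation: a trained P2L head is assumed to map a prompt to one coefficient per model, and those coefficients induce both pairwise win probabilities and a prompt-specific leaderboard. The model names and coefficient values are hypothetical.

```python
import numpy as np

def win_probability(theta_a: float, theta_b: float) -> float:
    """Bradley-Terry win probability: P(model A beats model B)."""
    return 1.0 / (1.0 + np.exp(-(theta_a - theta_b)))

def bt_loss(theta_a: float, theta_b: float, vote: int) -> float:
    """Binary cross-entropy against a human vote (1 if A was preferred)."""
    p = win_probability(theta_a, theta_b)
    return -(vote * np.log(p) + (1 - vote) * np.log(1 - p))

def leaderboard(theta: dict[str, float]) -> list[str]:
    """Prompt-specific leaderboard: sort models by their BT coefficient."""
    return sorted(theta, key=theta.get, reverse=True)

# Hypothetical coefficients a trained P2L model might emit for one prompt.
theta = {"model-x": 1.3, "model-y": 0.4, "model-z": -0.2}
print(leaderboard(theta))  # ['model-x', 'model-y', 'model-z']
print(round(win_probability(theta["model-x"], theta["model-y"]), 2))  # 0.71
```

Routing falls out of the same object: for each incoming query, send it to the model with the highest predicted coefficient for that prompt.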
Related papers
- On the Role of Preference Variance in Preference Optimization [55.364953481473286]
We investigate the impact of preference variance (PVar) on the effectiveness of Direct Preference Optimization (DPO) training. We show that training on prompts with higher PVar outperforms training on randomly selected prompts or on those with lower PVar.
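As a rough illustration of the selection idea (the exact PVar estimator is defined in the paper; this is an assumption-laden proxy): score each prompt by the variance of the preference probabilities a judge or reward model assigns to its sampled response pairs, and keep the highest-variance prompts for DPO training.

```python
import numpy as np

def pvar_proxy(pref_probs: np.ndarray) -> float:
    """Illustrative PVar proxy: variance of preference probabilities over
    response pairs sampled for one prompt (not the paper's exact estimator)."""
    return float(np.var(pref_probs))

def select_prompts(prompts: list[str], probs: list[np.ndarray], k: int) -> list[str]:
    """Keep the k prompts whose sampled preference probabilities vary most."""
    order = np.argsort([pvar_proxy(p) for p in probs])[::-1]
    return [prompts[i] for i in order[:k]]
```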
arXiv Detail & Related papers (2025-10-14T22:34:52Z) - Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations [0.0]
Large language models (LLMs) have been adopted for prompt-based recommendation. LLMs face limitations such as a limited context window, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking. We propose a hybrid framework that combines a traditional recommendation model with an LLM that reranks the top-k items using structured prompts.
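A minimal sketch of such a pipeline follows, with `llm` as a stand-in for any callable language model; the prompt layout and parsing are illustrative, not the paper's format.

```python
def build_rerank_prompt(history: list[str], candidates: list[str]) -> str:
    """Structured listwise prompt: user history plus numbered candidates."""
    lines = ["The user recently interacted with:"]
    lines += [f"- {item}" for item in history]
    lines += ["Rerank these candidates from most to least relevant.",
              "Answer with the numbers only, in order:"]
    lines += [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    return "\n".join(lines)

def rerank(history, candidates, llm):
    """`llm` is any callable prompt -> text; the reply parsing is illustrative."""
    reply = llm(build_rerank_prompt(history, candidates))
    picks = [int(t.strip(".,")) - 1 for t in reply.split() if t.strip(".,").isdigit()]
    return [candidates[i] for i in picks if 0 <= i < len(candidates)]
```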
arXiv Detail & Related papers (2025-05-08T05:01:44Z) - The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input [19.692322010161636]
FACTS Grounding evaluates language models' ability to generate text that is factually accurate with respect to given context. Models are evaluated using automated judge models in two phases. The FACTS Grounding leaderboard will be actively maintained over time.
arXiv Detail & Related papers (2025-01-06T18:28:04Z) - Show, Don't Tell: Aligning Language Models with Demonstrated Feedback [54.10302745921713]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
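To make the game-theoretic update concrete, here is a toy exponential-weights iteration over a finite set of candidate responses; the real SPPO performs the analogous update on LLM policies, and the preference matrix below is invented.

```python
import numpy as np

def sppo_step(pi: np.ndarray, pref: np.ndarray, eta: float = 1.0) -> np.ndarray:
    """One self-play update: reweight each response by how often it beats
    a draw from the current policy, then renormalize.

    pi:   policy over responses (nonnegative, sums to 1)
    pref: pref[i, j] = P(response i is preferred over response j)
    """
    win_vs_policy = pref @ pi            # P(response i beats the current policy)
    new_pi = pi * np.exp(eta * win_vs_policy)
    return new_pi / new_pi.sum()

# Invented 3-response preference matrix (rows beat columns with these probabilities).
pref = np.array([[0.5, 0.7, 0.6],
                 [0.3, 0.5, 0.8],
                 [0.4, 0.2, 0.5]])
pi = np.ones(3) / 3
for _ in range(50):
    pi = sppo_step(pi, pref)
print(pi.round(3))  # mass concentrates on the equilibrium strategy
```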
arXiv Detail & Related papers (2024-05-01T17:59:20Z) - Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators [59.48172585509628]
We propose a simple regression analysis approach for controlling biases in auto-evaluations.
As a real case study, we focus on reducing the length bias of AlpacaEval, a benchmark for chat LLMs.
We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
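A minimal sketch of the regression idea (not the paper's exact GLM, which also models instruction difficulty; the battle data here are fabricated for shape): fit a logistic model of the judge's preference on the model identity and the length difference, then read off the win rate with the length difference forced to zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated battles: [is_test_model, length_diff_vs_baseline] -> judge vote.
X = np.array([[1, 120], [1, -40], [1, 60], [0, 200], [0, -10], [0, 30]], float)
y = np.array([1, 0, 1, 1, 0, 0])  # 1 = preferred over the baseline output

model = LogisticRegression().fit(X, y)

# Counterfactual: same model, but as if both outputs had equal length.
lc_win_rate = model.predict_proba([[1.0, 0.0]])[0, 1]
print(f"length-controlled win rate: {lc_win_rate:.2f}")
```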
arXiv Detail & Related papers (2024-04-06T02:29:02Z) - Few-Shot Cross-Lingual Transfer for Prompting Large Language Models in Low-Resource Languages [0.0]
"prompting" is where a user provides a description of a task and some completed examples of the task to a PLM as context before prompting the PLM to perform the task on a new example.
We consider three methods: few-shot prompting (prompt), language-adaptive fine-tuning (LAFT), and neural machine translation (translate).
We find that translate and prompt settings are a compute-efficient and cost-effective method of few-shot prompting for the selected low-resource languages.
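A bare-bones version of the translate-then-prompt pipeline, with `translate` and `llm` as hypothetical callables standing in for an MT system and a PLM; the prompt layout is illustrative.

```python
def translate_then_prompt(query: str, shots: list[tuple[str, str]],
                          translate, llm) -> str:
    """Translate a low-resource-language input, then few-shot prompt a PLM."""
    demo = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in shots)
    english_query = translate(query)  # low-resource language -> English
    return llm(f"{demo}\n\nInput: {english_query}\nOutput:")
```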
arXiv Detail & Related papers (2024-03-09T21:36:13Z) - Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of Pre-trained Language Models with Proximal Policy Optimization [18.75866961339424]
ChatGPT has highlighted the potential of reinforcement learning from human feedback.
To reduce labor costs, we propose a self-supervised text ranking approach.
arXiv Detail & Related papers (2024-02-28T12:24:07Z) - Measuring and Controlling Instruction (In)Stability in Language Model Dialogs [72.38330196290119]
System-prompting is a tool for customizing language-model chatbots, enabling them to follow a specific instruction.
We propose a benchmark to test this assumption, evaluating instruction stability via self-chats.
We reveal significant instruction drift within eight rounds of conversation.
We propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
arXiv Detail & Related papers (2024-02-13T20:10:29Z) - Large Language Models are Zero-Shot Rankers for Recommender Systems [76.02500186203929]
This work aims to investigate the capacity of large language models (LLMs) to act as the ranking model for recommender systems.
We show that LLMs have promising zero-shot ranking abilities but struggle to perceive the order of historical interactions.
We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies.
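One of the alluded-to bootstrapping strategies can be sketched as rank aggregation over shuffled candidate orders, which damps position bias; `llm_rank` is a hypothetical callable returning the model's ranking of a candidate list.

```python
import random
from collections import defaultdict

def bootstrap_rank(candidates: list[str], llm_rank, rounds: int = 3,
                   seed: int = 0) -> list[str]:
    """Average each item's rank across shuffled presentations."""
    rng = random.Random(seed)
    rank_sum = defaultdict(float)
    for _ in range(rounds):
        shuffled = candidates[:]
        rng.shuffle(shuffled)
        for rank, item in enumerate(llm_rank(shuffled)):
            rank_sum[item] += rank
    return sorted(candidates, key=lambda c: rank_sum[c])
```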
arXiv Detail & Related papers (2023-05-15T17:57:39Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
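Schematically, the framework composes each input with a stimulus emitted by the small policy model before querying the black-box LLM. In the sketch below, `policy_keywords` stands in for that tuned policy (stubbed out here) and the template is illustrative, using summarization as an example task.

```python
def directional_prompt(article: str, policy_keywords) -> str:
    """Prepend hints from a small policy model to steer a black-box LLM.

    `policy_keywords` is any callable text -> list[str]; in the paper this
    role is played by a small tunable policy model, stubbed out here.
    """
    hints = "; ".join(policy_keywords(article))
    return (f"Article: {article}\n"
            f"Hint (cover these keywords): {hints}\n"
            f"Summarize the article in one sentence.")

# Toy stub: pick the two longest words as "keywords".
stub = lambda text: sorted(text.split(), key=len, reverse=True)[:2]
print(directional_prompt("The rover landed safely on Mars yesterday.", stub))
```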
arXiv Detail & Related papers (2023-02-22T17:44:15Z) - GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.