How role-play shapes relevance judgment in zero-shot LLM rankers
- URL: http://arxiv.org/abs/2510.17535v1
- Date: Mon, 20 Oct 2025 13:39:48 GMT
- Title: How role-play shapes relevance judgment in zero-shot LLM rankers
- Authors: Yumeng Wang, Jirui Qi, Catherine Chen, Panagiotis Eustratiadis, Suzan Verberne
- Abstract summary: Large Language Models (LLMs) have emerged as promising zero-shot rankers. Their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings.
- Score: 15.11127856890218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have emerged as promising zero-shot rankers, but their performance is highly sensitive to prompt formulation. In particular, role-play prompts, where the model is assigned a functional role or identity, often give more robust and accurate relevance rankings. However, the mechanisms and diversity of role-play effects remain underexplored, limiting both effective use and interpretability. In this work, we systematically examine how role-play variations influence zero-shot LLM rankers. We employ causal intervention techniques from mechanistic interpretability to trace how role-play information shapes relevance judgments in LLMs. Our analysis reveals that (1) careful formulation of role descriptions has a large effect on the ranking quality of the LLM; (2) role-play signals are predominantly encoded in early layers and communicate with task instructions in middle layers, while receiving limited interaction with query or document representations. Specifically, we identify a group of attention heads that encode information critical for role-conditioned relevance. These findings not only shed light on the inner workings of role-play in LLM ranking but also offer guidance for designing more effective prompts in IR and beyond, pointing toward broader opportunities for leveraging role-play in zero-shot applications.
Related papers
- The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation [73.16564415490113]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. We propose two approaches, FairFT and FairFilter, to mitigate the fairness issues introduced by RAG for small-scale LLMs.
arXiv Detail & Related papers (2025-04-11T10:17:10Z)
- How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks. We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
- Reasoning Does Not Necessarily Improve Role-Playing Ability [46.441264660062195]
The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains. We compare the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, and Chinese role-playing performance surpasses English role-playing performance.
arXiv Detail & Related papers (2025-02-24T08:08:41Z)
- RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following [31.80357046048002]
Role-playing is important for Large Language Models to follow diverse instructions. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries. We introduce a fine-grained role-playing and instruction-following benchmark, named RoleMRC.
arXiv Detail & Related papers (2025-02-17T03:08:37Z)
- Thinking Before Speaking: A Role-playing Model with Mindset [0.6428333375712125]
Large Language Models (LLMs) are skilled at simulating human behaviors.
These models tend to perform poorly when confronted with knowledge that the assumed role does not possess.
We propose a Thinking Before Speaking (TBS) model in this paper.
arXiv Detail & Related papers (2024-09-14T02:41:48Z)
- SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [85.6641890712617]
Large language models (LLMs) have advanced the development of various AI conversational agents.
SocialBench is the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels.
We find that agents excelling at the individual level are not necessarily proficient at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
- Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment [62.898963074989766]
We introduce Ditto, a self-alignment method for role-play.
This method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold.
We present the first comprehensive cross-supervision alignment experiment in the role-play domain.
arXiv Detail & Related papers (2024-01-23T03:56:22Z)
- RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models [107.00832724504752]
We introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in Large Language Models (LLMs).
By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples.
arXiv Detail & Related papers (2023-10-01T17:52:59Z)
- RODE: Learning Roles to Decompose Multi-Agent Tasks [69.56458960841165]
Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles.
We propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents.
By virtue of these advances, our method outperforms the current state-of-the-art MARL algorithms on 10 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark.
arXiv Detail & Related papers (2020-10-04T09:20:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.