Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization
- URL: http://arxiv.org/abs/2510.06732v1
- Date: Wed, 08 Oct 2025 07:40:40 GMT
- Title: Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization
- Authors: Tiancheng Xing, Jerry Li, Yixuan Du, Xiyang Hu
- Abstract summary: We present Rank Anything First (RAF), a two-stage token optimization method. RAF crafts concise textual perturbations to consistently promote a target item in large language models. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness.
- Score: 7.7899746437628385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small, natural-sounding prompts. To expose this vulnerability, we present Rank Anything First (RAF), a two-stage token optimization method that crafts concise textual perturbations to consistently promote a target item in LLM-generated rankings while remaining hard to detect. Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens at the current position by combining the gradient of the rank-target with a readability score; Stage 2 evaluates those candidates under exact ranking and readability losses using an entropy-based dynamic weighting scheme, and selects a token via temperature-controlled sampling. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness. Experiments across multiple LLMs show that RAF significantly boosts the rank of target items using naturalistic language, with greater robustness than existing methods in both promoting target items and maintaining naturalness. These findings underscore a critical security implication: LLM-based reranking is inherently susceptible to adversarial manipulation, raising new challenges for the trustworthiness and robustness of modern retrieval systems. Our code is available at: https://github.com/glad-lab/RAF.
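The abstract outlines a concrete loop: shortlist tokens by a gradient-plus-readability score (Stage 1), then rescore with entropy-weighted exact losses and sample with temperature (Stage 2). A minimal toy sketch of that loop follows; `rank_grad`, `readability`, and the tiny vocabulary are invented stand-ins (the real method obtains gradients and losses from the target LLM), and the entropy-weighting formula is our guess at the idea, not the paper's exact scheme.

```python
import math
import random

random.seed(0)

VOCAB = ["great", "best", "top", "zzxq!", "value", "###", "recommend", "pick"]

def rank_grad(tok):                 # toy proxy: lower = more rank-promoting
    return -0.1 * len(tok) if tok.isalpha() else 0.5

def readability(tok):               # toy proxy: higher = more natural
    return 1.0 if tok.isalpha() else 0.0

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy_weight(losses):
    """Entropy-based dynamic weight: when candidate rank losses are nearly
    uniform (high entropy, little ranking signal), shift weight toward the
    readability objective."""
    p = softmax([-l for l in losses])
    h = -sum(pi * math.log(pi + 1e-12) for pi in p)
    return 1.0 - h / math.log(len(p))   # in [0, 1]; 1 = trust rank loss fully

def raf_step(k=4, temperature=0.5):
    # Stage 1 (GCG-style): shortlist k candidates by gradient + readability.
    shortlist = sorted(VOCAB, key=lambda t: rank_grad(t) - readability(t))[:k]
    # Stage 2: exact losses, entropy-weighted combination, temperature sampling.
    rank_l = [rank_grad(t) for t in shortlist]     # stand-in for exact rank loss
    read_l = [1.0 - readability(t) for t in shortlist]
    w = entropy_weight(rank_l)
    combined = [w * r + (1.0 - w) * n for r, n in zip(rank_l, read_l)]
    probs = softmax([-c / temperature for c in combined])
    return random.choices(shortlist, weights=probs)[0]

suffix = " ".join(raf_step() for _ in range(5))
print(suffix)
```

Appending one token at a time lets each step condition on the perturbation built so far, which is what keeps the final suffix both rank-effective and fluent.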
Related papers
- Autoregressive Ranking: Bridging the Gap Between Dual and Cross Encoders [37.16464474575651]
We show that pointwise generative ranking with multi-token docIDs is superior to dual encoders. We propose SToICaL - a Simple Token-Item Calibrated Loss - which incorporates rank-aware supervision at both the item and token levels.
arXiv Detail & Related papers (2026-01-09T07:16:28Z)
- Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning [70.6126069527741]
ConvRec-R1 is a two-stage framework for end-to-end training of conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization.
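The Stage 2 method extends group relative policy optimization (GRPO). A minimal sketch of the standard GRPO advantage computation it builds on: sample a group of responses per prompt and normalize each reward against the group's mean and standard deviation. Rank-GRPO's extension of this to individual rank positions is described in the paper; only the baseline group-relative step is shown here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO: normalize each sampled response's reward against the
    group statistics, so no learned value function is needed."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = var ** 0.5
    return [(r - mu) / (sd + eps) for r in rewards]

# Toy group of 4 sampled recommendation lists with scalar rewards.
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.9])
```

Responses above the group mean get positive advantages and are reinforced; those below are suppressed.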
arXiv Detail & Related papers (2025-10-23T02:56:00Z)
- StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization [16.031545357388357]
We present a novel adversarial attack method that manipulates large language model (LLM)-driven ranking systems. StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth.
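The core tool here, Langevin dynamics, can be sketched in one dimension: gradient descent on an energy function plus injected Gaussian noise, which draws samples from the Boltzmann distribution exp(-E(x)). The quadratic energy below is a stand-in; StealthRank's actual energy combines ranking and fluency terms over prompt representations.

```python
import math
import random

random.seed(0)

def energy(x):
    """Stand-in energy with minimum at x = 3; exp(-E) is a Gaussian."""
    return (x - 3.0) ** 2

def grad_energy(x):
    return 2.0 * (x - 3.0)

def langevin(x, steps=500, step=0.01):
    # Langevin update: noisy gradient descent. The sqrt(2 * step) noise scale
    # is what makes the chain sample exp(-E) rather than just minimize E.
    for _ in range(steps):
        x = x - step * grad_energy(x) + math.sqrt(2 * step) * random.gauss(0, 1)
    return x
```

The noise term is the point: unlike pure gradient attacks, sampled prompts spread over many low-energy (effective yet fluent) candidates instead of collapsing to one brittle optimum.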
arXiv Detail & Related papers (2025-04-08T08:36:18Z)
- Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning [76.50690734636477]
We introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries.
arXiv Detail & Related papers (2025-03-08T03:14:26Z)
- Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models [40.21540137079309]
Long-context large language models (LLMs) enable full ranking of all passages within a single inference. We show that full ranking with long-context LLMs can deliver superior performance in the supervised fine-tuning setting. We propose a new complete listwise label construction approach and a novel importance-aware learning objective for full ranking.
arXiv Detail & Related papers (2024-12-19T06:44:59Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
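The sliding-window strategy this paper (and the previous one) contrasts with can be sketched concretely: rerank overlapping windows from the bottom of the list upward, so strong candidates bubble toward the top across iterations. The scoring function here is a stand-in for an LLM listwise call; window and stride values are illustrative.

```python
def rerank_window(items, score):
    """Stand-in for one LLM listwise call over a small window."""
    return sorted(items, key=score, reverse=True)

def sliding_window_rerank(items, score, window=4, stride=2):
    # Process back-to-front with overlap (window - stride items shared),
    # so a relevant item deep in the list can climb across windows.
    ranked = list(items)
    start = max(len(ranked) - window, 0)
    while True:
        ranked[start:start + window] = rerank_window(
            ranked[start:start + window], score)
        if start == 0:
            return ranked
        start = max(start - stride, 0)

ranked = sliding_window_rerank([3, 9, 1, 7, 5, 8, 2, 6], score=lambda x: x)
print(ranked)  # [9, 8, 7, 3, 6, 1, 5, 2]
```

Note the limitation the abstract targets: only the head of the list is fully ordered, and each window sees only local context; the self-calibrated global relevance scores are meant to remove exactly this iteration.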
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- FIRST: Faster Improved Listwise Reranking with Single Token Decoding [56.727761901751194]
We introduce FIRST, a novel listwise LLM reranking approach that leverages the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates.
Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark.
Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.
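The single-token idea reduces to one line once the logits are in hand: instead of autoregressively decoding a full permutation of identifiers, read the logits the model assigns to each candidate's identifier at the first generated position and sort by them. The logit values below are invented for illustration.

```python
# Hypothetical logits for candidate identifiers "A".."E" at position one.
first_token_logits = {"A": 1.2, "B": 3.4, "C": -0.5, "D": 2.0, "E": 0.1}

def rank_from_first_token(logits):
    """One forward pass, no multi-step decoding: the first-position logit of
    each identifier serves as its relevance score."""
    return sorted(logits, key=logits.get, reverse=True)

print(rank_from_first_token(first_token_logits))  # ['B', 'D', 'A', 'E', 'C']
```

This is where the reported ~50% inference speedup comes from: generation cost no longer grows with the number of candidates being ordered.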
arXiv Detail & Related papers (2024-06-21T21:27:50Z)
- Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models [17.420756201557957]
We propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. We introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. Results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness.
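The constrained-decoding step can be sketched as simple logit masking (our reading of the abstract, not the paper's exact implementation): at each decoding step, every token that is not one of the special passage tokens is masked out, so the model can only emit a passage identifier. Token ids and logits below are hypothetical.

```python
import math

def constrained_step(logits, allowed_ids):
    """Mask all logits outside the allowed set, then greedily pick the best
    remaining token. Shrinking the decoding space this way also shortens the
    output, which is where the decoding speedup comes from."""
    masked = {i: (l if i in allowed_ids else -math.inf)
              for i, l in logits.items()}
    return max(masked, key=masked.get)

vocab_logits = {0: 2.5, 1: 0.3, 2: 3.1, 3: -1.0, 4: 1.7}
allowed = {1, 3, 4}   # hypothetical ids of the special passage tokens
print(constrained_step(vocab_logits, allowed))  # 4
```

In a full reranker one would also remove each emitted passage's token from `allowed` so the output is a permutation without repeats.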
arXiv Detail & Related papers (2024-06-21T03:33:51Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
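The contrast the summary draws, soft importance weights from training dynamics instead of a hard confidence threshold, can be illustrated with a toy weighting rule; the specific rule below (mean predicted probability of the distant label across epochs) is our illustrative choice, not necessarily the paper's formula.

```python
def importance_weights(prob_history):
    """prob_history[i] holds one classifier's predicted probability for
    example i's distantly supervised label at each epoch. Averaging over
    epochs gives a soft weight, so noisy examples are down-weighted rather
    than discarded at an arbitrary cutoff."""
    return [sum(h) / len(h) for h in prob_history]

history = [
    [0.90, 0.95, 0.97],  # consistently confident -> high weight
    [0.20, 0.50, 0.40],  # unstable across epochs -> medium weight
    [0.05, 0.10, 0.08],  # likely label noise -> low weight, still used
]
weights = importance_weights(history)
```

Keeping low-weight examples in the loss (scaled down) instead of filtering them is what lets the method use all of the automatically labeled data.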
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [53.78782375511531]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks. This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR). To address concerns about data contamination of LLMs, we collect a new test set called NovelEval. To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.