How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
- URL: http://arxiv.org/abs/2508.16757v1
- Date: Fri, 22 Aug 2025 19:30:04 GMT
- Title: How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
- Authors: Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt,
- Abstract summary: We evaluate state-of-the-art reranking methods, including large language model (LLM)-based, lightweight contextual, and zero-shot approaches.<n>Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts.
- Score: 24.90505576458548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate in total 22 methods, including 40 variants (depending on used LLM) across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches. https://github.com/DataScienceUIBK/llm-reranking-generalization-study
Related papers
- What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations [1.2879523047871226]
GM-Extract is a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables.<n>We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering)<n>While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models.
arXiv Detail & Related papers (2025-11-17T20:50:50Z) - LimRank: Less is More for Reasoning-Intensive Information Reranking [58.32304478331711]
Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks.<n>In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision.
arXiv Detail & Related papers (2025-10-27T17:19:37Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs [1.6332728502735252]
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications.<n>Their reliability remains hampered by challenges in detecting hallucinations.<n>This paper addresses the bottleneck of data annotation by investigating the feasibility of reducing training data requirements.
arXiv Detail & Related papers (2025-05-29T09:50:56Z) - LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly.<n>In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance.<n>This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z) - Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues.<n>We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts.<n>By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z) - Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.<n>In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z) - ICLERB: In-Context Learning Embedding and Reranker Benchmark [45.40331863265474]
In-Context Learning (ICL) enables Large Language Models to perform new tasks by conditioning on prompts with relevant information.<n>Traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem.<n>We propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks.
arXiv Detail & Related papers (2024-11-28T06:28:45Z) - LLM In-Context Recall is Prompt Dependent [0.0]
A model's ability to do this significantly influences its practical efficacy and dependability in real-world applications.
This study demonstrates that an LLM's recall capability is not only contingent upon the prompt's content but also may be compromised by biases in its training data.
arXiv Detail & Related papers (2024-04-13T01:13:59Z) - Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique.
arXiv Detail & Related papers (2024-03-22T14:47:35Z) - BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS)
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z) - On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.