When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search
- URL: http://arxiv.org/abs/2507.02139v1
- Date: Wed, 02 Jul 2025 20:53:51 GMT
- Title: When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search
- Authors: William A. Ingram, Bipasha Banerjee, Edward A. Fox
- Abstract summary: Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines. LLMs often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval. We show that model disagreement is systematic, not random. We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in policy-relevant or thematic search tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines, especially in domains lacking human-labeled data. However, different models often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval. This study examines labeling disagreement between two open-weight LLMs, LLaMA and Qwen, on a corpus of scholarly abstracts related to Sustainable Development Goals (SDGs) 1, 3, and 7. We isolate disagreement subsets and examine their lexical properties, rank-order behavior, and classification predictability. Our results show that model disagreement is systematic, not random: disagreement cases exhibit consistent lexical patterns, produce divergent top-ranked outputs under shared scoring functions, and are distinguishable with AUCs above 0.74 using simple classifiers. These findings suggest that LLM-based filtering introduces structured variability in document retrieval, even under controlled prompting and shared ranking logic. We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in policy-relevant or thematic search tasks.
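As a rough illustration of the kind of analysis the abstract describes, the sketch below contrasts two models' binary relevance labels, treats disagreement as the prediction target, and checks how well a simple lexical classifier can separate disagreement from agreement cases via cross-validated AUC. This is a minimal sketch, not the authors' pipeline: the inputs (`abstracts`, `labels_llama`, `labels_qwen`), the TF-IDF plus logistic-regression setup, and all parameter choices are illustrative assumptions.

```python
# Hypothetical sketch: testing whether LLM label disagreement is predictable
# from lexical features alone, in the spirit of the paper's AUC analysis.
# Assumes `abstracts` (list of str) and binary relevance labels from two LLMs
# (`labels_llama`, `labels_qwen`); all names and settings are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score


def disagreement_auc(abstracts, labels_llama, labels_qwen):
    # Target: 1 where the two models disagree on relevance, 0 where they agree.
    y = (np.asarray(labels_llama) != np.asarray(labels_qwen)).astype(int)

    # Simple lexical representation of each abstract.
    X = TfidfVectorizer(min_df=2, stop_words="english").fit_transform(abstracts)

    # A "simple classifier" trained to predict disagreement; cross-validated
    # probabilities keep the AUC estimate out-of-sample.
    clf = LogisticRegression(max_iter=1000)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

    # AUC well above 0.5 would indicate disagreement is systematic, not random.
    return roc_auc_score(y, proba)
```

Under these assumptions, an AUC well above 0.5 on held-out folds would indicate that disagreement cases carry consistent lexical signal, in line with the paper's report of AUCs above 0.74 from simple classifiers.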
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- CLEAR: Error Analysis via LLM-as-a-Judge Made Easy [9.285203198113917]
We introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations.
arXiv Detail & Related papers (2025-07-24T13:15:21Z)
- LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews [0.9314555897827079]
Systematic literature reviews aim to identify and evaluate all relevant papers on a topic. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings. We propose LGAR, a zero-shot LLM-Guided Abstract Ranker composed of an LLM-based graded relevance scorer and a dense re-ranker.
arXiv Detail & Related papers (2025-05-30T16:18:50Z)
- Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents [64.43980129731587]
We propose a causal-inspired inference-time debiasing method called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall relevance score. Experimental results across three domains demonstrate its superior debiasing effectiveness.
arXiv Detail & Related papers (2025-03-11T17:59:00Z)
- Subjective Logic Encodings [20.458601113219697]
Data perspectivism seeks to leverage inter-annotator disagreement to learn models. Subjective Logic Encodings (SLEs) are a framework for constructing classification targets that explicitly encode annotations as opinions of the annotators.
arXiv Detail & Related papers (2025-02-17T15:14:10Z)
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior works on the logical reasoning ability of large language models for analyzing categorical syllogisms. We first investigate all the possible variations of categorical syllogisms from a purely logical perspective. We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z)
- Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)