Measuring Validity in LLM-based Resume Screening
- URL: http://arxiv.org/abs/2602.18550v1
- Date: Fri, 20 Feb 2026 18:57:52 GMT
- Title: Measuring Validity in LLM-based Resume Screening
- Authors: Jane Castleman, Zeyu Shen, Blossom Metevier, Max Springer, Aleksandra Korolova
- Abstract summary: We construct a large dataset of resumes tailored to specific jobs that are directly comparable with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs. We find that models do not reliably abstain when ranking equally-qualified candidates, and select candidates from different demographic groups at different rates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Resume screening is perceived as a particularly suitable task for LLMs given their ability to analyze natural language; thus many entities rely on general-purpose LLMs without further adapting them to the task. While researchers have shown that some LLMs are biased in their selection rates of different demographics, studies measuring the validity of LLM decisions are limited. One of the difficulties in externally measuring validity stems from lack of access to a large corpus of resumes for which the ground truth ranking is known and that has not already been used for LLM training. In this work, we overcome this challenge by systematically constructing a large dataset of resumes tailored to particular jobs that are directly comparable, with a known ground truth of superiority. We then use the constructed dataset to measure the validity of ranking decisions made by various LLMs, finding that many models are unable to consistently select the resumes describing more qualified candidates. Furthermore, when measuring the validity of decisions, we find that models do not reliably abstain when ranking equally-qualified candidates, and select candidates from different demographic groups at different rates, occasionally prioritizing historically-marginalized candidates. Our proposed framework provides a principled approach to audit LLM resume screeners in the absence of ground truth, offering a crucial tool to independent auditors and developers to ensure the validity of these systems as they are deployed.
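The audit described in the abstract — pairwise resume comparisons with known ground truth, plus abstention and demographic selection-rate checks — can be sketched as a minimal loop. All names here (`audit`, `screen`, the pair schema with `truth` and `groups` keys) are hypothetical illustrations, not the paper's actual interface; the screener is a stub standing in for an LLM call.

```python
# Minimal sketch of a validity audit over resume pairs with known ground
# truth. Each pair records which resume is truly better ("A", "B", or
# "equal") and a demographic group label per resume. These field names and
# the screen() interface are assumptions for illustration only.
from collections import Counter

def audit(pairs, screen):
    """Measure validity, abstention rate, and tie-breaking picks by group."""
    correct = unequal = abstained = equal = 0
    picks = Counter()
    for pair in pairs:
        choice = screen(pair)  # "A", "B", or "abstain"
        if pair["truth"] == "equal":
            equal += 1
            if choice == "abstain":
                abstained += 1
            elif choice in ("A", "B"):
                # Track which demographic group wins when truth is a tie.
                picks[pair["groups"][choice]] += 1
        else:
            unequal += 1
            if choice == pair["truth"]:  # truth names the better resume
                correct += 1
    return {
        "validity": correct / unequal if unequal else None,
        "abstention": abstained / equal if equal else None,
        "tie_picks": dict(picks),
    }

# Toy run with a deterministic stub screener that always answers "A".
pairs = [
    {"truth": "A", "groups": {"A": "g1", "B": "g2"}},
    {"truth": "B", "groups": {"A": "g1", "B": "g2"}},
    {"truth": "equal", "groups": {"A": "g1", "B": "g2"}},
]
report = audit(pairs, lambda p: "A")
```

A real audit would replace the stub with an LLM call and run many pairs per condition; the point is that validity (accuracy on unequal pairs), abstention (behavior on equal pairs), and group-level tie-breaking rates are three separate measurements.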
Related papers
- LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
arXiv Detail & Related papers (2025-10-13T12:57:45Z) - AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights
We show that large language models (LLMs) systematically favor their own generated content over human-written resumes. This bias can be reduced by more than 50% through simple interventions targeting LLMs' self-recognition capabilities. These findings highlight an emerging but previously overlooked risk in AI-assisted decision making.
arXiv Detail & Related papers (2025-08-30T11:40:11Z) - Evaluating how LLM annotations represent diverse views on contentious topics [3.405231040967506]
We show that generative large language models (LLMs) tend to be biased in the same directions on the same demographic categories within the same datasets. We conclude with a discussion of the implications for researchers and practitioners using LLMs for automated data annotation tasks.
arXiv Detail & Related papers (2025-03-29T22:53:15Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators. It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation
Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation. We propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot prompt learning LLM "trees" with their outputs aggregated via confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries with both feature and value granularity.
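The confidence-based weighted voting summarized above can be sketched briefly. The `(value, confidence)` interface and function name are assumptions for illustration, not LLM-Forest's actual API.

```python
# Hedged sketch of confidence-weighted voting: several LLM "trees" each
# return a candidate imputed value plus a confidence score, and the forest
# picks the value with the largest total confidence mass.
from collections import defaultdict

def weighted_vote(tree_outputs):
    """tree_outputs: list of (value, confidence) pairs, one per LLM tree."""
    scores = defaultdict(float)
    for value, confidence in tree_outputs:
        scores[value] += confidence
    return max(scores, key=scores.get)

# Two low-confidence votes for 42 outweigh one high-confidence vote for 17,
# since 0.4 + 0.5 > 0.8.
result = weighted_vote([(42, 0.4), (17, 0.8), (42, 0.5)])
```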
arXiv Detail & Related papers (2024-10-28T20:42:46Z) - PRISM: A Methodology for Auditing Biases in Large Language Models
PRISM is a flexible, inquiry-based methodology for auditing Large Language Models.
It seeks to elicit such positions indirectly through task-based inquiry prompting rather than direct inquiry of said preferences.
arXiv Detail & Related papers (2024-10-24T16:57:20Z) - Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Despite the excellence of LLM-as-a-Judge in many domains, its potential issues remain under-explored, undermining its reliability and the scope of its utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z) - To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity
This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities.
Experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.
arXiv Detail & Related papers (2024-07-24T09:48:48Z) - Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation
Even the most advanced LLMs experience uncertainty in their outputs, often producing varied results on different runs or when faced with minor changes in input.
We propose and analyze three discriminative prompts: direct, inverse, and hybrid.
Our insights reveal which discriminative prompt is most promising and when to use it.
arXiv Detail & Related papers (2024-06-27T02:26:47Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language Models (LLMs).
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.