Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences
- URL: http://arxiv.org/abs/2601.11379v1
- Date: Fri, 16 Jan 2026 15:38:03 GMT
- Title: Evaluating LLM Behavior in Hiring: Implicit Weights, Fairness Across Groups, and Alignment with Human Preferences
- Authors: Morgane Hoffmann, Emma Jouffroy, Warren Jouanneau, Marc Palyart, Charles Pebereau
- Abstract summary: We propose a framework to evaluate an LLM's decision logic in recruitment. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups.
- Score: 0.8155575318208629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences, or broader societal norms. We propose a framework to evaluate an LLM's decision logic in recruitment, drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how an LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While the model shows minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights across demographic groups.
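The full factorial design mentioned in the abstract can be read as a regression problem: vary each match-relevant attribute over a small set of levels, score every combination with the LLM, and estimate the implicit weight of each attribute (and its interaction with demographic group) by OLS. The sketch below is a minimal illustration of that idea, not the authors' implementation; the attribute names, levels, and the `llm_fit_score` stub are assumptions.

```python
# Illustrative sketch only (not the paper's code): estimate the implicit
# weights an LLM places on candidate attributes via a full factorial design.
# Attribute names, levels, and the llm_fit_score stub are hypothetical.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical match-relevant attributes, each varied over discrete levels.
levels = {
    "skill_match": [0, 1],       # 0 = partial overlap with project, 1 = full
    "experience":  [2, 5, 10],   # years of relevant experience
    "rating":      [3.5, 4.5],   # marketplace rating
    "gender":      ["F", "M"],   # demographic attribute for subgroup analysis
}

rng = np.random.default_rng(0)

def llm_fit_score(profile: dict) -> float:
    """Placeholder for an LLM call that rates freelancer-project fit on a
    0-10 scale from a structured prompt; replace with a real model call."""
    return float(rng.uniform(0, 10))

# Full factorial design: score every combination of attribute levels.
rows = []
for combo in itertools.product(*levels.values()):
    profile = dict(zip(levels.keys(), combo))
    rows.append({**profile, "score": llm_fit_score(profile)})
df = pd.DataFrame(rows)

# OLS with an interaction term recovers main-effect weights and
# group-specific (intersectional) differences in how a signal is valued.
model = smf.ols(
    "score ~ skill_match + experience + rating + C(gender)"
    " + experience:C(gender)",
    data=df,
).fit()
print(model.summary())
```

With a real scoring function in place of the stub, the `experience:C(gender)` coefficient would directly flag the kind of intersectional differences in productivity-signal weights that the abstract reports.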
Related papers
- Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation [48.2168236140771]
This study investigates whether LLMs exhibit bias in team composition and task assignment. Using three LLMs and 3,000 simulated decisions, we find systematic disparities. Our findings indicate that LLMs exacerbate demographic inequities in software engineering contexts.
arXiv Detail & Related papers (2026-01-07T12:13:22Z) - On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating the alignment of large language models (LLMs) requires them to be helpful, honest, safe, and to precisely follow human instructions. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z) - Noise, Adaptation, and Strategy: Assessing LLM Fidelity in Decision-Making [0.030586855806896043]
Large language models (LLMs) are increasingly used in social science simulations. We propose a process-oriented evaluation framework to examine how LLM agents adapt under different levels of external guidance and human-derived noise. We find that LLMs, by default, converge on stable and conservative strategies that diverge from observed human behaviors.
arXiv Detail & Related papers (2025-08-21T18:55:53Z) - Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences? [5.542420010310746]
A critical, yet understudied, issue is the potential divergence between an LLM's stated preferences and its revealed preferences. This work formally defines and proposes a method to measure this preference deviation. Our study will be crucial for integrating LLMs into services, especially those that interact directly with humans.
arXiv Detail & Related papers (2025-05-31T23:38:48Z) - Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education [8.235367170516769]
Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context.
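A common way to surface this kind of bias, though not necessarily the exact protocol of the paper above, is a counterfactual audit: hold the resume and job description fixed, vary only the demographic signal (for example, the candidate's name), and compare the model's match scores. A minimal sketch, with hypothetical names, template, and a placeholder scoring function:

```python
# Illustrative counterfactual audit for job-resume matching; the name lists,
# template, and llm_match_score stub are hypothetical placeholders.
from statistics import mean

JOB = "Backend engineer, Python/SQL, 5+ years of experience."
RESUME_TEMPLATE = "{name}. Six years of backend experience in Python and SQL."

NAMES = {
    "group_a": ["Emily Walsh", "Greg Baker"],
    "group_b": ["Lakisha Washington", "Jamal Robinson"],
}

def llm_match_score(job: str, resume: str) -> float:
    """Placeholder for an LLM call returning a 0-1 match score."""
    return 0.8  # replace with a real prompt + API call

group_means = {
    group: mean(llm_match_score(JOB, RESUME_TEMPLATE.format(name=n)) for n in names)
    for group, names in NAMES.items()
}
print("Mean score per group:", group_means)
print("Gap (group_a - group_b):", group_means["group_a"] - group_means["group_b"])
```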
arXiv Detail & Related papers (2025-03-24T22:11:22Z) - Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values [13.798198972161657]
A number of societal problems involve the distribution of resources, where fairness, along with economic efficiency, plays a critical role in the desirability of outcomes. This paper examines whether large language models (LLMs) adhere to fundamental fairness concepts and investigates their alignment with human preferences.
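Fairness notions commonly studied in this line of work, such as envy-freeness, are mechanical to check once an allocation is proposed, which is what makes comparing LLM-generated allocations against them tractable. The toy check below uses made-up agents, items, and valuations and is not drawn from the paper:

```python
# Toy envy-freeness check for an allocation an LLM might propose;
# the agents, items, and valuations are made up for illustration.
valuations = {                      # value each agent assigns to each item
    "alice": {"item1": 6, "item2": 2, "item3": 4},
    "bob":   {"item1": 3, "item2": 5, "item3": 4},
}
allocation = {"alice": ["item1"], "bob": ["item2", "item3"]}

def bundle_value(agent: str, bundle: list[str]) -> int:
    return sum(valuations[agent][item] for item in bundle)

# Envy-freeness: no agent values another agent's bundle above their own.
envy_free = all(
    bundle_value(a, allocation[a]) >= bundle_value(a, allocation[b])
    for a in allocation
    for b in allocation
)
print("envy-free:", envy_free)
```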
arXiv Detail & Related papers (2025-02-01T04:24:47Z) - Uncovering Factor Level Preferences to Improve Human-Model Alignment [58.50191593880829]
We introduce PROFILE, a framework that uncovers and quantifies the influence of specific factors driving preferences.
PROFILE's factor-level analysis explains the 'why' behind human-model alignment and misalignment.
We demonstrate how leveraging factor-level insights, including addressing misaligned factors, can improve alignment with human preferences.
arXiv Detail & Related papers (2024-10-09T15:02:34Z) - CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks. To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets. We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z) - Few-Shot Fairness: Unveiling LLM's Potential for Fairness-Aware Classification [7.696798306913988]
We introduce a framework outlining fairness regulations aligned with various fairness definitions.
We explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using RAG.
Experiments conducted with different LLMs indicate that GPT-4 delivers superior results in terms of both accuracy and fairness compared to other models.
arXiv Detail & Related papers (2024-02-28T17:29:27Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.