Auditing the Use of Language Models to Guide Hiring Decisions
- URL: http://arxiv.org/abs/2404.03086v1
- Date: Wed, 3 Apr 2024 22:01:26 GMT
- Title: Auditing the Use of Language Models to Guide Hiring Decisions
- Authors: Johann D. Gaebler, Sharad Goel, Aziz Huq, Prasanna Tambe
- Abstract summary: Regulatory efforts to protect against algorithmic bias have taken on increased urgency with rapid advances in large language models.
Current regulations -- as well as the scientific literature -- provide little guidance on how to conduct these assessments.
Here we propose and investigate one approach for auditing algorithms: correspondence experiments.
- Score: 2.949890760187898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regulatory efforts to protect against algorithmic bias have taken on increased urgency with rapid advances in large language models (LLMs), which are machine learning models that can achieve performance rivaling human experts on a wide array of tasks. A key theme of these initiatives is algorithmic "auditing," but current regulations -- as well as the scientific literature -- provide little guidance on how to conduct these assessments. Here we propose and investigate one approach for auditing algorithms: correspondence experiments, a widely applied tool for detecting bias in human judgements. In the employment context, correspondence experiments aim to measure the extent to which race and gender impact decisions by experimentally manipulating elements of submitted application materials that suggest an applicant's demographic traits, such as their listed name. We apply this method to audit candidate assessments produced by several state-of-the-art LLMs, using a novel corpus of applications to K-12 teaching positions in a large public school district. We find evidence of moderate race and gender disparities, a pattern largely robust to varying the types of application material input to the models, as well as the framing of the task to the LLMs. We conclude by discussing some important limitations of correspondence experiments for auditing algorithms.
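The audit design described in the abstract can be illustrated with a minimal sketch. The `query_llm` helper, the prompt wording, the name lists, and the 1-10 scoring scale below are illustrative assumptions, not the authors' actual materials or protocol.

```python
# Minimal sketch of a correspondence-experiment audit of an LLM screener.
# Assumption: `query_llm` wraps whatever model is being audited and returns
# a numeric rating; replace the stub with a real model call.
import random
import statistics

# Names chosen only to signal a demographic trait in this toy example.
NAMES = {
    "group_a": ["Emily Walsh", "Greg Baker"],
    "group_b": ["Lakisha Washington", "Jamal Robinson"],
}

PROMPT = (
    "You are screening applicants for a K-12 teaching position. "
    "Rate the following application from 1 (weak) to 10 (strong). "
    "Reply with a single number.\n\nApplicant: {name}\n{application}"
)

def query_llm(prompt: str) -> float:
    """Placeholder for a call to the audited model (e.g., a chat API)."""
    return random.uniform(1, 10)  # stand-in; swap in a real model call

def audit(applications: list[str], n_trials: int = 50) -> float:
    """Estimate the mean score gap between the two name groups."""
    scores = {"group_a": [], "group_b": []}
    for _ in range(n_trials):
        application = random.choice(applications)
        for group, names in NAMES.items():
            name = random.choice(names)  # the experimentally manipulated signal
            prompt = PROMPT.format(name=name, application=application)
            scores[group].append(query_llm(prompt))
    return statistics.mean(scores["group_a"]) - statistics.mean(scores["group_b"])

if __name__ == "__main__":
    gap = audit(["(application text would go here)"])
    print(f"Estimated score disparity (group_a - group_b): {gap:+.2f}")
```

Because the application text is held fixed while only the demographic signal (the listed name) varies, any systematic gap in mean scores can be attributed to the manipulated trait, which is the core logic of a correspondence experiment.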
Related papers
- NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z) - Whither Bias Goes, I Will Go: An Integrative, Systematic Review of Algorithmic Bias Mitigation [1.0470286407954037]
Concerns have been raised that machine learning (ML) models may be biased and perpetuate or exacerbate inequality.
We present a four-stage model of developing ML assessments and applying bias mitigation methods.
arXiv Detail & Related papers (2024-10-21T02:32:14Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Entity Extraction from High-Level Corruption Schemes via Large Language Models [4.820586736502356]
This article proposes a new micro-benchmark dataset for algorithms and models that identify individuals and organizations in news articles.
Experimental efforts are also reported, using this dataset, to identify individuals and organizations in financial-crime-related articles.
arXiv Detail & Related papers (2024-09-05T10:27:32Z) - Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making [1.3812010983144802]
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
arXiv Detail & Related papers (2024-06-25T20:52:31Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Active Learning Principles for In-Context Learning with Large Language Models [65.09970281795769]
This paper investigates how Active Learning algorithms can serve as effective demonstration selection methods for in-context learning.
We show that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples; a rough sketch of this selection rule appears after this list.
arXiv Detail & Related papers (2023-05-23T17:16:04Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain converted total scores for LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - BAD: BiAs Detection for Large Language Models in the context of candidate screening [6.47452771256903]
This project aims to quantify instances of social bias in ChatGPT and other OpenAI LLMs in the context of candidate screening.
We will show how the use of these models could perpetuate existing biases and inequalities in the hiring process.
arXiv Detail & Related papers (2023-05-17T17:47:31Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
There are claims that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - Individual Explanations in Machine Learning Models: A Survey for Practitioners [69.02688684221265]
The use of sophisticated statistical models that influence decisions in domains of high societal relevance is on the rise.
Many governments, institutions, and companies are reluctant to adopt them, as their output is often difficult to explain in human-interpretable ways.
Recently, the academic literature has proposed a substantial number of methods for providing interpretable explanations of machine learning models.
arXiv Detail & Related papers (2021-04-09T01:46:34Z)
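For the active-learning demonstration-selection idea summarized in the entry on Active Learning Principles for In-Context Learning with Large Language Models, one way to score candidate demonstrations is to combine similarity to the test example with low model uncertainty. The sketch below is an assumed scoring rule for illustration, not that paper's exact method; the TF-IDF similarity, the uncertainty inputs, and the `alpha` weighting are all stand-ins.

```python
# Sketch: rank labeled pool examples by similarity to the test example
# and by low uncertainty, then keep the top-k as in-context demonstrations.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_demonstrations(pool, uncertainties, test_example, k=4, alpha=0.5):
    """pool: candidate demonstration texts
    uncertainties: per-example uncertainty in [0, 1] (e.g., predictive entropy)
    alpha: trade-off between similarity and confidence"""
    vec = TfidfVectorizer().fit(pool + [test_example])
    sims = cosine_similarity(vec.transform([test_example]), vec.transform(pool))[0]
    scores = alpha * sims + (1 - alpha) * (1 - np.asarray(uncertainties))
    top = np.argsort(scores)[::-1][:k]
    return [pool[i] for i in top]

demos = select_demonstrations(
    pool=["example A ...", "example B ...", "example C ..."],
    uncertainties=[0.1, 0.8, 0.3],  # stand-in values
    test_example="the query we want to answer",
    k=2,
)
print(demos)
```

In practice the uncertainty term would come from the model's own predictive distribution over the pool (for example, the entropy of its predicted labels), which is the part an active-learning framing supplies.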