Auditing the Use of Language Models to Guide Hiring Decisions
- URL: http://arxiv.org/abs/2404.03086v1
- Date: Wed, 3 Apr 2024 22:01:26 GMT
- Authors: Johann D. Gaebler, Sharad Goel, Aziz Huq, Prasanna Tambe
- Abstract summary: Regulatory efforts to protect against algorithmic bias have taken on increased urgency with rapid advances in large language models.
Current regulations -- as well as the scientific literature -- provide little guidance on how to conduct algorithmic audits.
Here we propose and investigate one approach for auditing algorithms: correspondence experiments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regulatory efforts to protect against algorithmic bias have taken on increased urgency with rapid advances in large language models (LLMs), which are machine learning models that can achieve performance rivaling human experts on a wide array of tasks. A key theme of these initiatives is algorithmic "auditing," but current regulations -- as well as the scientific literature -- provide little guidance on how to conduct these assessments. Here we propose and investigate one approach for auditing algorithms: correspondence experiments, a widely applied tool for detecting bias in human judgements. In the employment context, correspondence experiments aim to measure the extent to which race and gender impact decisions by experimentally manipulating elements of submitted application materials that suggest an applicant's demographic traits, such as their listed name. We apply this method to audit candidate assessments produced by several state-of-the-art LLMs, using a novel corpus of applications to K-12 teaching positions in a large public school district. We find evidence of moderate race and gender disparities, a pattern largely robust to varying the types of application material input to the models, as well as the framing of the task to the LLMs. We conclude by discussing some important limitations of correspondence experiments for auditing algorithms.
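The correspondence-experiment idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `correspondence_audit`, `toy_score`, the `{name}` placeholder, the group labels, and the example names are all hypothetical. In a real audit, `score_fn` would wrap a call to the LLM under test, and the name lists would be chosen to signal the demographic traits of interest.

```python
import random
from statistics import mean
from typing import Callable, Dict, List


def correspondence_audit(
    applications: List[str],
    names_by_group: Dict[str, List[str]],
    score_fn: Callable[[str], float],
    seed: int = 0,
) -> Dict[str, float]:
    """Return the mean score per group when only the listed name varies."""
    rng = random.Random(seed)
    scores: Dict[str, List[float]] = {group: [] for group in names_by_group}
    for template in applications:
        for group, names in names_by_group.items():
            # Same application text for every group; only the name changes.
            name = rng.choice(names)
            scores[group].append(score_fn(template.replace("{name}", name)))
    return {group: mean(vals) for group, vals in scores.items()}


def toy_score(text: str) -> float:
    """Hypothetical stand-in for an LLM rating, deliberately biased for the demo."""
    bonus = 1.0 if "Alex" in text else 0.0
    return 5.0 + bonus + 0.01 * len(text.split())


# Hypothetical application templates and name lists (assumptions for illustration).
applications = [
    "Applicant: {name}. Ten years of K-12 classroom experience.",
    "Applicant: {name}. Newly certified, one year as a substitute teacher.",
]
names_by_group = {"group_a": ["Alex Smith"], "group_b": ["Blake Jones"]}

means = correspondence_audit(applications, names_by_group, toy_score)
gap = means["group_a"] - means["group_b"]
print(f"mean score gap: {gap:.2f}")
```

Because every group is scored on identical application text, any gap in mean scores is attributable to the manipulated name alone. The toy scorer is rigged to favor one name so that the audit has something to detect.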
Related papers
- How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models
This study evaluates large language models (LLMs) across diverse domains, including cybersecurity, medicine, and finance.
The results indicate that model size and types of prompts used for inference significantly influenced response length and quality.
Date: 2024-06-25T20:52:31Z
- Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review
Large Language Models (LLMs) have been developed to assist programming tasks including the generation of program code from natural language input.
This paper provides a critical review of the existing work on the testing and evaluation of these tools.
Date: 2024-06-18T14:25:34Z
- Generative Judge for Evaluating Alignment
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
Date: 2023-10-09T07:27:15Z
- Toward Operationalizing Pipeline-aware ML Fairness: A Research Agenda for Developing Practical Guidelines and Tools
Recent work has called on the ML community to take a more holistic approach to tackle fairness issues.
We first demonstrate that without clear guidelines and toolkits, even individuals with specialized ML knowledge find it challenging to hypothesize how various design choices influence model behavior.
We then consult the fair-ML literature to understand the progress to date toward operationalizing the pipeline-aware approach.
Date: 2023-09-29T15:48:26Z
- Active Learning Principles for In-Context Learning with Large Language Models
This paper investigates how Active Learning algorithms can serve as effective demonstration selection methods for in-context learning.
We show that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.
Date: 2023-05-23T17:16:04Z
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
Date: 2023-05-21T14:39:28Z
- BAD: BiAs Detection for Large Language Models in the context of candidate screening
This project aims to quantify the instances of social bias in ChatGPT and other OpenAI LLMs in the context of candidate screening.
We will show how the use of these models could perpetuate existing biases and inequalities in the hiring process.
Date: 2023-05-17T17:47:31Z
- Perspectives on Large Language Models for Relevance Judgment
Large language models (LLMs) claim that they can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
Date: 2023-04-13T13:08:38Z
- Testing Occupational Gender Bias in Language Models: Towards Robust Measurement and Zero-Shot Debiasing
Large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics.
We introduce a list of desiderata for robustly measuring biases in generative language models.
We then use this benchmark to test several state-of-the-art open-source LLMs, including Llama, Mistral, and their instruction-tuned versions.
Date: 2022-12-20T22:41:24Z
- Individual Explanations in Machine Learning Models: A Survey for Practitioners
The use of sophisticated statistical models that influence decisions in domains of high societal relevance is on the rise.
Many governments, institutions, and companies are reluctant to adopt them because their output is often difficult to explain in human-interpretable ways.
Recently, the academic literature has proposed a substantial number of methods for providing interpretable explanations of machine learning models.
Date: 2021-04-09T01:46:34Z