Synthetic CVs To Build and Test Fairness-Aware Hiring Tools
- URL: http://arxiv.org/abs/2508.21179v1
- Date: Thu, 28 Aug 2025 19:35:32 GMT
- Title: Synthetic CVs To Build and Test Fairness-Aware Hiring Tools
- Authors: Jorge Saldivar, Anna Gatzioura, Carlos Castillo
- Abstract summary: This paper introduces an approach for building a synthetic dataset of CVs with features modeled on real materials collected through a data donation campaign. The resulting dataset of 1,730 CVs is presented, which we envision as a potential benchmarking standard for research on algorithmic hiring discrimination.
- Score: 2.558250634293445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Algorithmic hiring has become increasingly necessary in some sectors as it promises to deal with hundreds or even thousands of applicants. At the heart of these systems are algorithms designed to retrieve and rank candidate profiles, which are usually represented by Curricula Vitae (CVs). Research has shown, however, that such technologies can inadvertently introduce bias, leading to discrimination based on factors such as candidates' age, gender, or national origin. Developing methods to measure, mitigate, and explain bias in algorithmic hiring, as well as to evaluate and compare fairness techniques before deployment, requires sets of CVs that reflect the characteristics of people from diverse backgrounds. However, datasets of these characteristics that can be used to conduct this research do not exist. To address this limitation, this paper introduces an approach for building a synthetic dataset of CVs with features modeled on real materials collected through a data donation campaign. Additionally, the resulting dataset of 1,730 CVs is presented, which we envision as a potential benchmarking standard for research on algorithmic hiring discrimination.
Related papers
- Mapping Stakeholder Needs to Multi-Sided Fairness in Candidate Recommendation for Algorithmic Hiring [0.0]
This paper presents a multi-stakeholder approach to fairness in a candidate recommender system. Job seekers, companies, recruiters, and other job portal employees were interviewed. We use these interviews to explore their lived experiences of unfairness in hiring.
arXiv Detail & Related papers (2025-07-29T11:37:19Z)
- Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond [42.710392315326104]
We present three common data biases and study their individual and joint effect on algorithmic discrimination. We develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented.
arXiv Detail & Related papers (2025-07-09T15:52:11Z)
- Study of the influence of a biased database on the prediction of standard algorithms for selecting the best candidate for an interview [0.4241054493737716]
We generate data mimicking external (discrimination) and internal biases (self-censorship). We study the influence of the anonymisation of files on the quality of predictions.
arXiv Detail & Related papers (2025-05-05T12:24:31Z)
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [70.45113777449373]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper. A key challenge in comparing existing algorithms and developing better algorithms is the lack of publicly available gold-standard data. We collect a novel dataset of similarity scores that we release to the research community.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Demographic-Reliant Algorithmic Fairness: Characterizing the Risks of Demographic Data Collection in the Pursuit of Fairness [0.0]
We consider calls to collect more data on demographics to enable algorithmic fairness.
We show how these techniques largely ignore broader questions of data governance and systemic oppression.
arXiv Detail & Related papers (2022-04-18T04:50:09Z)
- Representation Bias in Data: A Survey on Identification and Resolution Techniques [26.142021257838564]
Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately.
Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later.
arXiv Detail & Related papers (2022-03-22T16:30:22Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Bias in Multimodal AI: Testbed for Fair Automatic Recruitment [73.85525896663371]
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
We train automatic recruitment algorithms using a set of multimodal synthetic profiles consciously scored with gender and racial biases.
Our methodology and results show how to generate fairer AI-based tools in general, and in particular fairer automated recruitment systems.
arXiv Detail & Related papers (2020-04-15T15:58:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.