"You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
- URL: http://arxiv.org/abs/2510.19167v2
- Date: Thu, 23 Oct 2025 08:28:01 GMT
- Title: "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations
- Authors: Dingjie Fu, Dianxing Shi,
- Abstract summary: This paper investigates whether Large Language Models (LLMs) can pass hiring evaluations. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to prior expectations of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions.
- Score: 1.1254231171451319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to prior expectations of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: all evaluated LLMs fail to pass the hiring evaluation.
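The abstract describes the evaluation procedure only at a high level: prompt each LLM with the assessment items and compare its answers with the company-referenced solutions. The following is a minimal sketch of that kind of scoring loop; the multiple-choice answer format, the exact-match scoring rule, and the 60% pass threshold are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    prompt: str
    reference: str  # company-referenced solution, e.g. "B" (assumed format)

def evaluate(ask: Callable[[str], str], items: list[Item],
             pass_threshold: float = 0.6) -> bool:
    """Return True if the model's accuracy meets the (assumed) pass threshold."""
    correct = sum(ask(it.prompt).strip().upper() == it.reference.upper()
                  for it in items)
    accuracy = correct / len(items)
    print(f"accuracy = {accuracy:.2%}")
    return accuracy >= pass_threshold

# Usage with a stand-in "model" that always answers "B":
items = [Item("Q1: ...", "B"), Item("Q2: ...", "C")]
print("passed" if evaluate(lambda prompt: "B", items) else "rejected")
```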
Related papers
- ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [94.40918390309186]
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-21T17:59:44Z) - Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research [5.0043780915457114]
Large language models (LLMs) are entering qualitative research, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA). We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA. We conducted blind evaluations with four expert evaluators who applied rubrics derived from Braun and Clarke's quality criteria.
arXiv Detail & Related papers (2025-10-21T09:29:18Z) - LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models [13.713870642186254]
Large language models (LLMs) demonstrate remarkable capabilities across various tasks. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. We propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.
arXiv Detail & Related papers (2025-07-30T03:50:46Z) - AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening [12.845918958645676]
We propose a multi-agent framework for resume screening using Large Language Models (LLMs). The framework consists of four core agents: a resume extractor, an evaluator, a summarizer, and a score formatter. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition.
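A hedged sketch of the four-stage agent pipeline this summary names (extractor, evaluator, summarizer, score formatter). The stage prompts and the plain string hand-off between stages are assumptions made for illustration; the paper's actual agent interfaces may differ.

```python
from typing import Callable

LLM = Callable[[str], str]  # any function mapping a prompt to a completion

def screen_resume(llm: LLM, resume_text: str, job_description: str) -> str:
    # Stage 1 (resume extractor): pull structured facts out of the raw resume.
    profile = llm(f"Extract skills, experience, and education from:\n{resume_text}")
    # Stage 2 (evaluator): compare the extracted profile with the job description.
    evaluation = llm("Evaluate this candidate against the job requirements.\n"
                     f"Profile:\n{profile}\nJob:\n{job_description}")
    # Stage 3 (summarizer): condense the evaluation.
    summary = llm(f"Summarize the evaluation in three sentences:\n{evaluation}")
    # Stage 4 (score formatter): emit a machine-readable verdict.
    return llm(f"Format the result as 'score: <0-100>, verdict: <text>':\n{summary}")

# Usage with a stand-in model:
print(screen_resume(lambda prompt: "(model output)", "resume ...", "job ..."))
```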
arXiv Detail & Related papers (2025-04-01T12:56:39Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. How reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
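An illustrative sketch of a decompose-then-aggregate scorer of the kind this summary describes: the response is judged on separate criteria and the per-criterion scores are combined with weights. The criteria names, weights, and 1-5 scale below are assumptions for illustration, not the paper's actual rubric.

```python
from typing import Callable

def decompose_and_aggregate(judge: Callable[[str], str], response: str,
                            weights: dict[str, float]) -> float:
    """Score each criterion separately, then combine the scores with weights."""
    scores = {}
    for criterion in weights:
        reply = judge(f"Rate the following response for {criterion} on a 1-5 scale. "
                      f"Reply with a single number.\n{response}")
        scores[criterion] = float(reply.strip())
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())

# Usage with a stand-in judge that always answers "4":
print(decompose_and_aggregate(lambda prompt: "4", "candidate answer ...",
                              {"relevance": 0.5, "fluency": 0.25, "factuality": 0.25}))
```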
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize four core competencies of LLMs: reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can also be easily added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
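A rough sketch of a multi-agent debate referee of the kind ChatEval describes: several judge personas see one another's remarks over a few rounds, then vote on the best answer. The personas, round count, and majority-vote rule are assumptions for illustration, not the paper's actual protocol.

```python
from collections import Counter
from typing import Callable

def debate_and_vote(judges: dict[str, Callable[[str], str]], question: str,
                    answers: dict[str, str], rounds: int = 2) -> str:
    """Judges discuss the candidate answers for a few rounds, then vote."""
    transcript = [f"Question: {question}"]
    transcript += [f"Answer from {model}: {text}" for model, text in answers.items()]
    for _ in range(rounds):
        for name, judge in judges.items():
            remark = judge("\n".join(transcript) +
                           f"\nAs {name}, discuss which answer is better and why.")
            transcript.append(f"{name}: {remark}")
    votes = Counter(judge("\n".join(transcript) +
                          "\nVote: reply with only the name of the best model.").strip()
                    for judge in judges.values())
    return votes.most_common(1)[0][0]

# Usage with two stand-in judges that both favor "model_a":
winner = debate_and_vote({"critic": lambda p: "model_a", "pragmatist": lambda p: "model_a"},
                         "Q: ...", {"model_a": "answer A", "model_b": "answer B"})
print(winner)
```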
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not yet ready to replace human evaluators.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)