Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?
- URL: http://arxiv.org/abs/2410.16285v1
- Date: Sat, 05 Oct 2024 14:37:35 GMT
- Title: Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?
- Authors: John Mavi, Nathan Summers, Sergio Coronado
- Abstract summary: SelfScore is a benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks.
The benchmark evaluates agents on problem complexity and response helpfulness, ensuring transparency and simplicity in its scoring system.
The study raises concerns about the potential displacement of human workers, especially in areas where AI technologies excel.
- Abstract: The current paper presents the development and validation of SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks. Given the increasing integration of AI in industries, particularly within customer service, SelfScore fills a crucial gap by enabling the comparison of automated agents and human workers. The benchmark evaluates agents on problem complexity and response helpfulness, ensuring transparency and simplicity in its scoring system. The study also develops automated LLM agents to assess SelfScore and explores the benefits of Retrieval-Augmented Generation (RAG) for domain-specific tasks, demonstrating that automated LLM agents incorporating RAG outperform those without. All automated LLM agents were observed to perform better than the human control group. Given these results, the study raises concerns about the potential displacement of human workers, especially in areas where AI technologies excel. Ultimately, SelfScore provides a foundational tool for understanding the impact of AI in help desk environments while advocating for ethical considerations in the ongoing transition towards automation.
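The abstract does not disclose SelfScore's exact scoring rule or the agents' implementation, but a minimal sketch can illustrate the comparison it describes: a help-desk agent with and without Retrieval-Augmented Generation, scored on response helpfulness weighted by problem complexity. Everything below (the toy knowledge base, the keyword retriever, the stub agents, and the weighting formula) is a hypothetical illustration under those assumptions, not the paper's method.
```python
# Illustrative sketch only: compares a plain agent vs. a RAG-augmented agent
# on help-desk tickets, using a hypothetical SelfScore-style rubric
# (helpfulness weighted by problem complexity).
from dataclasses import dataclass

@dataclass
class Ticket:
    question: str
    complexity: int  # assumed 1 (simple) .. 5 (hard), assigned by the benchmark

# Toy domain knowledge base standing in for the RAG document store.
KNOWLEDGE_BASE = {
    "vpn": "Reset the VPN client, then re-enter the one-time passcode.",
    "printer": "Clear the print queue and reinstall the network printer driver.",
}

def retrieve(question: str) -> str:
    """Toy retriever: return the first KB entry whose key appears in the question."""
    for key, doc in KNOWLEDGE_BASE.items():
        if key in question.lower():
            return doc
    return ""

def plain_agent(ticket: Ticket) -> str:
    # Stand-in for an LLM call without retrieval.
    return f"Generic troubleshooting advice for: {ticket.question}"

def rag_agent(ticket: Ticket) -> str:
    # Stand-in for an LLM call whose prompt is augmented with retrieved context.
    context = retrieve(ticket.question)
    return f"{context} (context-grounded answer)" if context else plain_agent(ticket)

def selfscore(ticket: Ticket, answer: str) -> float:
    """Hypothetical score: rated helpfulness (0-1) weighted by problem complexity."""
    helpfulness = 1.0 if "context-grounded" in answer else 0.4  # stub helpfulness rater
    return helpfulness * ticket.complexity

tickets = [Ticket("My VPN keeps disconnecting", 3), Ticket("Printer offline", 2)]
for agent in (plain_agent, rag_agent):
    total = sum(selfscore(t, agent(t)) for t in tickets)
    print(agent.__name__, round(total, 2))
```
In this toy setup the RAG-augmented agent scores higher simply because its answers are grounded in the retrieved domain documents, which mirrors (but does not reproduce) the abstract's finding that RAG-equipped agents outperform those without.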
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops [3.729242965449096]
This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries.
The framework achieves optimal performance without human input by autonomously generating and testing hypotheses.
Case studies show significant improvements in output quality, relevance, and actionability.
arXiv Detail & Related papers (2024-12-22T20:08:04Z) - TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We build a self-contained environment with data that mimics a small software company.
We find that with the most competitive agent, 24% of the tasks can be completed autonomously.
This paints a nuanced picture of task automation with LM agents.
arXiv Detail & Related papers (2024-12-18T18:55:40Z) - PentestAgent: Incorporating LLM Agents to Automated Penetration Testing [6.815381197173165]
Manual penetration testing is time-consuming and expensive.
Recent advancements in large language models (LLMs) offer new opportunities for enhancing penetration testing.
We propose PentestAgent, a novel LLM-based automated penetration testing framework.
arXiv Detail & Related papers (2024-11-07T21:10:39Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous agent that supports human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z) - WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z) - Characteristic AI Agents via Large Language Models [40.10858767752735]
This research investigates the performance of Large Language Models in constructing characteristic AI agents.
A dataset called "Character100" is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play.
The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents.
arXiv Detail & Related papers (2024-03-19T02:25:29Z) - AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning [54.47116888545878]
AutoAct is an automatic agent learning framework for QA.
It does not rely on large-scale annotated data or synthetic planning trajectories from closed-source models.
arXiv Detail & Related papers (2024-01-10T16:57:24Z) - TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - Towards LLM-based Autograding for Short Textual Answers [4.853810201626855]
This manuscript evaluates a large language model for the purpose of autograding short textual answers.
Our findings suggest that while "out-of-the-box" LLMs provide a valuable tool, their readiness for independent automated grading remains a work in progress.
arXiv Detail & Related papers (2023-09-09T22:25:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.