Remote Labor Index: Measuring AI Automation of Remote Work
- URL: http://arxiv.org/abs/2510.26787v1
- Date: Thu, 30 Oct 2025 17:58:04 GMT
- Title: Remote Labor Index: Measuring AI Automation of Remote Work
- Authors: Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks
- Abstract summary: AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects.
- Score: 46.53553410123801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
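The headline metric, an automation rate of 2.5%, is the fraction of real-world projects on which an agent's deliverable is judged acceptable. A minimal sketch of such a metric is shown below; the function name and the boolean-outcome representation are illustrative assumptions, not the paper's actual implementation.

```python
def automation_rate(outcomes):
    """Fraction of projects whose AI deliverable was judged acceptable.

    outcomes: list of booleans, True if the agent's work on that
    project was accepted in place of the human freelancer's.
    """
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Hypothetical example: 1 accepted deliverable out of 40 projects.
results = [True] + [False] * 39
print(f"{automation_rate(results):.1%}")  # 2.5%
```

Under this framing, "performing near the floor" means almost every entry in the outcome list is False, so the rate stays close to zero even as agents improve on research benchmarks.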
Related papers
- EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight. We introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
- Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents [0.20999222360659608]
The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents.
arXiv Detail & Related papers (2025-12-24T15:31:59Z)
- AgentEvolver: Towards Efficient Self-Evolving Agent System [51.54882384204726]
We present AgentEvolver, a self-evolving agent system that drives autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: self-questioning, self-navigating, and self-attributing. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.
arXiv Detail & Related papers (2025-11-13T15:14:47Z)
- ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning [49.25518866694287]
We propose ML-Master, a novel AI4AI agent that seamlessly integrates exploration and reasoning by employing a selectively scoped memory mechanism. We evaluate ML-Master on the MLE-Bench, where it achieves a 29.3% average medal rate, significantly surpassing existing methods.
arXiv Detail & Related papers (2025-06-19T17:53:28Z)
- AutoMind: Adaptive Knowledgeable Agent for Automated Data Science [70.33796196103499]
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. Existing frameworks depend on rigid, pre-defined and inflexible coding strategies. We introduce AutoMind, an adaptive, knowledgeable LLM-agent framework.
arXiv Detail & Related papers (2025-06-12T17:59:32Z)
- Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents [9.675331256405443]
Large Language Models (LLMs) have been increasingly used as assistants for data science. In this paper, we survey the evaluation of LLM assistants and agents for data science.
arXiv Detail & Related papers (2025-06-10T13:47:22Z)
- E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing [1.5728609542259502]
This paper presents an intelligent work automation approach in the context of contemporary digital transformation. It integrates generative AI and Intelligent Document Processing technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks.
arXiv Detail & Related papers (2025-05-27T05:21:08Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [55.03911355902567]
We introduce TheAgentCompany, a benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker. We find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture of task automation when simulating LM agents in a setting resembling a real workplace.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
- Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job? [0.0]
SelfScore is a benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks.
The benchmark evaluates agents on problem complexity and response helpfulness, ensuring transparency and simplicity in its scoring system.
The study raises concerns about the potential displacement of human workers, especially in areas where AI technologies excel.
arXiv Detail & Related papers (2024-10-05T14:37:35Z)
- Towards the Terminator Economy: Assessing Job Exposure to AI through LLMs [10.844598404826355]
One-third of U.S. employment is highly exposed to AI, primarily in high-skill jobs requiring a graduate or postgraduate level of education. Even in high-skill occupations, AI exhibits high variability in task substitution, suggesting that AI and humans complement each other within the same occupation. All results, models, and code are freely available online to allow the community to reproduce our results, compare outcomes, and use our work as a benchmark to monitor AI's progress over time.
arXiv Detail & Related papers (2024-07-27T08:14:18Z)
- Induction and Exploitation of Subgoal Automata for Reinforcement Learning [75.55324974788475]
We present ISA, an approach for learning and exploiting subgoals in episodic reinforcement learning (RL) tasks.
ISA interleaves reinforcement learning with the induction of a subgoal automaton, an automaton whose edges are labeled by the task's subgoals.
A subgoal automaton also consists of two special states: a state indicating the successful completion of the task, and a state indicating that the task has finished without succeeding.
arXiv Detail & Related papers (2020-09-08T16:42:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.