SCUBA: Salesforce Computer Use Benchmark
- URL: http://arxiv.org/abs/2509.26506v1
- Date: Tue, 30 Sep 2025 16:48:49 GMT
- Title: SCUBA: Salesforce Computer Use Benchmark
- Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
- Abstract summary: SCUBA is a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings.
- Score: 63.66753028386581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments, with support for parallel execution and fine-grained evaluation metrics that capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings and observe large performance gaps across agent design paradigms and between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve success rates below 5% on SCUBA, while methods built on closed-source models reach up to 39% task success. In the demonstration-augmented setting, task success rates improve to 50% while time and cost drop by 13% and 16%, respectively. These findings highlight both the challenge of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress toward reliable computer-use agents for complex business software ecosystems.
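SCUBA scores agents with fine-grained, milestone-level metrics rather than only end-to-end success. The benchmark's own evaluators are not reproduced here; the snippet below is a minimal Python sketch of milestone-based partial credit, where the `Milestone`/`Task` structures, the binary checks against sandbox state, and the equal weighting are all illustrative assumptions, not SCUBA's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Milestone:
    """One verifiable checkpoint within a task (hypothetical structure)."""
    name: str
    check: Callable[[Dict], bool]  # inspects sandbox state, returns pass/fail

@dataclass
class Task:
    task_id: str
    milestones: List[Milestone] = field(default_factory=list)

def milestone_progress(task: Task, sandbox_state: Dict) -> float:
    """Fraction of milestones satisfied; 1.0 corresponds to full task success.

    Equal weighting is assumed for illustration; the benchmark may weight
    milestones differently.
    """
    if not task.milestones:
        return 0.0
    passed = sum(m.check(sandbox_state) for m in task.milestones)
    return passed / len(task.milestones)

# Example: a sales-representative task checked against a mock sandbox state.
task = Task(
    task_id="create_opportunity",
    milestones=[
        Milestone("record_created", lambda s: "opportunity_id" in s),
        Milestone("stage_set", lambda s: s.get("stage") == "Prospecting"),
    ],
)
print(milestone_progress(task, {"opportunity_id": "006XX", "stage": "Prospecting"}))  # 1.0
```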
Related papers
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development [42.26354337364403]
FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. It incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. Empirical evaluation reveals that a state-of-the-art agentic model, Claude 4.5 Opus, achieves a 74.4% resolved rate on SWE-bench.
arXiv Detail & Related papers (2026-02-11T16:06:32Z)
- SWE-Universe: Scale Real-World Verifiable Environments to Millions [84.63665266236963]
SWE-Universe is a framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). We propose a building agent powered by an efficient custom-trained model to overcome the prevalent challenges of automatic building. We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning.
arXiv Detail & Related papers (2026-02-02T17:20:30Z)
- EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight. We introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z)
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities and can contribute substantially to economic production. We introduce AgencyBench, a benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z)
- GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging [41.754784344572286]
We release GitTaskBench, a benchmark for evaluating code agents in real-world scenarios. Each task pairs a relevant repository with an automated, human-curated evaluation harness. We also propose the alpha-value metric to quantify the economic benefit of agent performance.
arXiv Detail & Related papers (2025-08-26T12:48:05Z)
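GitTaskBench's alpha-value metric quantifies the economic benefit of agent performance; its precise definition lives in the paper. Purely as a hypothetical stand-in (not the paper's formula), an economic-benefit score might net an agent's expected value against the cost of having a human do the task:

```python
def economic_benefit(success_rate: float, task_value: float,
                     agent_cost: float, human_cost: float) -> float:
    """Hypothetical economic-benefit score; NOT GitTaskBench's alpha-value.

    Nets the agent's expected value (success_rate * task_value minus its run
    cost) against simply paying a human to complete the task.
    """
    agent_net = success_rate * task_value - agent_cost
    human_net = task_value - human_cost
    return agent_net - human_net  # > 0: the agent is the better deal

# Example: 60% success on a $50 task at $2 per run vs. a $20 human cost.
print(economic_benefit(0.6, 50.0, 2.0, 20.0))  # -2.0, agent not yet worthwhile
```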
- OpenCUA: Open Foundations for Computer-Use Agents [74.61449905487565]
Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs). We propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our end-to-end agent models demonstrate strong performance across CUA benchmarks.
arXiv Detail & Related papers (2025-08-12T17:52:32Z)
- eSapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation [10.667949307405983]
eSapiens is an AI-as-a-Service (AIaaS) platform engineered around a business-oriented trifecta: proprietary data, operational workflows, and any major Large Language Model (LLM). eSapiens gives businesses full control over their AI assets, keeping everything in-house for AI knowledge retention and data security.
arXiv Detail & Related papers (2025-07-13T11:41:44Z)
- SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments [2.184775414778289]
We introduce SetupBench, a benchmark that isolates the environment-bootstrap skill. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios. We find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%).
arXiv Detail & Related papers (2025-07-11T22:45:07Z)
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs [19.766885088032932]
Software engineering (SWE) has emerged as a crucial testbed for next-generation LLM agents. Most existing datasets are limited to only a few thousand GitHub-sourced instances. We propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets.
arXiv Detail & Related papers (2025-06-24T03:53:36Z)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that a single agent powered by GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
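Crab reports a completion ratio rather than all-or-nothing success. Assuming the usual formulation of partial credit, completed sub-goal checkpoints over all checkpoints in a task's decomposition, a minimal sketch could look like the following; the checkpoint names are illustrative and this is not Crab's actual graph evaluator:

```python
from typing import Set

def completion_ratio(completed: Set[str], checkpoints: Set[str]) -> float:
    """Share of a task's sub-goal checkpoints the agent reached.

    Assumes each task decomposes into independently checkable sub-goals;
    Crab's graph-based evaluator may differ in detail.
    """
    if not checkpoints:
        return 0.0
    return len(completed & checkpoints) / len(checkpoints)

# Example: the agent reached 3 of 5 sub-goals, so the ratio is 0.6.
print(completion_ratio(
    {"open_app", "fill_form", "submit"},
    {"open_app", "fill_form", "submit", "verify_record", "close_app"},
))
```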