AppSelectBench: Application-Level Tool Selection Benchmark
- URL: http://arxiv.org/abs/2511.19957v1
- Date: Tue, 25 Nov 2025 06:06:17 GMT
- Title: AppSelectBench: Application-Level Tool Selection Benchmark
- Authors: Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida
- Abstract summary: AppSelectBench is a benchmark for evaluating application selection in Computer Using Agents (CUAs). It contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, and includes more than one hundred thousand such user tasks.
- Score: 57.03660843195562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application-level reasoning, an essential yet underexplored capability of intelligent CUAs. The source code is available at https://github.com/microsoft/appselectbench.
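The evaluation protocols described in the abstract reduce to a simple measurement: given a user task and a catalog of applications, a selector must name exactly one application, and predictions are scored by exact match against the gold label. A minimal sketch of that loop, in which `select_app`, the task data, and the keyword baseline are all hypothetical illustrations rather than the benchmark's actual harness:

```python
# Hypothetical sketch of an application-selection evaluation loop in the
# spirit of AppSelectBench. `select_app` stands in for any policy
# (random, heuristic, zero-shot, few-shot, or retrieval-augmented);
# the tasks and catalog below are illustrative, not benchmark data.

def evaluate_selection(tasks, catalog, select_app):
    """Return exact-match accuracy of `select_app` over (task, gold_app) pairs."""
    correct = 0
    for task, gold_app in tasks:
        if select_app(task, catalog) == gold_app:
            correct += 1
    return correct / len(tasks)

def keyword_baseline(task, catalog):
    """Trivial heuristic baseline: pick the application whose name shares
    the most words with the task description."""
    words = set(task.lower().split())
    return max(catalog, key=lambda app: len(words & set(app.lower().split())))

tasks = [
    ("open excel to edit the quarterly budget", "Microsoft Excel"),
    ("start a teams call with the group", "Microsoft Teams"),
]
catalog = ["Microsoft Excel", "Microsoft Teams", "Notepad"]
print(evaluate_selection(tasks, catalog, keyword_baseline))
```

Swapping `keyword_baseline` for an LLM-backed selector keeps the same protocol, which is what makes the random/heuristic/model comparisons in the abstract directly comparable.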
Related papers
- Personalized Recommendations via Active Utility-based Pairwise Sampling [1.704905100460915]
We propose a utility-based framework that learns preferences from simple and intuitive pairwise comparisons. A central contribution of our work is a novel utility-based active sampling strategy for preference elicitation.
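The general idea of recovering item utilities from pairwise comparisons can be sketched with a Bradley-Terry-style logistic update. This is an illustration of the generic technique, not the paper's method, and the comparison data is invented:

```python
import math

def fit_utilities(comparisons, n_items, lr=0.1, epochs=200):
    """Learn scalar utilities from (winner, loser) pairs with a
    Bradley-Terry-style logistic gradient update. Illustrative sketch
    of utility learning from pairwise comparisons, not the paper's
    active-sampling framework."""
    u = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Probability the model currently assigns to the observed outcome.
            p_win = 1.0 / (1.0 + math.exp(u[loser] - u[winner]))
            grad = 1.0 - p_win  # push utilities apart when the model is unsure
            u[winner] += lr * grad
            u[loser] -= lr * grad
    return u

# Item 0 beats 1, item 1 beats 2, item 0 beats 2 → learned ordering 0 > 1 > 2.
u = fit_utilities([(0, 1), (1, 2), (0, 2)], 3)
print(u[0] > u[1] > u[2])  # True
```

An active-sampling strategy like the paper's would sit on top of such a learner, choosing which pair to query next rather than consuming a fixed comparison set.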
arXiv Detail & Related papers (2025-08-12T19:09:33Z) - UserBench: An Interactive Gym Environment for User-Centric Agents [110.77212949007958]
Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, but their ability to proactively collaborate with users remains underexplored. We introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions.
arXiv Detail & Related papers (2025-07-29T17:34:12Z) - Implementing Rational Choice Functions with LLMs and Measuring their Alignment with User Preferences [15.72977233489024]
We put forward design principles for using large language models to implement rational choice functions. We demonstrate the applicability of our approach through an empirical study in a practical application of an intelligent user interface (IUI) in the automotive domain.
arXiv Detail & Related papers (2025-04-22T09:08:21Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
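The efficiency framing above can be made concrete with a simple ratio. Assuming "tool productivity" means correct answers obtained per tool call (an assumption for illustration; the paper may define the metric differently), the numbers below show how fewer calls at similar accuracy translate into a multiplicative gain:

```python
def tool_productivity(n_correct, n_tool_calls):
    """Correct answers per external tool call (higher is better).
    This definition is assumed for illustration only."""
    if n_tool_calls == 0:
        return float("inf") if n_correct > 0 else 0.0
    return n_correct / n_tool_calls

# Invented numbers: a baseline answers 80 queries correctly with 400 calls,
# an efficiency-trained policy answers 78 correctly with only 120 calls.
baseline = tool_productivity(80, 400)   # 0.2
efficient = tool_productivity(78, 120)  # 0.65
print(f"relative productivity gain: {efficient / baseline:.2f}x")
```

The point of the metric is that optimizing accuracy alone leaves the denominator unconstrained; rewarding answers per call penalizes unnecessary invocations.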
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - Employing Continuous Integration inspired workflows for benchmarking of scientific software -- a use case on numerical cut cell quadrature [0.3387808070669509]
This paper presents a proven approach that utilizes established Continuous Integration tools and practices to achieve high automation of benchmark execution and reporting. Our use case is numerical integration (quadrature) on arbitrary domains bounded by implicitly or parametrically defined curves or surfaces in 2D or 3D.
arXiv Detail & Related papers (2025-03-21T14:42:24Z) - FREYR: A Framework for Recognizing and Executing Your Requests [2.4797200957733576]
This paper introduces FREYR, a streamlined framework that modularizes the tool-usage process into separate steps. We show that FREYR achieves superior performance compared to conventional tool-usage methods. We evaluate FREYR on a set of real-world test cases specific to video game design and compare it against traditional tool usage as provided by the Ollama API.
arXiv Detail & Related papers (2025-01-21T11:08:18Z) - Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of the exemplars included in the prompt greatly impacts performance. Existing methods fail to adequately account for the impact of exemplar ordering on performance.
arXiv Detail & Related papers (2024-05-25T08:23:05Z) - Supervised Embedded Methods for Hyperspectral Band Selection [12.09273192079783]
Hyperspectral Imaging (HSI) captures rich spectral information across contiguous wavelength bands, supporting applications in precision agriculture, environmental monitoring, and autonomous driving. We propose two novel supervised, embedded methods for task-specific HSI band selection.
arXiv Detail & Related papers (2024-01-21T07:48:39Z) - Cache & Distil: Optimising API Calls to Large Language Models [82.32065572907125]
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries.
To curtail the frequency of these calls, one can employ a smaller language model -- a student.
This student gradually gains proficiency in independently handling an increasing number of user requests.
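The routing idea sketched in this abstract (answer cheaply with the student when it is confident, otherwise pay for the teacher API and keep the pair as distillation data) can be illustrated as follows. The `toy_student` and `toy_teacher` components, the threshold, and all names are hypothetical stand-ins, not the paper's system:

```python
# Minimal sketch of student/teacher routing in the spirit of "Cache & Distil":
# serve from a cheap student model when confident, else fall back to the
# costly LLM API and log the pair for later fine-tuning of the student.

def route_query(query, student, call_llm_api, threshold=0.8, train_log=None):
    answer, confidence = student(query)
    if confidence >= threshold:
        return answer, "student"
    teacher_answer = call_llm_api(query)
    if train_log is not None:
        train_log.append((query, teacher_answer))  # future distillation data
    return teacher_answer, "teacher"

# Toy components for demonstration only.
def toy_student(query):
    cache = {"2+2": ("4", 0.99)}
    return cache.get(query, ("", 0.0))

def toy_teacher(query):
    return f"teacher-answer:{query}"

log = []
print(route_query("2+2", toy_student, toy_teacher, train_log=log))
print(route_query("capital of France?", toy_student, toy_teacher, train_log=log))
```

As the logged pairs are used to retrain the student, its confidence rises on more query types and the teacher-call frequency (and API cost) falls, which is the mechanism the abstract describes.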
arXiv Detail & Related papers (2023-10-20T15:01:55Z) - When to Use What: An In-Depth Comparative Empirical Analysis of OpenIE Systems for Downstream Applications [0.0]
We present an application-focused empirical survey of neural OpenIE models, training sets, and benchmarks.
We find that the different assumptions made by different models and datasets have a statistically significant effect on performance.
arXiv Detail & Related papers (2022-11-15T15:48:27Z) - Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models [116.25562358482962]
State-of-the-art neural language models can be used to solve ad-hoc language tasks without the need for supervised training.
The proposed PromptIDE allows users to experiment with prompt variations, visualize prompt performance, and iteratively optimize prompts.
arXiv Detail & Related papers (2022-08-16T17:17:53Z) - Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information above and is not responsible for any consequences of its use.