OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software
- URL: http://arxiv.org/abs/2505.23239v1
- Date: Thu, 29 May 2025 08:40:10 GMT
- Title: OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software
- Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou
- Abstract summary: Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks.
By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation.
Our demonstration showcases OSS-UAgent's practical application in evaluating graph analytics platforms.
- Score: 47.02288620982592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software's usability. Additionally, our demonstration showcases OSS-UAgent's practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.
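To make the described pipeline concrete, below is a minimal sketch of how persona-conditioned generation followed by multi-dimensional scoring might be wired together. All identifiers here (EXPERIENCE_LEVELS, generate_code, the scorer stub) are hypothetical; the abstract does not specify the framework's actual prompts, knowledge-base construction, or metric definitions.

```python
# Minimal, hypothetical sketch of an OSS-UAgent-style evaluation loop.
from dataclasses import dataclass
from statistics import mean

# Assumed persona granularity; the abstract only names "Junior" and "Expert".
EXPERIENCE_LEVELS = ["junior", "intermediate", "senior", "expert"]

@dataclass
class UsabilityReport:
    level: str
    compliance: float    # adherence to the platform's API conventions
    correctness: float   # task-specific functional checks
    readability: float   # style and clarity of the generated code

    @property
    def overall(self) -> float:
        # Unweighted mean; the paper's aggregation scheme is unspecified.
        return mean([self.compliance, self.correctness, self.readability])

def generate_code(task: str, level: str, knowledge_base: dict[str, str]) -> str:
    """Stand-in for the LLM agent call: a real implementation would prompt an
    LLM with the task, a persona for `level`, and documentation retrieved
    from the platform-specific knowledge base."""
    return f"# {level}-level attempt at: {task}\n"

def score_stub(code: str) -> float:
    """Placeholder scorer; the framework automates per-dimension checks."""
    return 1.0 if code.strip() else 0.0

def evaluate(task: str, knowledge_base: dict[str, str]) -> list[UsabilityReport]:
    # One report per simulated developer experience level.
    reports = []
    for level in EXPERIENCE_LEVELS:
        code = generate_code(task, level, knowledge_base)
        reports.append(UsabilityReport(level,
                                       compliance=score_stub(code),
                                       correctness=score_stub(code),
                                       readability=score_stub(code)))
    return reports

if __name__ == "__main__":
    for r in evaluate("compute PageRank on a toy graph", {}):
        print(f"{r.level}: overall usability {r.overall:.2f}")
```

Per-level reports like these would then be aggregated into the overall usability measure the abstract describes; how OSS-UAgent actually weights the dimensions is not stated.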
Related papers
- SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [71.82719117238307]
We propose SEAgent, an agentic self-evolving framework enabling computer-use agents to evolve through interactions with unfamiliar software.
We validate the effectiveness of SEAgent across five novel software environments within OS-World.
Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA.
arXiv Detail & Related papers (2025-08-06T17:58:46Z)
- Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training [67.895981259683]
General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence.
Current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools.
We present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework.
arXiv Detail & Related papers (2025-08-01T08:11:31Z)
- State and Memory is All You Need for Robust and Reliable AI Agents [29.259008600842517]
Large language models (LLMs) have enabled powerful advances in natural language understanding and generation.
Yet their application to complex, real-world scientific tasks remains limited by challenges in memory, planning, and tool integration.
Here, we introduce SciBORG, a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution.
arXiv Detail & Related papers (2025-06-30T02:02:35Z)
- GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents [19.46051971038257]
GSO is a benchmark for evaluating language models' capabilities in developing high-performance software.
SWE-Agents struggle significantly, achieving a success rate of less than 5%, with limited improvements even with inference-time scaling.
We release the code and artifacts of our benchmark along with agent trajectories to enable future research.
arXiv Detail & Related papers (2025-05-29T17:14:55Z)
- MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [11.809732662992982]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate Large Language Model (LLM) performance in the Model Context Protocol (MCP) framework.
Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z)
- REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites.
We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions.
Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time computation instead of larger models.
Our framework incorporates two complementary strategies: internal TTC and external TTC.
We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Watson: A Cognitive Observability Framework for the Reasoning of LLM-Powered Agents [7.392058124132526]
Foundation models (FMs) play an increasingly prominent role in complex software systems, such as agentic software.
Fast-thinking Large Language Models (LLMs) are still preferred due to latency constraints.
We introduce Watson, a framework that provides reasoning observability into implicit reasoning processes.
arXiv Detail & Related papers (2024-11-05T19:13:22Z)
- SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [18.84439000902905]
Current large language model (LLM)-based software agents often follow linear, sequential processes.
We propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism (a generic sketch of the MCTS selection step appears after this list).
This highlights the potential of self-evaluation-driven search techniques in complex software engineering environments.
arXiv Detail & Related papers (2024-10-26T22:45:56Z)
- Self-Evolving Multi-Agent Collaboration Networks for Software Development [32.78667834175446]
We introduce EvoMAC, a novel self-evolving paradigm for MAC networks.
Inspired by traditional neural network training, EvoMAC obtains text-based environmental feedback.
We propose rSDE-Bench, a requirement-oriented software development benchmark.
arXiv Detail & Related papers (2024-10-22T12:20:23Z)
- Agent-Driven Automatic Software Improvement [55.2480439325792]
This research proposal aims to explore innovative solutions by focusing on the deployment of agents powered by Large Language Models (LLMs).
The iterative nature of agents, which allows for continuous learning and adaptation, can help surpass common challenges in code generation.
We aim to use the iterative feedback in these systems to further fine-tune the LLMs underlying the agents, making them better aligned with the task of automated software improvement.
arXiv Detail & Related papers (2024-06-24T15:45:22Z)
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers.
WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform.
BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
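For background on the search component named in the SWE-Search entry above, here is a minimal, generic sketch of the standard UCT selection rule used in MCTS. It is illustrative only and assumes nothing about SWE-Search's actual expansion, self-evaluation, or refinement steps; all identifiers are hypothetical.

```python
# Generic UCT (Upper Confidence bounds applied to Trees) selection step,
# the standard MCTS rule; not SWE-Search's implementation.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0  # cumulative reward from rollouts/evaluations
    children: list["Node"] = field(default_factory=list)

    def uct_score(self, parent_visits: int, c: float = 1.41) -> float:
        if self.visits == 0:
            return float("inf")  # always try unvisited children first
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(parent_visits) / self.visits)
        return exploit + explore

def select_child(parent: Node) -> Node:
    # Pick the child maximizing the UCT score.
    return max(parent.children, key=lambda ch: ch.uct_score(parent.visits))
```

The constant c trades off revisiting strong branches against sampling untried ones; self-evaluation-driven variants like the one the entry describes presumably replace the rollout reward with a model's own value estimate.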
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.