Related papers: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

URL: http://arxiv.org/abs/2502.06994v1
Date: Mon, 10 Feb 2025 19:38:36 GMT
Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
Authors: Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji
Abstract summary: SyncMind is a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in software engineering. Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE.
Score: 74.04271300772155
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants -- whether humans or AI agents -- to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state -- what we term the out-of-sync challenge -- the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent <= 3.33% to Claude-3.5-Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.

Related papers

SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level. Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness. Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z)
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously. Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating. While ComfyAgent achieves a comparable resolve rate to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents [106.87436596397816]
Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. We propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. Experiments show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin.
arXiv Detail & Related papers (2024-08-13T17:50:28Z)
Multi-Agent Software Development through Cross-Team Collaboration [30.88149502999973]
We introduce Cross-Team Collaboration (CTC), a scalable multi-team framework for software development. CTC enables orchestrated teams to jointly propose various decisions and communicate with their insights. Results show a notable increase in quality compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-06-13T10:18:36Z)
Federated Contextual Cascading Bandits with Asynchronous Communication and Heterogeneous Users [95.77678166036561]
We propose a UCB-type algorithm with delicate communication protocols. We give sub-linear regret bounds on par with those achieved in the synchronous framework. Empirical evaluation on synthetic and real-world datasets validates our algorithm's superior performance in terms of regrets and communication costs.
arXiv Detail & Related papers (2024-02-26T05:31:14Z)
AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning [10.196574441542646]
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters. A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol. In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP.
arXiv Detail & Related papers (2021-04-16T20:49:28Z)
A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks [111.34055449929487]
We introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies. Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines.
arXiv Detail & Related papers (2020-07-09T17:59:57Z)
DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training [15.246142393381488]
We present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. We show that DS-Sync can achieve up to 94% improvements on the end-to-end training time with existing solutions while maintaining the same accuracy.
arXiv Detail & Related papers (2020-07-07T09:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
