Related papers: Copilot Arena: A Platform for Code LLM Evaluation in the Wild

URL: http://arxiv.org/abs/2502.09328v1
Date: Thu, 13 Feb 2025 13:40:52 GMT
Title: Copilot Arena: A Platform for Code LLM Evaluation in the Wild
Authors: Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar
Abstract summary: Copilot Arena is a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements.
Score: 44.33771124408514
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.

Related papers

A Human Centric Requirements Engineering Framework for Assessing Github Copilot Output [0.0]
GitHub Copilot introduces new challenges in how these software tools address human needs. I analyzed GitHub Copilot's interaction with users through its chat interface. I established a human-centered requirements framework with clear metrics to evaluate these qualities.
arXiv Detail & Related papers (2025-08-05T21:33:23Z)
Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows [66.1850490474361]
We conduct the first academic study to explore developer interactions with coding agents. We evaluate two leading copilot and agentic coding assistants, GitHub Copilot and OpenHands. Our results show agents have the potential to assist developers in ways that surpass copilots.
arXiv Detail & Related papers (2025-07-10T20:12:54Z)
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software iteration by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy [5.985777189633703]
AI copilots represent a new generation of AI-powered systems designed to assist users in complex, context-rich tasks. Central to this personalization is preference optimization: the system's ability to detect, interpret, and align with individual user preferences. This survey examines how user preferences are operationalized in AI copilots.
arXiv Detail & Related papers (2025-05-28T02:52:39Z)
SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering [0.7511028207083381]
We introduce SE Arena, an interactive platform designed to evaluate foundation models (FMs) in software engineering activities. SE Arena provides a transparent, open-source leaderboard, supports multi-round conversational workflows, and enables end-to-end model comparisons. This paper outlines the design and capabilities of SE Arena, emphasizing its potential to advance the evaluation and practical application of FMs in software engineering.
arXiv Detail & Related papers (2025-02-03T22:19:28Z)
Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues [54.81155589931697]
Collaborative Instance Object Navigation (CoIN) is a new task setting where the agent actively resolves uncertainties about the target instance. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description. An Interaction Trigger module determines whether to ask a question to the human, continue or halt navigation.
arXiv Detail & Related papers (2024-12-02T08:16:38Z)
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
Contextual Augmented Multi-Model Programming (CAMP): A Hybrid Local-Cloud Copilot Framework [8.28588489551341]
This paper presents CAMP, a multi-model AI-assisted programming framework that consists of a local model that employs Retrieval-Augmented Generation (RAG). RAG retrieves contextual information from the cloud model to facilitate context-aware prompt construction. The methodology is actualized in Copilot for Xcode, an AI-assisted programming tool crafted for the Apple software ecosystem.
arXiv Detail & Related papers (2024-10-20T04:51:24Z)
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences [30.744662265421788]
The Arena platform, which gathers user votes on model comparisons, can rank models with human preferences. We introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts. In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used ELO algorithm.
arXiv Detail & Related papers (2024-08-26T17:58:20Z)
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
Octopus: Embodied Vision-Language Programmer from Environmental Feedback [58.04529328728999]
Embodied vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code.
arXiv Detail & Related papers (2023-10-12T17:59:58Z)
Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
Learning General World Models in a Handful of Reward-Free Deployments [53.06205037827802]
Building generally capable agents is a grand challenge for deep reinforcement learning (RL). We present CASCADE, a novel approach for self-supervised exploration in this new setting. We show that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks.
arXiv Detail & Related papers (2022-10-23T12:38:03Z)
GitHub Copilot AI pair programmer: Asset or Liability? [14.572381978575182]
We study the capabilities of Copilot in two different programming tasks. We compare Copilot's proposed solutions with those of human programmers on a set of programming tasks. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems.
arXiv Detail & Related papers (2022-06-30T15:00:03Z)
Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone. Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator. We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions [8.285068188878578]
GitHub Copilot is a language model trained over open-source GitHub code. Code often contains bugs - and so, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot's code contributions.
arXiv Detail & Related papers (2021-08-20T17:30:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.