Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- URL: http://arxiv.org/abs/2502.09328v1
- Date: Thu, 13 Feb 2025 13:40:52 GMT
- Title: Copilot Arena: A Platform for Code LLM Evaluation in the Wild
- Authors: Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar,
- Abstract summary: Copilot Arena is a platform to collect user preferences for code generation through native integration into a developer's working environment.
Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements.
- Score: 44.33771124408514
- License:
- Abstract: Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We open-source Copilot Arena and release data to enable human-centric evaluations and improve understanding of coding assistants.
Related papers
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Contextual Augmented Multi-Model Programming (CAMP): A Hybrid Local-Cloud Copilot Framework [8.28588489551341]
This paper presents CAMP, a multi-model AI-assisted programming framework that consists of a local model that employs Retrieval-Augmented Generation (RAG)
RAG retrieves contextual information from the cloud model to facilitate context-aware prompt construction.
The methodology is actualized in Copilot for Xcode, an AI-assisted programming tool crafted for the Apple software ecosystem.
arXiv Detail & Related papers (2024-10-20T04:51:24Z) - K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences [30.744662265421788]
Arena platform, which gathers user votes on model comparisons, can rank models with human preferences.
We introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts.
In our experiments, K-Sort Arena exhibits 16.3x faster convergence compared to the widely used ELO algorithm.
arXiv Detail & Related papers (2024-08-26T17:58:20Z) - Generative AI for Pull Request Descriptions: Adoption, Impact, and
Developer Interventions [11.620351603683496]
GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs.
In this study, we examine 18,256 PRs in which parts of the descriptions were crafted by generative AI.
Our findings indicate that Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption.
arXiv Detail & Related papers (2024-02-14T06:20:57Z) - Octopus: Embodied Vision-Language Programmer from Environmental Feedback [58.04529328728999]
Embodied vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning.
To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation.
Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code.
arXiv Detail & Related papers (2023-10-12T17:59:58Z) - Efficient Adaptive Human-Object Interaction Detection with
Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM)
ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm.
Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z) - Learning General World Models in a Handful of Reward-Free Deployments [53.06205037827802]
Building generally capable agents is a grand challenge for deep reinforcement learning (RL)
We present CASCADE, a novel approach for self-supervised exploration in this new setting.
We show that CASCADE collects diverse task-agnostic datasets and learns agents that zero-shot to novel, unseen downstream tasks.
arXiv Detail & Related papers (2022-10-23T12:38:03Z) - GitHub Copilot AI pair programmer: Asset or Liability? [14.572381978575182]
We study the capabilities of Copilot in two different programming tasks.
We compare Copilot's proposed solutions with those of human programmers on a set of programming tasks.
The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems.
arXiv Detail & Related papers (2022-06-30T15:00:03Z) - Towards Optimal Strategies for Training Self-Driving Perception Models
in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z) - An Empirical Cybersecurity Evaluation of GitHub Copilot's Code
Contributions [8.285068188878578]
GitHub Copilot is a language model trained over open-source GitHub code.
Code often contains bugs - and so, it is certain that the language model will have learned from exploitable, buggy code.
This raises concerns on the security of Copilot's code contributions.
arXiv Detail & Related papers (2021-08-20T17:30:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.