Related papers: SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

URL: http://arxiv.org/abs/2408.14354v1
Date: Mon, 26 Aug 2024 15:30:05 GMT
Title: SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Authors: Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang
Abstract summary: SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs). As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it.
Score: 27.226354754864783
License: http://creativecommons.org/licenses/by/4.0/
Abstract: GitHub issue resolving is a critical task in software engineering that has recently gained significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate the issue-resolving capabilities of large language models (LLMs), but it has so far focused only on Python. However, supporting more programming languages is also important, as there is strong demand in industry. As a first step toward multilingual support, we have developed a Java version of SWE-bench, called SWE-bench-java. We have publicly released the dataset, along with the corresponding Docker-based evaluation environment and leaderboard, which will be continuously maintained and updated in the coming months. To verify the reliability of SWE-bench-java, we implement a classic method, SWE-agent, and test several powerful LLMs on it. As is well known, developing a high-quality multilingual benchmark is time-consuming and labor-intensive, so we welcome contributions through pull requests or collaboration to accelerate its iteration and refinement, paving the way for fully automated programming.

Related papers

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [49.73885480071402]
We introduce SWE-PolyBench, a new benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code. Our experiments show that current agents perform unevenly across languages and struggle with complex problems while showing higher performance on simpler tasks.
arXiv Detail & Related papers (2025-04-11T17:08:02Z)
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [25.97486916095315]
We introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods. We launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets.
arXiv Detail & Related papers (2025-04-03T14:06:17Z)
EnvBench: A Benchmark for Automated Environment Setup [76.02998475135824]
Large Language Models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
arXiv Detail & Related papers (2025-03-18T17:19:12Z)
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? [64.34184587727334]
We propose SWE-bench Multimodal to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization.
arXiv Detail & Related papers (2024-10-04T18:48:58Z)
A Multi-objective Optimization Benchmark Test Suite for Real-time Semantic Segmentation [22.707825213534125]
Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). We introduce a tailored streamline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. We present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset.
arXiv Detail & Related papers (2024-04-25T00:30:03Z)
AutoCodeRover: Autonomous Program Improvement [8.66280420062806]
We propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. Experiments on SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported SWE-agent.
arXiv Detail & Related papers (2024-04-08T11:55:09Z)
DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
JaxMARL: Multi-Agent RL Environments and Algorithms in JAX [105.343918678781]
We present JaxMARL, the first open-source, Python-based library that combines GPU-enabled efficiency with support for a large number of commonly used MARL environments. Our experiments show that, in terms of wall clock time, our JAX-based training pipeline is around 14 times faster than existing approaches. We also introduce and benchmark SMAX, a JAX-based approximate reimplementation of the popular StarCraft Multi-Agent Challenge.
arXiv Detail & Related papers (2023-11-16T18:58:43Z)
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [80.52201658231895]
SWE-bench is an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. We show that both state-of-the-art proprietary models and our fine-tuned model SWE-Llama can resolve only the simplest issues.
arXiv Detail & Related papers (2023-10-10T16:47:29Z)
ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet [95.39817519115394]
ESPnet-SLU is a project inside the end-to-end speech processing toolkit ESPnet. It is designed for quick development of spoken language understanding in a single framework.
arXiv Detail & Related papers (2021-11-29T17:05:49Z)
MOROCCO: Model Resource Comparison Framework [61.444083353087294]
We present MOROCCO, a framework to compare language models compatible with the jiant environment, which supports over 50 NLU tasks. We demonstrate its applicability for two GLUE-like suites in different languages.
arXiv Detail & Related papers (2021-04-29T13:01:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.