Related papers: OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

URL: http://arxiv.org/abs/2505.04606v1
Date: Wed, 07 May 2025 17:51:10 GMT
Title: OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Authors: Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng,
Abstract summary: GitHub issue resolution task aims to resolve issues reported in repositories automatically.<n>With advances in large language models (LLMs), this task has gained increasing attention.<n>We propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain.
Score: 34.087547492498224
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks are proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs show limited performances on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. Besides, we find that current LLMs struggle to resolve issues requiring understanding images. The best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs' failure on OmniGIRL, providing insights for future improvements.

Related papers

A Benchmark for Localizing Code and Non-Code Issues in Software Projects [26.511673758202267]
We introduce MULocBench, a dataset of 1,100 issues from 46 popular GitHub Python projects.<n>Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types.<n>Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies.
arXiv Detail & Related papers (2025-09-26T06:05:20Z)
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs)<n>Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios.<n>This dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning.<n>We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.<n>It is constructed using a dataset curated from 30 well-known GitHub repositories.<n>We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC) The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
Evaluating LLMs with Multiple Problems at once [9.173325772800341]
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once.<n>We introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts.<n>Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions this multiple problem handling capability falls short.
arXiv Detail & Related papers (2024-06-16T02:52:32Z)
MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution [47.850418420195304]
Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving GitHub issues. We propose a novel Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution.
arXiv Detail & Related papers (2024-03-26T17:57:57Z)
PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models [3.518832148294879]
How to evaluate the question-solving abilities of large language models like ChatGPT is a hot-spot but challenging issue. We propose AGIBench -- a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs.
arXiv Detail & Related papers (2023-09-05T13:43:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.