MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
- URL: http://arxiv.org/abs/2404.09486v2
- Date: Thu, 26 Sep 2024 09:31:48 GMT
- Title: MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
- Authors: Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, Jing Ma
- Abstract summary: MMCode is the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts.
MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges.
- Score: 9.56366641717606
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there is little work on investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, presenting significant challenges due to the extreme demand for reasoning abilities. Our experiment results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future works in this domain. The data and code are publicly available at https://github.com/likaixin2000/MMCode.
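Concretely, evaluating a model on a problem of this kind means pairing the textual statement with its figures before the model call. Below is a minimal sketch assuming a hypothetical JSON record with question and images fields (MMCode's actual schema may differ); OpenAI-style message content parts and the file name are used purely for illustration.

```python
import base64
import json

def build_prompt(problem: dict) -> list[dict]:
    """Interleave the problem statement with its figures as message content parts."""
    content = [{"type": "text", "text": problem["question"]}]
    for path in problem["images"]:
        with open(path, "rb") as img:
            b64 = base64.b64encode(img.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return content

# `content` would be sent to a multimodal model, and the returned program run
# against the problem's hidden test cases.
with open("mmcode_problem.json") as f:   # hypothetical file name
    content = build_prompt(json.load(f))
```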
Related papers
- Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities [3.196398766265106]
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs).
It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart.
Experimental results demonstrate that there is a large performance difference between proprietary and open-source models.
arXiv Detail & Related papers (2025-02-17T14:25:45Z)
- WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models [67.15146980023621]
We propose WarriorCoder, a novel paradigm that learns from expert battles to address the limitations of current approaches.
We create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges.
This competitive framework generates novel training data from scratch, leveraging the strengths of all participants.
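A minimal sketch of one round of this battle-and-judge loop, under the assumption that each expert and the judge are exposed as simple text-in/text-out callables (the paper's actual prompting and judging setup may differ):

```python
from typing import Callable

def collect_battle_example(prompt: str,
                           expert_a: Callable[[str], str],
                           expert_b: Callable[[str], str],
                           judge: Callable[[str], str]) -> dict:
    """One arena round: both experts answer, an impartial judge picks a winner."""
    answer_a = expert_a(prompt)
    answer_b = expert_b(prompt)
    verdict = judge(
        f"Question:\n{prompt}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    winner = answer_a if verdict.strip().upper().startswith("A") else answer_b
    return {"instruction": prompt, "output": winner}  # one new fine-tuning record
```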
arXiv Detail & Related papers (2024-12-23T08:47:42Z)
- MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents.
Existing benchmarks mostly assess their reasoning abilities in the language modality only.
MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z)
- ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges [20.316852491762788]
We propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs.
ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education.
arXiv Detail & Related papers (2024-11-28T05:51:45Z)
- Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation [0.24578723416255752]
We evaluate five different large language models (LLMs) with respect to their text-to-code generation capabilities.
ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama.
arXiv Detail & Related papers (2024-09-06T10:03:49Z)
- Large Language Models for Code Summarization [0.0]
We review how Large Language Models perform in code explanation/summarization.
We also investigate their code generation capabilities based on natural language descriptions.
arXiv Detail & Related papers (2024-05-29T12:18:51Z)
- Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes [6.512667145063511]
We propose a novel approach, named Brain, that imitates human thought processes to enhance mathematical reasoning abilities.
First, this method achieves SOTA performance in comparison with Code LLaMA 7B-based models.
Secondly, we find that plans can be explicitly extracted from natural language, code, or formal language.
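A minimal two-stage prompting sketch in this spirit, assuming a generic text-in/text-out `ask` callable rather than the paper's actual pipeline:

```python
from typing import Callable

def solve_with_plan(problem: str, ask: Callable[[str], str]) -> str:
    """Stage 1: elicit an explicit plan; stage 2: generate code that follows it."""
    plan = ask(f"Outline a step-by-step plan (no code) for solving:\n{problem}")
    return ask(
        f"Problem:\n{problem}\n\nPlan:\n{plan}\n\n"
        "Write Python code that follows this plan exactly."
    )
```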
arXiv Detail & Related papers (2024-02-23T17:40:31Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
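As a rough illustration of such token compression (a generic learned-query resampler, not MouSi's actual fusion network), a small cross-attention module can map a 4096-token expert output down to a fixed budget of 64 fused tokens:

```python
import torch
import torch.nn as nn

class TokenResampler(nn.Module):
    """Compress a long visual-expert token sequence into a few learned queries."""
    def __init__(self, dim: int = 256, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, expert_tokens: torch.Tensor) -> torch.Tensor:
        # expert_tokens: (batch, 4096, dim) -> fused: (batch, 64, dim)
        batch = expert_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.attn(q, expert_tokens, expert_tokens)
        return fused

tokens = torch.randn(2, 4096, 256)     # e.g., a SAM-like encoder's output
print(TokenResampler()(tokens).shape)  # torch.Size([2, 64, 256])
```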
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
- Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project [54.20154707138088]
We focus on constructing visual code defect prediction models that encompass visual code metrics.
We test our models using features extracted from the historical data of a AAA video game project.
We find that the defect prediction models achieve better overall performance in terms of the area under the ROC curve (AUC).
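The evaluation metric itself is standard; a minimal sketch with synthetic stand-ins for visual code metrics and defect labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder for (visual code metrics, defect label) data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]   # probability of "defective"
print(f"AUC: {roc_auc_score(y_te, probs):.3f}")
```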
arXiv Detail & Related papers (2023-09-07T00:18:43Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
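A minimal sketch of the contrastive ingredient, an InfoNCE-style loss over matched query/code embedding pairs (a generic formulation, not the paper's exact training recipe):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    # Cosine-similarity logits between every query and every code snippet.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature
    # The i-th query matches the i-th code snippet; all others are negatives.
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```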
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
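That metric, the fraction of test cases a generated stdin/stdout program passes, can be sketched as follows (test-case field names are illustrative, not the APPS schema):

```python
import subprocess

def test_case_pass_rate(code: str, cases: list[dict]) -> float:
    """Run one generated program on each test case and compare its stdout."""
    passed = 0
    for case in cases:
        try:
            out = subprocess.run(
                ["python", "-c", code],
                input=case["input"], capture_output=True, text=True, timeout=4,
            ).stdout
            passed += out.strip() == case["output"].strip()
        except subprocess.TimeoutExpired:
            pass  # a hung solution counts as a failure
    return passed / len(cases)
```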
arXiv Detail & Related papers (2021-05-20T17:58:42Z)