MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
- URL: http://arxiv.org/abs/2404.09486v2
- Date: Thu, 26 Sep 2024 09:31:48 GMT
- Title: MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
- Authors: Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, Jing Ma,
- Abstract summary: MMCode is the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts.
MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges.
- Score: 9.56366641717606
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there is little work on investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, presenting significant challenges due to the extreme demand for reasoning abilities. Our experiment results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future works in this domain. The data and code are publicly available at https://github.com/likaixin2000/MMCode.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks [25.959032350818795]
HumanEval-V is a benchmark designed to evaluate Large Language Models' visual understanding and reasoning capabilities through code generation.
HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow.
We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation [0.24578723416255752]
We evaluate five different large language models (LLMs) concerning their capabilities for text-to-code generation.
ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama.
arXiv Detail & Related papers (2024-09-06T10:03:49Z) - Large Language Models for Code Summarization [0.0]
We review how Large Language Models perform in code explanation/summarization.
We also investigate their code generation capabilities based on natural language descriptions.
arXiv Detail & Related papers (2024-05-29T12:18:51Z) - A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence.
It covers over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works.
Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z) - Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by
Imitating Human Thought Processes [6.512667145063511]
We propose a novel approach, named Brain, to imitate human thought processes to enhance mathematical reasoning abilities.
First, we achieve SOTA performance in comparison with Code LLaMA 7B based models through this method.
Secondly, we find that plans can be explicitly extracted from natural language, code, or formal language.
arXiv Detail & Related papers (2024-02-23T17:40:31Z) - MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z) - Predicting Defective Visual Code Changes in a Multi-Language AAA Video
Game Project [54.20154707138088]
We focus on constructing visual code defect prediction models that encompass visual code metrics.
We test our models using features extracted from the historical agnostic of a AAA video game project.
We find that defect prediction models have better performance overall in terms of the area under the ROC curve.
arXiv Detail & Related papers (2023-09-07T00:18:43Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.