MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems
- URL: http://arxiv.org/abs/2404.09486v1
- Date: Mon, 15 Apr 2024 06:15:46 GMT
- Title: MMCode: Evaluating Multi-Modal Code Large Language Models with Visually Rich Programming Problems
- Authors: Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Jing Ma
- Abstract summary: MMCode is the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts.
MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges.
- Score: 9.155143207283295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there is little work on investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multi-modal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, presenting significant challenges due to the extreme demand for reasoning abilities. Our experiment results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as an inspiration for future works in this domain. The data and code are publicly available at https://github.com/happylkx/MMCode.
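For concreteness, below is a minimal sketch of how an MMCode-style evaluation loop might look. The problem schema ("question", "images", "test_cases") and the `query_lmm` client are illustrative assumptions, not the paper's actual harness; see the linked repository for the real data format and evaluation code.

```python
"""Minimal sketch of an MMCode-style evaluation loop.

Assumptions (not taken from the paper): a problem is a dict with
"question", "images" (file paths), and "test_cases" ({"input", "output"}
pairs), and `query_lmm` is a hypothetical multimodal-model client.
"""
import re
import subprocess
import sys


def query_lmm(prompt: str, image_paths: list[str]) -> str:
    """Hypothetical client for a large multimodal model."""
    raise NotImplementedError("plug in your model API here")


def extract_code(response: str) -> str:
    """Pull the first fenced Python block out of a model response."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response


def passes_tests(code: str, test_cases: list[dict], timeout: float = 5.0) -> bool:
    """Run candidate code on each stdin/stdout test case."""
    for case in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=case["input"], capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != case["output"].strip():
            return False
    return True


def evaluate(problem: dict) -> bool:
    prompt = (
        "Solve the following programming problem in Python. "
        "The referenced figures are attached as images.\n\n" + problem["question"]
    )
    response = query_lmm(prompt, problem["images"])
    return passes_tests(extract_code(response), problem["test_cases"])
```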
Related papers
- Visual Haystacks: Answering Harder Questions About Sets of Images [63.296342841358815]
This paper explores the task of Multi-Image Visual Question Answering (MIQA).
Given a large set of images and a natural language query, the task is to generate a relevant and grounded response.
We introduce MIRAGE, a novel retrieval/QA framework tailored for Large Multimodal Models (LMMs).
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [28.295926947968574]
Large language models (LLMs) have brought a paradigm shift to the field of code generation.
We empirically analyze the differences in coding style between the code generated by Code LLMs and the code written by human developers.
arXiv Detail & Related papers (2024-06-29T14:56:11Z)
- Large Language Models for Code Summarization [0.0]
We review how Large Language Models perform in code explanation/summarization.
We also investigate their code generation capabilities based on natural language descriptions.
arXiv Detail & Related papers (2024-05-29T12:18:51Z)
- Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
arXiv Detail & Related papers (2024-03-12T04:13:45Z)
- Brain-Inspired Two-Stage Approach: Enhancing Mathematical Reasoning by Imitating Human Thought Processes [6.512667145063511]
We propose a novel approach, named Brain, that imitates human thought processes to enhance mathematical reasoning abilities.
First, we achieve SOTA performance compared with Code LLaMA 7B-based models using this method.
Second, we find that plans can be explicitly extracted from natural language, code, or formal language.
arXiv Detail & Related papers (2024-02-23T17:40:31Z)
- MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes an ensemble-of-experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from the different visual experts; a minimal sketch of this fusion step follows the entry.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 down to a more manageable 64, or even 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z)
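The token-compression step described in the MouSi entry above can be pictured as a small set of learned query tokens cross-attending to the concatenated expert features. The PyTorch sketch below is illustrative only; module names and dimensions are assumptions, not the paper's architecture.

```python
# Illustrative sketch of fusing multiple visual experts' outputs into a
# small number of tokens, in the spirit of the MouSi entry above. The
# module names and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn


class ExpertFusion(nn.Module):
    def __init__(self, expert_dims: list[int], d_model: int = 1024,
                 num_fused_tokens: int = 64, num_heads: int = 8):
        super().__init__()
        # Project each expert's feature dimension into a shared space.
        self.projections = nn.ModuleList(
            nn.Linear(d, d_model) for d in expert_dims
        )
        # Learned queries: their count fixes the "positional occupancy"
        # handed to the language model (e.g. 64, or even 1).
        self.queries = nn.Parameter(torch.randn(num_fused_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, expert_outputs: list[torch.Tensor]) -> torch.Tensor:
        # expert_outputs[i]: (batch, tokens_i, expert_dims[i])
        shared = torch.cat(
            [proj(x) for proj, x in zip(self.projections, expert_outputs)],
            dim=1,
        )  # (batch, sum(tokens_i), d_model)
        queries = self.queries.expand(shared.shape[0], -1, -1)
        fused, _ = self.attn(queries, shared, shared)
        return fused  # (batch, num_fused_tokens, d_model)


# Example: a SAM-like expert emitting 4096 tokens and a CLIP-like expert
# emitting 256 tokens are condensed to 64 tokens for the language model.
fusion = ExpertFusion(expert_dims=[256, 1024], num_fused_tokens=64)
out = fusion([torch.randn(2, 4096, 256), torch.randn(2, 256, 1024)])
print(out.shape)  # torch.Size([2, 64, 1024])
```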
- MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming Tasks [60.54009036297301]
We introduce a pioneering framework for MoT instruction tuning, designed to promote the decomposition of tasks into logical sub-tasks and sub-modules.
Our investigations reveal that, through the cultivation and utilization of sub-modules, MoTCoder significantly improves both the modularity and correctness of the generated solutions.
arXiv Detail & Related papers (2023-12-26T08:49:57Z)
- Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project [54.20154707138088]
We focus on constructing visual code defect prediction models that encompass visual code metrics.
We test our models using features extracted from the historical codebase of a AAA video game project.
We find that defect prediction models that include visual code metrics perform better overall in terms of the area under the ROC curve.
arXiv Detail & Related papers (2023-09-07T00:18:43Z)
- Techniques to Improve Neural Math Word Problem Solvers [0.0]
Recent neural-based approaches mainly encode the problem text using a language model and decode a mathematical expression over quantities and operators iteratively.
We propose a new encoder-decoder architecture that fully leverages the question text and preserves step-wise commutative law.
Experiments on four established benchmarks demonstrate that our framework outperforms state-of-the-art neural MWP solvers.
arXiv Detail & Related papers (2023-02-06T22:41:51Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems; a minimal sketch of this per-problem pass-rate metric follows the entry.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
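The pass-rate metric referenced in the APPS entry above can be sketched as follows, assuming stdin/stdout-style problems; this is a rough illustration, not APPS's actual evaluation harness.

```python
# Minimal sketch of the "fraction of test cases passed" metric referred to
# in the APPS entry above (stdin/stdout style; not APPS's actual harness).
import subprocess
import sys


def test_case_pass_rate(code: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of (stdin, expected stdout) cases the candidate code passes."""
    passed = 0
    for stdin, expected in cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code], input=stdin,
                capture_output=True, text=True, timeout=5,
            )
            passed += result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            pass  # a timed-out case simply counts as failed
    return passed / len(cases)


# Example: a correct echo-sum program passes both cases (rate 1.0).
code = "a, b = map(int, input().split()); print(a + b)"
print(test_case_pass_rate(code, [("1 2", "3"), ("10 5", "15")]))
```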
- A Sketch-Based Neural Model for Generating Commit Messages from Diffs [0.5239589676872304]
Commit messages have an important impact on software development, especially when working in large teams.
We apply neural machine translation (NMT) techniques to convert code diffs into commit messages.
We present an improved sketch-based encoder for this task; a rough illustration of the general diff-to-message framing follows the entry.
arXiv Detail & Related papers (2021-04-08T21:21:28Z)
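The diff-to-message task in the entry above is a sequence-to-sequence problem. As a rough illustration of that framing (using an off-the-shelf pretrained seq2seq model rather than the paper's sketch-based encoder, and without the fine-tuning on diff/message pairs a real system would need):

```python
# Rough illustration of the diff-to-commit-message framing from the entry
# above: the diff is the source sequence and the message is decoded from it.
# Uses an off-the-shelf seq2seq model (t5-small), not the paper's encoder;
# a real system would fine-tune on diff/message pairs first.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

diff = """\
--- a/utils.py
+++ b/utils.py
@@ -1,2 +1,2 @@
 def add(a, b):
-    return a - b
+    return a + b
"""

# Treat the diff as the source sequence and decode a candidate message.
inputs = tokenizer("summarize: " + diff, return_tensors="pt", truncation=True)
ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```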