Related papers: LLM Code Customization with Visual Results: A Benchmark on TikZ

URL: http://arxiv.org/abs/2505.04670v2
Date: Wed, 04 Jun 2025 12:57:19 GMT
Title: LLM Code Customization with Visual Results: A Benchmark on TikZ
Authors: Charly Reux, Mathieu Acher, Djamel Eddine Khelladi, Olivier Barais, Clément Quinton
Abstract summary: We introduce vTikZ, the first benchmark to evaluate the ability of Large Language Models to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness.
Score: 6.3303908500560615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rise of AI-based code generation, customizing existing code out of natural language instructions to modify visual results, such as figures or images, has become possible, promising to reduce the need for deep programming expertise. However, even experienced developers can struggle with this task, as it requires identifying relevant code regions (feature location), generating valid code variants, and ensuring the modifications reliably align with user intent. In this paper, we introduce vTikZ, the first benchmark designed to evaluate the ability of Large Language Models (LLMs) to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness. Empirical evaluation with state-of-the-art LLMs shows that existing solutions struggle to reliably modify code in alignment with visual intent, highlighting a gap in current AI-assisted code editing approaches. We argue that vTikZ opens new research directions for integrating LLMs with visual feedback mechanisms to improve code customization tasks in various domains beyond TikZ, including image processing, art creation, Web design, and 3D modeling.

Related papers

MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements [0.36832029288386137]
This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) into software development. A static code analysis framework detects issues such as bugs, vulnerabilities, and code smells within a large-scale software project. Retrieval-augmented generation (RAG) is implemented to enhance the relevance and precision of the revisions.
arXiv Detail & Related papers (2025-06-12T03:39:25Z)
CodeVision: Detecting LLM-Generated Code Using 2D Token Probability Maps and Vision Models [28.711745671275477]
The rise of large language models (LLMs) has significantly improved automated code generation, enhancing software development efficiency. Existing detection methods, such as pre-trained models and watermarking, face limitations in adaptability and computational efficiency. We propose a novel detection method using 2D token probability maps combined with vision models, preserving spatial code structures.
arXiv Detail & Related papers (2025-01-06T06:15:10Z)
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [27.578516354454063]
Editing Vision-Language Model (EVLM) is designed to interpret ambiguous instructions in conjunction with reference visuals. EVLM captures subjective editing preferences without requiring binary labels. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent, high-quality instructions.
arXiv Detail & Related papers (2024-12-13T21:15:01Z)
ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges [20.316852491762788]
We propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education.
arXiv Detail & Related papers (2024-11-28T05:51:45Z)
Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models [49.387195629660994]
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks.
arXiv Detail & Related papers (2024-04-04T15:49:49Z)
Predicting Defective Visual Code Changes in a Multi-Language AAA Video Game Project [54.20154707138088]
We focus on constructing visual code defect prediction models that encompass visual code metrics. We test our models using features extracted from the historical agnostic of a AAA video game project. We find that defect prediction models have better performance overall in terms of the area under the ROC curve.
arXiv Detail & Related papers (2023-09-07T00:18:43Z)
Identifying Defect-Inducing Changes in Visual Code [54.20154707138088]
"SZZ Visual Code" (SZZ-VC) is an algorithm that finds changes in visual code based on the differences of graphical elements rather than differences of lines to detect defect-inducing changes. We validated the algorithm for an industry-made AAA video game and 20 music visual programming defects across 12 open source projects.
arXiv Detail & Related papers (2023-09-07T00:12:28Z)
Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation. We propose a novel Visually-Augmented fine-tuning approach. Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z)
Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. We investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP). We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives.
arXiv Detail & Related papers (2022-07-06T17:02:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.