From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
- URL: http://arxiv.org/abs/2510.27452v1
- Date: Fri, 31 Oct 2025 13:00:49 GMT
- Title: From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
- Authors: Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang,
- Abstract summary: VisPainter is a multi-agent framework for scientific illustration built upon the model context protocol. It orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. The accompanying VisBench benchmark assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost.
- Score: 38.72208780072352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: First, image generation models output rasterized images that lack semantic structure, making it impossible to access, edit, or rearrange independent visual components within the image. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome "write-compile-review" cycle and lack intuitive manipulation. Neither approach meets the needs of scientific creation for efficiency, intuitiveness, and iterative modification. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design represents each element explicitly, enabling true element-level control: any element can be added or modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. We conducted extensive ablation experiments to verify the soundness of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.
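To make the Manager / Designer / Toolbox split concrete, below is a minimal sketch of how such a role-based pipeline could keep every diagram element individually addressable as an SVG node. All class and method names here are assumptions made for illustration only; the abstract does not specify the paper's actual MCP-based interfaces.

```python
# Hypothetical sketch of the Manager / Designer / Toolbox roles described in
# the abstract. Every name below is an assumption for illustration; the
# paper's real interfaces are not given in the abstract.
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Element:
    """One editable illustration element, kept as an explicit SVG node."""
    eid: str
    svg: ET.Element

class Toolbox:
    """Primitive drawing operations; every call yields a named SVG element."""
    def rect(self, eid: str, x: float, y: float, w: float, h: float) -> Element:
        node = ET.Element("rect", {"id": eid, "x": str(x), "y": str(y),
                                   "width": str(w), "height": str(h),
                                   "fill": "none", "stroke": "black"})
        return Element(eid, node)

    def label(self, eid: str, x: float, y: float, text: str) -> Element:
        node = ET.Element("text", {"id": eid, "x": str(x), "y": str(y)})
        node.text = text
        return Element(eid, node)

class Designer:
    """Turns a high-level request into Toolbox calls (an LLM role in the paper)."""
    def __init__(self, toolbox: Toolbox):
        self.toolbox = toolbox

    def draw_block(self, eid: str, x: float, y: float, caption: str) -> List[Element]:
        return [self.toolbox.rect(f"{eid}-box", x, y, 120, 40),
                self.toolbox.label(f"{eid}-txt", x + 10, y + 25, caption)]

class Manager:
    """Keeps a registry of elements so any one can be edited or replaced later."""
    def __init__(self) -> None:
        self.elements: Dict[str, Element] = {}

    def add(self, elems: List[Element]) -> None:
        for e in elems:
            self.elements[e.eid] = e  # element-level handle, not raster pixels

    def to_svg(self) -> str:
        root = ET.Element("svg", {"xmlns": "http://www.w3.org/2000/svg",
                                  "width": "320", "height": "100"})
        for e in self.elements.values():
            root.append(e.svg)
        return ET.tostring(root, encoding="unicode")

manager, designer = Manager(), Designer(Toolbox())
manager.add(designer.draw_block("encoder", 20, 30, "Encoder"))
manager.add(designer.draw_block("decoder", 180, 30, "Decoder"))
print(manager.to_svg())  # editable in standard vector graphics software
```

The design point this sketch tries to capture is the one the abstract emphasizes: the Manager holds a registry of named elements rather than flattened pixels, so a later edit can target one element by id without regenerating the whole figure.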
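On the evaluation side, here is a tiny sketch of how the four VisBench aspects named in the abstract might roll up into a single score. The example values, the 0-10 scale, and the equal weighting are all assumptions; the abstract does not state how the seven metrics map onto the four aspects.

```python
# Hypothetical roll-up of VisBench's four aspects into one score.
# Aspect names come from the abstract; scores, scale, and equal
# weighting are invented for illustration.
from statistics import mean

aspect_scores = {
    "content": 8.2,             # hypothetical 0-10 judge scores
    "layout": 7.5,
    "visual perception": 8.0,
    "interaction cost": 6.9,    # lower interaction cost would score higher
}
print(f"overall: {mean(aspect_scores.values()):.2f}")
```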
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - Factuality Matters: When Image Generation and Editing Meet Structured Visuals [46.627460447235855]
We construct a large-scale dataset of 1.3 million high-quality structured image pairs. We train a unified model that integrates a VLM with FLUX.1 Kontext. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation.
arXiv Detail & Related papers (2025-10-06T17:56:55Z) - EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
arXiv Detail & Related papers (2025-09-16T17:45:39Z) - What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between an original and an edited image. It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation to human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - Adaptively-Realistic Image Generation from Stroke and Sketch with Diffusion Model [31.652827838300915]
We propose a unified framework, based on diffusion models, supporting three-dimensional control over image synthesis from sketches and strokes.
Our framework achieves state-of-the-art performance while providing flexibility in generating customized images with control over shape, color, and realism.
Our method unleashes applications such as editing on real images, generation with partial sketches and strokes, and multi-domain multi-modal synthesis.
arXiv Detail & Related papers (2022-08-26T13:59:26Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.