1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World
- URL: http://arxiv.org/abs/2602.18548v1
- Date: Fri, 20 Feb 2026 17:46:51 GMT
- Title: 1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World
- Authors: Qiao Xu, Yipeng Yu, Chengxiao Feng, Xu Liu
- Abstract summary: We introduce 1D-Bench, a benchmark grounded in real e-commerce, where each instance provides a reference rendering and an exported intermediate representation. 1D is short for one day, representing the efficient completion of design-to-code tasks in less than one day.
- Score: 5.904589000032003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Design-to-code translates high-fidelity UI designs into executable front-end implementations, but progress remains hard to compare due to inconsistent datasets, toolchains, and evaluation protocols. We introduce 1D-Bench, a benchmark grounded in real e-commerce workflows, where each instance provides a reference rendering and an exported intermediate representation that may contain extraction errors. 1D is short for one day, representing the efficient completion of design-to-code tasks in less than one day. Models take both as input, using the intermediate representation as structural cues while being evaluated against the reference rendering, which tests robustness to intermediate representation defects rather than literal adherence. 1D-Bench requires generating an executable React codebase under a fixed toolchain with an explicit component hierarchy, and defines a multi-round setting in which models iteratively apply component-level edits using execution feedback. Experiments on commercial and open-weight multimodal models show that iterative editing generally improves final performance by increasing rendering success and often improving visual similarity. We further conduct a pilot study on post-training with synthetic repair trajectories and reinforcement learning based editing, and observe limited and unstable gains that may stem from sparse terminal rewards and high-variance file-level updates.
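The multi-round setting described in the abstract can be sketched as a feedback loop: render the current codebase, compare against the reference rendering, and apply component-level edits until the output is close enough or the round budget runs out. The sketch below is purely illustrative; the function names, fields, and similarity scoring are assumptions, not 1D-Bench's actual interface.

```python
# Hypothetical sketch of an iterative design-to-code editing loop.
# All names and the similarity model here are illustrative assumptions,
# not the benchmark's real API.

from dataclasses import dataclass


@dataclass
class EditRound:
    round_idx: int
    rendered_ok: bool      # did the codebase render successfully this round?
    similarity: float      # visual similarity vs. the reference rendering


def run_iterative_editing(initial_similarity: float,
                          gain_per_round: float,
                          max_rounds: int = 3,
                          target: float = 0.9) -> list[EditRound]:
    """Simulate rounds of component-level edits driven by execution feedback."""
    history: list[EditRound] = []
    similarity = initial_similarity
    for i in range(max_rounds):
        rendered_ok = similarity > 0.0  # stand-in for a real render check
        history.append(EditRound(i, rendered_ok, round(similarity, 3)))
        if similarity >= target:
            break  # close enough to the reference rendering; stop editing
        # Stand-in for one component-level edit informed by feedback.
        similarity = min(1.0, similarity + gain_per_round)
    return history
```

In this toy model, each round monotonically improves similarity, which mirrors the paper's observation that iterative editing generally improves final performance; a real harness would replace the stand-ins with an actual renderer and a pixel- or perception-based similarity metric.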
Related papers
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models [4.257440824082894]
We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context.
arXiv Detail & Related papers (2026-02-17T17:45:34Z)
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
- UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits [43.59555184340113]
We introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks.
arXiv Detail & Related papers (2025-12-01T17:45:44Z)
- EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
arXiv Detail & Related papers (2025-09-16T17:45:39Z)
- HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search [4.807104001943257]
Hierarchical Evolutionary Agent Simulation (HEAS) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation. HEAS represents models as hierarchies of lightweight processes ("streams") scheduled in deterministic layers that read and write a shared context. A compact API and CLI (simulate, optimize, evaluate) expose single- and multi-objective evolution.
arXiv Detail & Related papers (2025-08-21T13:35:46Z)
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.