WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
- URL: http://arxiv.org/abs/2510.15306v1
- Date: Fri, 17 Oct 2025 04:37:37 GMT
- Title: WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
- Authors: Kuang-Da Wang, Zhao Wang, Yotaro Shimose, Wei-Yao Wang, Shingo Takamatsu
- Abstract summary: WebGen-V is a new benchmark and framework for instruction-to-HTML generation that enhances data quality and evaluation. WebGen-V contributes three key innovations: (1) an unbounded and agentic crawling framework that continuously collects real-world webpages; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment.
- Score: 12.981748587257194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by recent advances in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, with explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.
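To make the structured, section-wise representation concrete, a minimal sketch of one section record is shown below. The abstract does not spell out the schema, so every field name here is an illustrative assumption rather than WebGen-V's actual format:

```python
# Hypothetical section-wise record in the spirit of WebGen-V's structured
# representation; all field names are assumptions for illustration.
import json

section_record = {
    "page_url": "https://example.com/landing",   # crawled source page (placeholder)
    "section_id": "hero-0",                      # which section of the page this is
    "metadata": {
        "tag_path": "body > header > section",   # DOM location of the section
        "bbox": [0, 0, 1440, 640],               # pixel region of the localized screenshot
    },
    "screenshot": "sections/hero-0.png",         # localized UI screenshot for this section
    "text_assets": {                             # JSON-formatted text content
        "heading": "Build faster websites",
        "cta": "Get started",
    },
    "image_assets": [                            # JSON-formatted image references
        {"src": "img/hero.webp", "alt": "product screenshot"},
    ],
}

print(json.dumps(section_record, indent=2))
```

Keeping text, images, and the screenshot region in one record is what lets a section-level evaluator align generated content, layout, and visuals at fine granularity.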
Related papers
- DuoGen: Towards General Purpose Interleaved Multimodal Generation [65.13479486098419]
DuoGen is a general-purpose interleaved generation framework that addresses data curation, architecture design, and evaluation. We build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences.
arXiv Detail & Related papers (2026-01-31T04:35:15Z)
- DAVE: A VLM Vision Encoder for Document Understanding and Web Agents [50.05119785399764]
We introduce DAVE, a vision encoder purpose-built for vision-language models (VLMs). Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We use ensemble training to fuse features from pretrained generalist encoders with our own document and web-specific representations.
arXiv Detail & Related papers (2025-12-19T04:09:24Z)
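The ensemble training described above fuses generalist and specialist features; one plausible, minimal reading is a concatenate-then-project head. The dimensions and fusion design below are assumptions, not DAVE's documented architecture:

```python
# Minimal sketch of feature-level ensemble fusion: concatenate per-patch features
# from a pretrained generalist encoder and a document/web-specific encoder, then
# project. The design and all dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, generalist_dim=1024, specialist_dim=768, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(generalist_dim + specialist_dim, out_dim)

    def forward(self, generalist_feats, specialist_feats):
        # Both inputs: (batch, num_patches, dim); concatenate per patch, then project.
        fused = torch.cat([generalist_feats, specialist_feats], dim=-1)
        return self.proj(fused)

head = FusionHead()
g = torch.randn(2, 196, 1024)   # placeholder generalist encoder features
s = torch.randn(2, 196, 768)    # placeholder document/web-specific features
print(head(g, s).shape)         # torch.Size([2, 196, 1024])
```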
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
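A snippet-level contrastive objective can be illustrated with a generic InfoNCE loss over embeddings of consecutive rendered snippets; VC2L's exact formulation may differ from this sketch:

```python
# Generic InfoNCE-style sketch: embeddings of consecutive rendered snippets from
# the same document are treated as positive pairs, all other in-batch pairs as
# negatives. An assumption-laden illustration, not VC2L's exact loss.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(z_a, z_b, temperature=0.07):
    # z_a[i] and z_b[i] embed consecutive snippets of document i: (batch, dim).
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_a.size(0))           # diagonal entries are positives
    # Symmetric cross-entropy over rows and columns.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = snippet_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```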
- RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z)
- WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning [24.178675410636135]
We present a large-scale benchmark of 45.1k webpages collected from real-world portal sites. We also propose a novel evaluation metric that measures layout and style consistency from the final rendered pages.
arXiv Detail & Related papers (2025-10-05T08:47:39Z)
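As a toy illustration of scoring layout and style consistency from rendered pages, the sketch below compares element bounding boxes and a couple of computed style properties; the element matching and scoring are assumptions, not the metric proposed in the paper:

```python
# Toy layout/style consistency check between a reference rendering and a
# generated rendering. Position-aligned matching and equal weighting are
# simplifying assumptions, not the WebRenderBench metric itself.

def box_iou(a, b):
    # Boxes as (x, y, w, h) in rendered pixels.
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def consistency(ref_elems, gen_elems, style_keys=("color", "font-size")):
    # Each element: {"box": (x, y, w, h), "style": {...}}; compare aligned pairs.
    scores = []
    for r, g in zip(ref_elems, gen_elems):
        layout = box_iou(r["box"], g["box"])
        style = sum(r["style"].get(k) == g["style"].get(k) for k in style_keys) / len(style_keys)
        scores.append((layout + style) / 2)
    return sum(scores) / len(scores) if scores else 0.0

ref = [{"box": (0, 0, 100, 50), "style": {"color": "#111", "font-size": "16px"}}]
gen = [{"box": (0, 0, 90, 50), "style": {"color": "#111", "font-size": "14px"}}]
print(consistency(ref, gen))  # 0.7: high layout overlap, half the styles match
```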
- UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets [51.284864284520744]
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation. We introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning.
arXiv Detail & Related papers (2025-09-18T08:39:44Z)
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.428017294202162]
Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components.
arXiv Detail & Related papers (2025-07-14T02:10:31Z)
- CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design [6.830055289299306]
CAL-RAG is a retrieval-augmented, agentic framework for content-aware layout generation. We implement our framework using LangGraph and evaluate it on a benchmark rich in semantic variability. Results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution.
arXiv Detail & Related papers (2025-06-27T06:09:56Z)
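Retrieval augmentation combined with agentic multi-step reasoning can be pictured as a retrieve-propose-critique-refine loop. CAL-RAG itself is built on LangGraph; the dependency-free loop below and all of its callables are hypothetical stand-ins:

```python
# Plain-Python sketch of a retrieve -> propose -> critique -> refine loop in the
# spirit of retrieval-augmented agentic layout generation. Every function passed
# in here (retriever, proposer, grader) is a hypothetical LLM-backed component.

def generate_layout(brief, retriever, proposer, grader, max_rounds=3):
    examples = retriever(brief)            # retrieve similar reference layouts
    layout, feedback = None, ""
    for _ in range(max_rounds):
        layout = proposer(brief, examples, feedback)  # agent drafts/revises a layout
        ok, feedback = grader(layout, brief)          # critic agent scores the draft
        if ok:                                        # stop once the grader accepts
            break
    return layout
```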
- PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [51.88536367177796]
We propose a two-stage, edit-based approach inspired by human drafts for automatically generating presentations. PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides. PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
arXiv Detail & Related papers (2025-01-07T16:53:01Z)
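The edit-based second stage can be pictured as emitting structured editing actions against a copied reference slide. The action vocabulary and slide format below are assumptions for illustration:

```python
# Minimal sketch of slide edits as structured actions, one way to read
# "generates editing actions based on selected reference slides". The action
# vocabulary and slide representation are assumptions.
from dataclasses import dataclass

@dataclass
class EditAction:
    op: str          # "replace_text", "swap_image", or "delete_element"
    element_id: str  # target element on the reference slide
    value: str = ""  # new content, when the op needs one

def apply_edits(slide, actions):
    # slide: {element_id: {"type": ..., "content": ...}} copied from a reference.
    for a in actions:
        if a.op == "delete_element":
            slide.pop(a.element_id, None)
        elif a.op in ("replace_text", "swap_image"):
            slide[a.element_id]["content"] = a.value
    return slide

slide = {"title-1": {"type": "text", "content": "Reference title"}}
print(apply_edits(slide, [EditAction("replace_text", "title-1", "Q3 Results")]))
```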
- DOGR: Towards Versatile Visual Document Grounding and Referring [47.66205811791444]
Grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. We propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types.
arXiv Detail & Related papers (2024-11-26T05:38:34Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
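Label verification with a multimodal LLM can be sketched as one structured prompt per (image, candidate label) pair; the model name, prompt wording, and JSON output contract below are all assumptions:

```python
# Sketch of prompting a multimodal LLM to verify a candidate entity label for an
# image, in the spirit of "label verification ... and rationale explanation".
# Model choice, prompt, and output format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_label(image_url: str, candidate_label: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any multimodal chat model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    f'Does this image show "{candidate_label}"? '
                    'Reply with JSON: {"verified": true/false, "rationale": "..."}'},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(verify_label("https://example.com/photo.jpg", "Golden Gate Bridge"))
```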
- WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs [49.91550773480978]
This paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs.
arXiv Detail & Related papers (2024-04-09T15:05:48Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
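One common way to instantiate precision (quality) and recall (diversity) without aligned corpora is k-nearest-neighbor support estimation over embedding sets, in the style of Kynkäänniemi et al.; the paper's exact estimator may differ from this sketch:

```python
# k-NN support sketch of precision/recall over embedding sets: precision is the
# fraction of generated samples inside the real data's estimated support, recall
# the reverse. A common construction that may differ from the paper's estimator.
import numpy as np

def knn_radius(points, k=3):
    # Distance from each point to its k-th nearest neighbor within the same set.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def support_coverage(queries, refs, k=3):
    # Share of queries lying within the k-NN ball of at least one reference point.
    radii = knn_radius(refs, k)
    d = np.linalg.norm(queries[:, None] - refs[None, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

real = np.random.randn(200, 64)   # embeddings of human-written text (placeholder)
gen = np.random.randn(200, 64)    # embeddings of generated text (placeholder)
precision = support_coverage(gen, real)   # quality: generated inside real support
recall = support_coverage(real, gen)      # diversity: real inside generated support
print(precision, recall)
```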
This list is automatically generated from the titles and abstracts of the papers on this site.