WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
- URL: http://arxiv.org/abs/2510.15306v1
- Date: Fri, 17 Oct 2025 04:37:37 GMT
- Title: WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
- Authors: Kuang-Da Wang, Zhao Wang, Yotaro Shimose, Wei-Yao Wang, Shingo Takamatsu
- Abstract summary: WebGen-V is a new benchmark and framework for instruction-to-HTML generation that enhances data quality and evaluation. WebGen-V contributes three key innovations: (1) an unbounded and agentic crawling framework that continuously collects real-world webpages; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment.
- Score: 12.981748587257194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by recent advances in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, with explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.
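To make the structured, section-wise representation concrete, a minimal sketch of one section record is shown below. The abstract does not spell out the schema, so every field name here is an illustrative assumption rather than WebGen-V's actual format:

```python
# Hypothetical section-wise record in the spirit of WebGen-V's structured
# representation; all field names are assumptions for illustration.
import json

section_record = {
    "page_url": "https://example.com/landing",   # crawled source page (placeholder)
    "section_id": "hero-0",                      # which section of the page this is
    "metadata": {
        "tag_path": "body > header > section",   # DOM location of the section
        "bbox": [0, 0, 1440, 640],               # pixel region of the localized screenshot
    },
    "screenshot": "sections/hero-0.png",         # localized UI screenshot for this section
    "text_assets": {                             # JSON-formatted text content
        "heading": "Build faster websites",
        "cta": "Get started",
    },
    "image_assets": [                            # JSON-formatted image references
        {"src": "img/hero.webp", "alt": "product screenshot"},
    ],
}

print(json.dumps(section_record, indent=2))
```

Keeping text, images, and the screenshot region in one record is what lets a section-level evaluator align generated content, layout, and visuals at fine granularity.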
Related papers
- DuoGen: Towards General Purpose Interleaved Multimodal Generation [65.13479486098419]
DuoGen is a general-purpose interleaved generation framework that addresses data curation, architecture design, and evaluation. We build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences.
arXiv Detail & Related papers (2026-01-31T04:35:15Z)
- DAVE: A VLM Vision Encoder for Document Understanding and Web Agents [50.05119785399764]
We introduce DAVE, a vision encoder purpose-built for vision-language models (VLMs). Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We use ensemble training to fuse features from pretrained generalist encoders with our own document and web-specific representations.
arXiv Detail & Related papers (2025-12-19T04:09:24Z)
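The ensemble training described above fuses generalist and specialist features; one plausible, minimal reading is a concatenate-then-project head. The dimensions and fusion design below are assumptions, not DAVE's documented architecture:

```python
# Minimal sketch of feature-level ensemble fusion: concatenate per-patch features
# from a pretrained generalist encoder and a document/web-specific encoder, then
# project. The design and all dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, generalist_dim=1024, specialist_dim=768, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(generalist_dim + specialist_dim, out_dim)

    def forward(self, generalist_feats, specialist_feats):
        # Both inputs: (batch, num_patches, dim); concatenate per patch, then project.
        fused = torch.cat([generalist_feats, specialist_feats], dim=-1)
        return self.proj(fused)

head = FusionHead()
g = torch.randn(2, 196, 1024)   # placeholder generalist encoder features
s = torch.randn(2, 196, 768)    # placeholder document/web-specific features
print(head(g, s).shape)         # torch.Size([2, 196, 1024])
```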
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images. To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z)
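A snippet-level contrastive objective can be illustrated with a generic InfoNCE loss over embeddings of consecutive rendered snippets; VC2L's exact formulation may differ from this sketch:

```python
# Generic InfoNCE-style sketch: embeddings of consecutive rendered snippets from
# the same document are treated as positive pairs, all other in-batch pairs as
# negatives. An assumption-laden illustration, not VC2L's exact loss.
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(z_a, z_b, temperature=0.07):
    # z_a[i] and z_b[i] embed consecutive snippets of document i: (batch, dim).
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(z_a.size(0))           # diagonal entries are positives
    # Symmetric cross-entropy over rows and columns.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = snippet_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```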
- RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z)
- WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning [24.178675410636135]
We present a large-scale benchmark of 45.1k webpages collected from real-world portal sites. We also propose a novel evaluation metric that measures layout and style consistency from the final rendered pages.
arXiv Detail & Related papers (2025-10-05T08:47:39Z)
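As a toy illustration of scoring layout and style consistency from rendered pages, the sketch below compares element bounding boxes and a couple of computed style properties; the element matching and scoring are assumptions, not the metric proposed in the paper:

```python
# Toy layout/style consistency check between a reference rendering and a
# generated rendering. Position-aligned matching and equal weighting are
# simplifying assumptions, not the WebRenderBench metric itself.

def box_iou(a, b):
    # Boxes as (x, y, w, h) in rendered pixels.
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def consistency(ref_elems, gen_elems, style_keys=("color", "font-size")):
    # Each element: {"box": (x, y, w, h), "style": {...}}; compare aligned pairs.
    scores = []
    for r, g in zip(ref_elems, gen_elems):
        layout = box_iou(r["box"], g["box"])
        style = sum(r["style"].get(k) == g["style"].get(k) for k in style_keys) / len(style_keys)
        scores.append((layout + style) / 2)
    return sum(scores) / len(scores) if scores else 0.0

ref = [{"box": (0, 0, 100, 50), "style": {"color": "#111", "font-size": "16px"}}]
gen = [{"box": (0, 0, 90, 50), "style": {"color": "#111", "font-size": "14px"}}]
print(consistency(ref, gen))  # 0.7: high layout overlap, half the styles match
```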
- UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets [51.284864284520744]
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation. We introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning.
arXiv Detail & Related papers (2025-09-18T08:39:44Z)
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.428017294202162]
Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components.
arXiv Detail & Related papers (2025-07-14T02:10:31Z)
- CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design [6.830055289299306]
CAL-RAG is a retrieval-augmented, agentic framework for content-aware layout generation. We implement our framework using LangGraph and evaluate it on a benchmark rich in semantic variability. Results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution.
arXiv Detail & Related papers (2025-06-27T06:09:56Z)
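Retrieval augmentation combined with agentic multi-step reasoning can be pictured as a retrieve-propose-critique-refine loop. CAL-RAG itself is built on LangGraph; the dependency-free loop below and all of its callables are hypothetical stand-ins:

```python
# Plain-Python sketch of a retrieve -> propose -> critique -> refine loop in the
# spirit of retrieval-augmented agentic layout generation. Every function passed
# in here (retriever, proposer, grader) is a hypothetical LLM-backed component.

def generate_layout(brief, retriever, proposer, grader, max_rounds=3):
    examples = retriever(brief)            # retrieve similar reference layouts
    layout, feedback = None, ""
    for _ in range(max_rounds):
        layout = proposer(brief, examples, feedback)  # agent drafts/revises a layout
        ok, feedback = grader(layout, brief)          # critic agent scores the draft
        if ok:                                        # stop once the grader accepts
            break
    return layout
```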
- PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides [51.88536367177796]
We propose a two-stage, edit-based approach inspired by human drafts for automatically generating presentations. PPTAgent first analyzes references to extract slide-level functional types and content schemas, then generates editing actions based on selected reference slides. PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
arXiv Detail & Related papers (2025-01-07T16:53:01Z)
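The edit-based second stage can be pictured as emitting structured editing actions against a copied reference slide. The action vocabulary and slide format below are assumptions for illustration:

```python
# Minimal sketch of slide edits as structured actions, one way to read
# "generates editing actions based on selected reference slides". The action
# vocabulary and slide representation are assumptions.
from dataclasses import dataclass

@dataclass
class EditAction:
    op: str          # "replace_text", "swap_image", or "delete_element"
    element_id: str  # target element on the reference slide
    value: str = ""  # new content, when the op needs one

def apply_edits(slide, actions):
    # slide: {element_id: {"type": ..., "content": ...}} copied from a reference.
    for a in actions:
        if a.op == "delete_element":
            slide.pop(a.element_id, None)
        elif a.op in ("replace_text", "swap_image"):
            slide[a.element_id]["content"] = a.value
    return slide

slide = {"title-1": {"type": "text", "content": "Reference title"}}
print(apply_edits(slide, [EditAction("replace_text", "title-1", "Q3 Results")]))
```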
- DOGR: Towards Versatile Visual Document Grounding and Referring [47.66205811791444]
Grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. We propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types.
arXiv Detail & Related papers (2024-11-26T05:38:34Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
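Label verification with a multimodal LLM can be sketched as one structured prompt per (image, candidate label) pair; the model name, prompt wording, and JSON output contract below are all assumptions:

```python
# Sketch of prompting a multimodal LLM to verify a candidate entity label for an
# image, in the spirit of "label verification ... and rationale explanation".
# Model choice, prompt, and output format are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_label(image_url: str, candidate_label: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any multimodal chat model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    f'Does this image show "{candidate_label}"? '
                    'Reply with JSON: {"verified": true/false, "rationale": "..."}'},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(verify_label("https://example.com/photo.jpg", "Golden Gate Bridge"))
```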
- WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs [49.91550773480978]
This paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs.
arXiv Detail & Related papers (2024-04-09T15:05:48Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
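One common way to instantiate precision (quality) and recall (diversity) without aligned corpora is k-nearest-neighbor support estimation over embedding sets, in the style of Kynkäänniemi et al.; the paper's exact estimator may differ from this sketch:

```python
# k-NN support sketch of precision/recall over embedding sets: precision is the
# fraction of generated samples inside the real data's estimated support, recall
# the reverse. A common construction that may differ from the paper's estimator.
import numpy as np

def knn_radius(points, k=3):
    # Distance from each point to its k-th nearest neighbor within the same set.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def support_coverage(queries, refs, k=3):
    # Share of queries lying within the k-NN ball of at least one reference point.
    radii = knn_radius(refs, k)
    d = np.linalg.norm(queries[:, None] - refs[None, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

real = np.random.randn(200, 64)   # embeddings of human-written text (placeholder)
gen = np.random.randn(200, 64)    # embeddings of generated text (placeholder)
precision = support_coverage(gen, real)   # quality: generated inside real support
recall = support_coverage(real, gen)      # diversity: real inside generated support
print(precision, recall)
```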
This list is automatically generated from the titles and abstracts of the papers on this site.