WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
- URL: http://arxiv.org/abs/2506.07818v1
- Date: Mon, 09 Jun 2025 14:46:02 GMT
- Title: WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
- Authors: Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, Xuelong Li
- Abstract summary: Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. We propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code.
- Score: 57.45181837786448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. An extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and the various weaknesses that models encounter during the development process.
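As a rough illustration of the multi-view evaluation idea described in the abstract, the sketch below scores a model separately on each of the four sub-capability areas. The loader schema, model interface, and exact-match scoring are hypothetical placeholders, not part of WebUIBench's released tooling.

```python
# Minimal sketch of a per-sub-capability evaluation loop
# (hypothetical API, not the official WebUIBench harness).
from collections import defaultdict

# The four areas named in the WebUIBench abstract.
AREAS = ["WebUI Perception", "HTML Programming",
         "WebUI-HTML Understanding", "WebUI-to-Code"]

def evaluate(model, qa_pairs):
    """Score `model` per area; `qa_pairs` is an iterable of dicts with
    'area', 'question', and 'answer' keys (assumed schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in qa_pairs:
        prediction = model(item["question"])              # model is any callable
        total[item["area"]] += 1
        if prediction.strip() == item["answer"].strip():  # exact-match placeholder
            correct[item["area"]] += 1
    return {area: correct[area] / total[area] for area in AREAS if total[area]}

# Example usage with a trivial stand-in model and two fake items.
if __name__ == "__main__":
    fake_items = [
        {"area": "HTML Programming", "question": "Tag for a hyperlink?", "answer": "<a>"},
        {"area": "WebUI Perception", "question": "Is the button visible?", "answer": "yes"},
    ]
    dummy_model = lambda q: "<a>" if "hyperlink" in q else "no"
    print(evaluate(dummy_model, fake_items))
```

In a fuller harness, the exact-match check would be replaced by area-specific scoring (e.g., rendering and comparing generated HTML for WebUI-to-Code), but the per-area aggregation pattern stays the same.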
Related papers
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation [31.237236649603123]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering. DesignBench is a benchmark for assessing MLLMs' capabilities in automated front-end engineering.
arXiv Detail & Related papers (2025-06-06T17:21:21Z)
- WebGames: Challenging General-Purpose Web-Browsing AI Agents [11.320069795732058]
WebGames is a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate compared to human performance of 95.7%.
arXiv Detail & Related papers (2025-02-25T16:45:08Z)
- An efficient approach to represent enterprise web application structure using Large Language Model in the service of Intelligent Quality Engineering [0.0]
This paper presents a novel approach to representing enterprise web application structures using Large Language Models (LLMs). We introduce a hierarchical representation methodology that optimizes the few-shot learning capabilities of LLMs. Our methodology addresses existing challenges around the usage of Generative AI techniques in automated software testing.
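As a loose illustration of what a hierarchical page-structure representation for few-shot prompting might look like (an assumed format, not the paper's actual schema), the snippet below nests pages, sections, and interactive elements and serializes them compactly for inclusion in an LLM prompt.

```python
# Assumed, simplified hierarchical representation of an enterprise web app;
# not the schema proposed in the paper.
import json

app_structure = {
    "application": "OrderPortal",            # hypothetical application name
    "pages": [
        {
            "name": "Login",
            "sections": [
                {"name": "CredentialsForm",
                 "elements": ["username_input", "password_input", "submit_button"]},
            ],
        },
        {
            "name": "Orders",
            "sections": [
                {"name": "OrderTable",
                 "elements": ["search_box", "date_filter", "export_button"]},
            ],
        },
    ],
}

def to_prompt_block(structure):
    """Serialize the hierarchy compactly so it can be pasted into a few-shot prompt."""
    return json.dumps(structure, indent=2)

print(to_prompt_block(app_structure))
```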
arXiv Detail & Related papers (2025-01-12T15:10:57Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks.
VisualWebBench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
- Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task. We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics. Our fine-grained breakdown metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
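To make the "recalling visual elements" notion concrete, here is a toy block-level recall metric in the spirit of Design2Code's fine-grained breakdown (a hypothetical simplification, not the paper's actual metric): it checks what fraction of the reference page's visible text blocks reappear in the generated page.

```python
# Toy text-block recall between a reference and a generated webpage
# (a simplified stand-in, not Design2Code's official metric).
import re

def visible_text_blocks(html):
    """Crudely strip tags and split the remaining text into blocks."""
    no_script = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", "\n", no_script)
    return {block.strip().lower() for block in text.split("\n") if block.strip()}

def block_recall(reference_html, generated_html):
    """Fraction of reference text blocks that also appear in the generated page."""
    ref = visible_text_blocks(reference_html)
    gen = visible_text_blocks(generated_html)
    return len(ref & gen) / len(ref) if ref else 1.0

# Example: one of two reference blocks is recalled -> recall = 0.5
ref = "<h1>Shop</h1><p>Free shipping over $50</p>"
gen = "<h1>Shop</h1><p>Fast delivery</p>"
print(block_recall(ref, gen))  # 0.5
```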
arXiv Detail & Related papers (2024-03-05T17:56:27Z)
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models [65.18602126334716]
Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots.
We introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites.
We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups.
arXiv Detail & Related papers (2024-01-25T03:33:18Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform well on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are largely limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating them with per-sample criteria, using a potent MLLM as the judge.
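As a rough sketch of the per-sample-criteria judging paradigm summarized above (not MLLM-Bench's actual implementation), the snippet below builds a judge prompt that pairs each sample with its own grading criterion and asks a judge model for a score. The `call_judge_model` callable is a hypothetical placeholder for whatever MLLM API serves as the judge.

```python
# Hypothetical illustration of per-sample-criteria judging; `call_judge_model`
# stands in for an actual MLLM judge API and is not a real library call.
def build_judge_prompt(question, model_answer, criterion):
    """Pair one sample with its own grading criterion."""
    return (
        "You are grading a model's answer.\n"
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        f"Grading criterion for this sample: {criterion}\n"
        "Reply with a single integer score from 1 to 10."
    )

def judge_samples(samples, call_judge_model):
    """`samples` is a list of dicts with 'question', 'answer', and 'criterion'
    keys (assumed schema); `call_judge_model` maps a prompt string to a reply."""
    scores = []
    for s in samples:
        prompt = build_judge_prompt(s["question"], s["answer"], s["criterion"])
        reply = call_judge_model(prompt)
        digits = "".join(ch for ch in reply if ch.isdigit())
        scores.append(int(digits) if digits else 0)  # crude score extraction
    return sum(scores) / len(scores) if scores else 0.0
```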
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)