Related papers: Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

URL: http://arxiv.org/abs/2411.03292v1
Date: Tue, 05 Nov 2024 17:40:03 GMT
Title: Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?
Authors: Jingyu Xiao, Yuxuan Wan, Yintong Huo, Zhiyao Xu, Michael R. Lyu,
Abstract summary: We present the first systematic investigation of multi-modal large language models (MLLMs) in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations.
Score: 30.540795619470483
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Converting webpage design into functional UI code is a critical step for building websites, which can be labor-intensive and time-consuming. To automate this design-to-code transformation process, various automated methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. However, these studies were merely evaluated on a narrow range of static web pages and ignored dynamic interaction elements, making them less practical for real-world website deployment. To fill in the blank, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark that contains 97 unique web pages and 213 distinct interactions, spanning 15 webpage types and 30 interaction categories. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations, thereby summarizing six findings accordingly. Our experimental results highlight the limitations of MLLMs in generating fine-grained interactive features and managing interactions with complex transformations and subtle visual modifications. We further analyze failure cases and their underlying causes, identifying 10 common failure types and assessing their severity. Additionally, our findings reveal three critical influencing factors, i.e., prompts, visual saliency, and textual descriptions, that can enhance the interaction generation performance of MLLMs. Based on these findings, we elicit implications for researchers and developers, providing a foundation for future advancements in this field. Datasets and source code are available at https://github.com/WebPAI/Interaction2Code.

Related papers

Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes.<n>This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes.<n>We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z)
Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation [25.770855154106453]
We introduce an Efficient Explicit Joint-level Interaction Model (EJIM) for generating text-guided human-object interactions. EJIM features a Dual-branch HOI Mamba that separately and efficiently models human object and motions. We show that EJIM surpasses previous works by a large margin while using only 5% of the inference time.
arXiv Detail & Related papers (2025-03-29T15:23:21Z)
WebNav: An Intelligent Agent for Voice-Controlled Web Navigation [0.0]
WebNav is a novel agent for multi-modal web navigation.<n>System combines vision-based context from screenshots with a dynamic DOM-labeling browser extension.
arXiv Detail & Related papers (2025-03-18T02:33:27Z)
Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents [23.1522773245956]
We introduce a novel paradigm that augments language agents with model-based planning. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities.
arXiv Detail & Related papers (2024-11-10T18:50:51Z)
The Compressor-Retriever Architecture for Language Model OS [20.56093501980724]
This paper explores the concept of using a language model as the core component of an operating system (OS) A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions. We introduce compressor-retriever, a model-agnostic architecture designed for life-long context management.
arXiv Detail & Related papers (2024-09-02T23:28:15Z)
Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction. The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? [115.60866817774641]
Multimodal Large Language models (MLLMs) have shown promise in web-related tasks. evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. bench is a multimodal benchmark designed to assess the capabilities of MLLMs across a variety of web tasks.
arXiv Detail & Related papers (2024-04-09T02:29:39Z)
On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment. We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
Eliciting Model Steering Interactions from Users via Data and Visual Design Probes [8.45602005745865]
Domain experts increasingly use automated data science tools to incorporate machine learning (ML) models in their work but struggle to " codify" these models when they are incorrect. For these experts, semantic interactions can provide an accessible avenue to guide and refine ML models without having to dive into its technical details. This study examines how experts with a spectrum of ML expertise use semantic interactions to update a simple classification model.
arXiv Detail & Related papers (2023-10-12T20:34:02Z)
Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph. We propose a novel Multi-GraIned Multimodal InteraCtion Network $textbf(MIMIC)$ framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning [24.87615615489849]
We present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. We propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes.
arXiv Detail & Related papers (2023-07-18T17:56:06Z)
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data. transformer models excel in handling long dependencies between input sequence elements and enable parallel processing. Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z)
Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems. We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z)
VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel textitVirtual InteRacTion mechanism, termed as VIRT, to enable full and deep interaction modeling in representation-based models. VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.