Learning UI-to-Code Reverse Generator Using Visual Critic Without
Rendering
- URL: http://arxiv.org/abs/2305.14637v2
- Date: Fri, 3 Nov 2023 06:10:33 GMT
- Title: Learning UI-to-Code Reverse Generator Using Visual Critic Without
Rendering
- Authors: Davit Soselia, Khalid Saifullah, and Tianyi Zhou
- Abstract summary: We propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code.
They are initialized by pre-trained models, but aligning the two modalities requires end-to-end finetuning.
With much lower computational cost, ViCT achieves performance comparable to that of a larger decoder such as LLaMA.
- Score: 18.74127660489501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated reverse engineering of HTML/CSS code from UI screenshots is an
important yet challenging problem with broad applications in website
development and design. In this paper, we propose a novel vision-code
transformer (ViCT) composed of a vision encoder processing the screenshots and
a language decoder to generate the code. They are initialized by pre-trained
models such as ViT/DiT and GPT-2/LLaMA but aligning the two modalities requires
end-to-end finetuning, which aims to minimize the visual discrepancy between
the code-rendered webpage and the original screenshot. However, the rendering
is non-differentiable and causes costly overhead. We address this problem by
actor-critic fine-tuning where a visual critic without rendering (ViCR) is
developed to predict visual discrepancy given the original and generated code.
To train and evaluate our models, we created two synthetic datasets of varying
complexity, with over 75,000 unique (code, screenshot) pairs. We evaluate the
UI-to-Code performance using a combination of automated metrics such as MSE,
BLEU, IoU, and a novel htmlBLEU score. ViCT outperforms a strong baseline model
DiT-GPT2, improving IoU from 0.64 to 0.79 and lowering MSE from 12.25 to 9.02.
With much lower computational cost, it achieves performance comparable to using a
larger decoder such as LLaMA.
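As a rough illustration of the rendering-free actor-critic idea described above, the PyTorch sketch below pairs a toy vision encoder and code decoder (the actor) with a code-only critic that scores visual discrepancy directly from token sequences, so no webpage rendering enters the training loop. All module sizes, names, the REINFORCE-style weighting, and the random stand-in data are assumptions for illustration, not the authors' ViCT/ViCR implementation.

```python
# Minimal sketch of rendering-free actor-critic fine-tuning for UI-to-code.
# Module sizes, names, and the REINFORCE-style update are illustrative
# assumptions; this is not the authors' ViCT/ViCR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D, MAX_LEN = 1000, 256, 64  # toy code vocabulary / hidden size / code length


class ToyVisionEncoder(nn.Module):
    """Stand-in for a pre-trained ViT/DiT: screenshot -> patch features."""
    def __init__(self):
        super().__init__()
        self.patch = nn.Conv2d(3, D, kernel_size=16, stride=16)  # 16x16 patches

    def forward(self, img):                                # img: (B, 3, 224, 224)
        return self.patch(img).flatten(2).transpose(1, 2)  # (B, 196, D)


class ToyCodeDecoder(nn.Module):
    """Stand-in for a GPT-2/LLaMA decoder conditioned on visual features."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerDecoderLayer(D, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens, memory, tgt_mask=None):      # tokens: (B, T) code ids
        h = self.dec(self.embed(tokens), memory, tgt_mask=tgt_mask)
        return self.head(h)                                 # (B, T, VOCAB) logits


class ToyVisualCritic(nn.Module):
    """ViCR-style critic: predicts visual discrepancy from code alone, no rendering."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.score = nn.Linear(2 * D, 1)

    def forward(self, ref_code, gen_code):
        ref = self.embed(ref_code).mean(dim=1)              # (B, D) pooled reference code
        gen = self.embed(gen_code).mean(dim=1)              # (B, D) pooled generated code
        return self.score(torch.cat([ref, gen], dim=-1)).squeeze(-1)


encoder, decoder, critic = ToyVisionEncoder(), ToyCodeDecoder(), ToyVisualCritic()
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# One illustrative fine-tuning step on random stand-in data for a
# (screenshot, reference code) pair from the synthetic dataset.
img = torch.rand(2, 3, 224, 224)
ref_code = torch.randint(0, VOCAB, (2, MAX_LEN))

memory = encoder(img)
inputs, targets = ref_code[:, :-1], ref_code[:, 1:]          # shift for next-token prediction
causal = nn.Transformer.generate_square_subsequent_mask(MAX_LEN - 1)
logits = decoder(inputs, memory, tgt_mask=causal)            # teacher-forced pass
ce_loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))

# Sample next tokens under teacher forcing; the critic scores the sampled code
# against the reference, and that score weights a REINFORCE-style term
# (no rendering is needed anywhere in the loop).
probs = F.softmax(logits, dim=-1)
sampled = torch.multinomial(probs.reshape(-1, VOCAB), 1).reshape(targets.shape)
with torch.no_grad():
    discrepancy = critic(ref_code, sampled)                  # lower is better
log_prob = torch.distributions.Categorical(probs=probs).log_prob(sampled)
rl_loss = (discrepancy.unsqueeze(1) * log_prob).mean()

loss = ce_loss + 0.1 * rl_loss                               # 0.1 is an arbitrary weight
opt.zero_grad()
loss.backward()
opt.step()
print(f"ce={ce_loss.item():.3f}  rl={rl_loss.item():.3f}")
```

In a full pipeline the encoder and decoder would be initialized from pre-trained ViT/DiT and GPT-2/LLaMA checkpoints, and the critic would first be trained to predict the visual discrepancy between rendered pages before serving as the reward signal sketched here.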
Related papers
- UniMatch V2: Pushing the Limit of Semi-Supervised Semantic Segmentation [26.91063423376469]
Semi-supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images.
We present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak-to-strong consistency from V1.
arXiv Detail & Related papers (2024-10-14T17:49:27Z) - Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach [51.522121376987634]
We propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code.
DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot.
We conduct extensive testing with a dataset comprising real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods (a simplified sketch of this divide-and-conquer flow appears after the related-papers list).
arXiv Detail & Related papers (2024-06-24T07:58:36Z) - A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z) - ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z) - Design2Code: How Far Are We From Automating Front-End Engineering? [83.06100360864502]
We formalize this as a Design2Code task and conduct comprehensive benchmarking.
Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases.
We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision.
Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models.
arXiv Detail & Related papers (2024-03-05T17:56:27Z) - TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z) - Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective.
However, customized algorithms, e.g., GreenMIM, must be carefully designed for hierarchical ViTs, instead of using the vanilla and simple MAE for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z) - GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z) - So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z)