VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling
- URL: http://arxiv.org/abs/2112.05692v1
- Date: Fri, 10 Dec 2021 17:37:26 GMT
- Title: VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling
- Authors: Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, Alexey Gritsenko
- Abstract summary: VUT is a Versatile UI Transformer that takes multimodal input and simultaneously accomplishes 5 distinct tasks with the same model.
Our model consists of a multimodal Transformer encoder that jointly encodes UI images and structures, and performs UI object detection when the UI structures are absent from the input.
- Score: 11.569380762858815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User interface modeling is inherently multimodal: it involves several
distinct types of data, including images, structures and language. The tasks are
also diverse, including object detection, language generation and grounding. In
this paper, we present VUT, a Versatile UI Transformer that takes multimodal
input and simultaneously accomplishes 5 distinct tasks with the same model. Our
model comprises a multimodal Transformer encoder that jointly encodes UI images
and structures, and performs UI object detection when the UI structures are
absent from the input. Our model also includes an auto-regressive Transformer
that encodes the language input and decodes output, for both question answering
and command grounding with respect to the UI. Our experiments show that, when
trained jointly on multiple tasks, VUT substantially reduces the number of
models and the overall footprint needed to perform multiple tasks, while for
most tasks achieving accuracy exceeding or on par with baseline models trained
for each individual task.
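As a rough illustration of this two-part design, here is a minimal PyTorch sketch, assuming precomputed image-patch and view-hierarchy features. The module names, dimensions, and input featurization are illustrative assumptions, not the paper's implementation; the object-detection head and per-task query tokens are omitted.

```python
import torch
import torch.nn as nn

class VUTSketch(nn.Module):
    """Minimal sketch of a VUT-style model (illustrative, not the paper's code)."""

    def __init__(self, d_model=256, vocab_size=10000, n_heads=8, n_layers=4):
        super().__init__()
        # Image patches and structure (view-hierarchy) elements share one encoder.
        self.patch_proj = nn.Linear(768, d_model)   # assumed patch feature size
        self.elem_proj = nn.Linear(128, d_model)    # assumed element feature size
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.ui_encoder = nn.TransformerEncoder(enc, n_layers)
        # Auto-regressive text Transformer, cross-attending to the UI encoding.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(dec, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, elem_feats, text_tokens):
        # Jointly encode UI image patches and structure elements.
        ui = torch.cat([self.patch_proj(patch_feats),
                        self.elem_proj(elem_feats)], dim=1)
        memory = self.ui_encoder(ui)
        # Decode language auto-regressively, conditioned on the UI encoding.
        tgt = self.token_emb(text_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.text_decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(hidden)  # next-token logits for QA / grounding
```

A single model of this shape can then be supervised with different output targets per task, which is what allows one set of weights to replace several task-specific models.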
Related papers
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion [81.7514869897233]
We develop MuLan, a training-free multimodal-LLM agent that, like a human painter, progressively generates multi-object images.
MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion.
MuLan also adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to make the diffusion model re-generate the image if it violates the original prompt, as sketched after this entry.
arXiv Detail & Related papers (2024-02-20T06:14:30Z)
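The agent loop described above can be made concrete with a short sketch. The three callables below (llm_plan, t2i_generate, vlm_check) are hypothetical stand-ins for the LLM planner, the Stable Diffusion stage, and the VLM checker; they are not MuLan's actual API.

```python
def mulan_generate(prompt, llm_plan, t2i_generate, vlm_check, max_retries=3):
    """MuLan-style agent loop (illustrative sketch; all callables are hypothetical).

    llm_plan(prompt)           -> list of single-object sub-task strings
    t2i_generate(subtask, img) -> new image with one more object added
    vlm_check(img, prompt)     -> True if the image still matches the prompt
    """
    image = None
    for subtask in llm_plan(prompt):          # LLM decomposes the prompt
        for _ in range(max_retries):
            candidate = t2i_generate(subtask, image)  # generate one object
            if vlm_check(candidate, prompt):          # VLM feedback
                break                                 # accept this attempt
        image = candidate          # keep accepted (or best-effort) attempt
    return image
```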
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even showing emergent ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Task-Based MoE for Multitask Multilingual Machine Translation [58.20896429151824]
The Mixture-of-Experts (MoE) architecture has proven to be a powerful method for training deep models on diverse tasks across many applications.
In this work, we design a novel method that incorporates task information into MoE models at different levels of granularity with shared dynamic task-based adapters.
arXiv Detail & Related papers (2023-08-30T05:41:29Z)
- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354]
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks.
Previous studies typically process multi-task features locally, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose FAME-ViL, a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks.
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [24.870827400461682]
We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task.
The entire model is jointly trained end-to-end with losses from each task, as sketched after this entry.
arXiv Detail & Related papers (2021-02-22T04:45:06Z)
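Concretely, joint end-to-end training with per-task losses can be as simple as summing each task's loss before a single optimizer step. A minimal PyTorch sketch of that idea follows, where the shared trunk, task heads, and batch format are hypothetical placeholders rather than UniT's actual modules.

```python
import torch

def joint_step(trunk, task_heads, task_batches, optimizer):
    """One UniT-style multi-task update (illustrative sketch, not UniT's code).

    trunk(inputs)              -> shared representation for any modality
    task_heads[task](rep, y)   -> scalar loss tensor for that task
    task_batches               -> {task_name: (inputs, targets)}
    """
    optimizer.zero_grad()
    total = torch.zeros(())
    for task, (inputs, targets) in task_batches.items():
        rep = trunk(inputs)                          # shared trunk
        total = total + task_heads[task](rep, targets)  # task-specific loss
    total.backward()       # gradients from all tasks accumulate in shared weights
    optimizer.step()
    return float(total)
```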
- DynE: Dynamic Ensemble Decoding for Multi-Document Summarization [5.197307534263253]
We propose a simple decoding methodology that ensembles the output of multiple instances of the same model on different inputs, as sketched after this entry.
We obtain state-of-the-art results on several multi-document summarization datasets.
arXiv Detail & Related papers (2020-06-15T20:40:06Z)
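Below is a minimal sketch of this ensembling idea for greedy decoding: the same model is run once per input document, and at every step the per-document next-token log-probabilities are averaged before one shared token is emitted. The model(input_ids, decoder_ids) signature returning per-position logits is an assumption for illustration, not DynE's actual interface.

```python
import torch

@torch.no_grad()
def dyne_greedy_decode(model, documents, bos_id, eos_id, max_len=128):
    """DynE-style dynamic ensemble decoding (illustrative sketch).

    One instance of the same model scores each document; the ensemble shares
    a single decoder prefix and averages next-token distributions per step.
    """
    decoded = [bos_id]
    for _ in range(max_len):
        prefix = torch.tensor([decoded])
        # Next-token log-probabilities from each instance, same decoder prefix.
        step = [torch.log_softmax(model(doc, prefix)[:, -1, :], dim=-1)
                for doc in documents]
        avg = torch.stack(step).mean(dim=0)   # ensemble by averaging
        next_id = int(avg.argmax(dim=-1))
        decoded.append(next_id)
        if next_id == eos_id:
            break
    return decoded
```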
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.