Towards Flexible Multi-modal Document Models
- URL: http://arxiv.org/abs/2303.18248v1
- Date: Fri, 31 Mar 2023 17:59:56 GMT
- Title: Towards Flexible Multi-modal Document Models
- Authors: Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, Kota Yamaguchi
- Abstract summary: In this work, we attempt to build a holistic model that can jointly solve many different design tasks.
Our model, which we denote by FlexDM, treats vector graphic documents as a harmonious set of multi-modal elements.
Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks.
- Score: 27.955214767628107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creative workflows for generating graphical documents involve complex
inter-related tasks, such as aligning elements, choosing appropriate fonts, or
employing aesthetically harmonious colors. In this work, we attempt to build
a holistic model that can jointly solve many different design tasks. Our model,
which we denote by FlexDM, treats vector graphic documents as a set of
multi-modal elements, and learns to predict masked fields such as element type,
position, styling attributes, image, or text, using a unified architecture.
Through the use of explicit multi-task learning and in-domain pre-training, our
model can better capture the multi-modal relationships among the different
document fields. Experimental results corroborate that our single FlexDM is
able to successfully solve a multitude of different design tasks, while
achieving performance that is competitive with task-specific and costly
baselines.
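To make the masked-field formulation concrete, here is a minimal sketch of predicting masked element fields with a shared transformer encoder and per-field prediction heads. The field set, vocabulary sizes, and model dimensions below are illustrative assumptions, not FlexDM's actual implementation.

```python
# Minimal sketch of masked multi-modal field prediction over document elements.
# Field names, vocabulary sizes, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedFieldPredictor(nn.Module):
    def __init__(self, num_types=16, num_pos_bins=64, d_model=256):
        super().__init__()
        # One embedding table per categorical field; the last index is reserved for [MASK].
        self.type_emb = nn.Embedding(num_types + 1, d_model)
        self.x_emb = nn.Embedding(num_pos_bins + 1, d_model)
        self.y_emb = nn.Embedding(num_pos_bins + 1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # One prediction head per field, applied to every element token.
        self.type_head = nn.Linear(d_model, num_types)
        self.x_head = nn.Linear(d_model, num_pos_bins)
        self.y_head = nn.Linear(d_model, num_pos_bins)

    def forward(self, type_ids, x_ids, y_ids):
        # Sum the field embeddings into one token per element, then contextualize.
        h = self.type_emb(type_ids) + self.x_emb(x_ids) + self.y_emb(y_ids)
        h = self.encoder(h)
        return {"type": self.type_head(h), "x": self.x_head(h), "y": self.y_head(h)}

# Toy usage: a document with 5 elements; a field value equal to the vocabulary size means "masked".
model = MaskedFieldPredictor()
type_ids = torch.randint(0, 16, (1, 5))
x_ids = torch.randint(0, 64, (1, 5))
y_ids = torch.full((1, 5), 64)           # mask the y-position of every element
logits = model(type_ids, x_ids, y_ids)
target_y = torch.randint(0, 64, (1, 5))  # ground-truth bins for the masked field
loss = nn.functional.cross_entropy(logits["y"].flatten(0, 1), target_y.flatten())
```

In this sketch each design task (e.g., position prediction, attribute completion) corresponds to a different choice of masked fields, which is what allows a single set of weights to serve many tasks.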
Related papers
- GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts.
We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously.
To support instruction-tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z)
- PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM [58.67882997399021]
Our research introduces a unified framework for automated graphic layout generation.
Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts.
We conduct extensive experiments and achieve state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks.
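As an illustration of the structured-text representation mentioned above, the following is a hypothetical JSON-style layout; the schema, element types, and coordinates are assumptions rather than the paper's actual format.

```python
import json

# Hypothetical JSON-style layout: keys, element types, and coordinates are assumptions,
# not the paper's actual schema. Boxes are [left, top, width, height] on a 0-1 canvas.
layout = {
    "canvas": {"width": 1024, "height": 1448},
    "elements": [
        {"type": "title",    "box": [0.10, 0.05, 0.80, 0.12], "text": "Summer Sale"},
        {"type": "image",    "box": [0.15, 0.22, 0.70, 0.45]},
        {"type": "subtitle", "box": [0.10, 0.72, 0.80, 0.08], "text": "Up to 50% off"},
        {"type": "logo",     "box": [0.80, 0.90, 0.15, 0.07]},
    ],
}

# Serialized layouts like this can serve as instruction-tuning targets for an LLM.
print(json.dumps(layout, indent=2))
```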
arXiv Detail & Related papers (2024-06-05T03:05:52Z)
- Mixed-Query Transformer: A Unified Image Segmentation Architecture [57.32212654642384]
Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task.
We introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single set of weights.
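A rough sketch of the mixed-query idea, assuming shared object queries concatenated with a task-conditioned query and decoded against image features by a plain transformer decoder; the query counts, dimensions, and decoder are placeholders, not MQ-Former's actual architecture.

```python
import torch
import torch.nn as nn

# Rough sketch of mixing two query sets for a shared segmentation decoder.
# Query counts, dimensions, and the plain nn.TransformerDecoder are assumptions.
d_model, n_obj = 256, 100
object_queries = nn.Parameter(torch.randn(n_obj, d_model))  # shared across tasks
task_queries = nn.Embedding(4, d_model)                      # one extra query per task id

decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

def decode(image_features, task_id):
    # image_features: (batch, num_pixels, d_model) flattened backbone features
    b = image_features.size(0)
    mixed = torch.cat([
        object_queries.unsqueeze(0).expand(b, -1, -1),
        task_queries(torch.full((b, 1), task_id)),
    ], dim=1)                                                # (batch, n_obj + 1, d_model)
    return decoder(tgt=mixed, memory=image_features)

out = decode(torch.randn(2, 196, d_model), task_id=1)
print(out.shape)  # torch.Size([2, 101, 256])
```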
arXiv Detail & Related papers (2024-04-06T01:54:17Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
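The alignment objective can be sketched as a standard symmetric InfoNCE loss over paired image and text features; the temperature and normalization choices below are generic assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE between paired image and text features (generic formulation)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))      # i-th image pairs with i-th text
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```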
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- 3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding [13.19218501758693]
The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations.
We introduce new inter-grained and cross-grained loss functions to refine the diverse multi-teacher knowledge distillation process.
Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines.
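A hedged sketch of multi-teacher distillation with losses at the token (fine-grained) and entity (coarse-grained) level; the temperature, pooling, and teacher setup are assumptions, and the paper's inter-grained and cross-grained losses may differ.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    # Standard soft-label distillation term (KL divergence at temperature T).
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# Hypothetical setup: one fine-grained (token-level) and one coarse-grained
# (entity-level) teacher distilled into a single student.
student_tok = torch.randn(4, 128, 10)    # (batch, tokens, classes)
teacher_tok = torch.randn(4, 128, 10)
student_ent = student_tok.mean(dim=1)    # crude entity pooling; an assumption
teacher_ent = torch.randn(4, 10)

loss = kd_loss(student_tok.flatten(0, 1), teacher_tok.flatten(0, 1)) \
     + kd_loss(student_ent, teacher_ent)
```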
arXiv Detail & Related papers (2024-02-28T01:56:00Z)
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Building on this joint mixing, we further propose an efficient strategy to better capture the fine-grained appearance of high-resolution images.
We hope our work can shed light on the exploration of joint mixing in future MLLM research.
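Weight mixing can be illustrated, under assumptions, by linearly interpolating the parameters of two checkpoints that share an architecture; this generic sketch is not SPHINX's actual mixing procedure.

```python
import torch
import torch.nn as nn

def mix_weights(model_a, model_b, alpha=0.5):
    """Linearly interpolate the parameters of two models with identical architecture.

    Generic weight-mixing sketch; alpha and uniform per-tensor mixing are assumptions.
    """
    sd_b = model_b.state_dict()
    mixed = {k: alpha * v + (1 - alpha) * sd_b[k]
             for k, v in model_a.state_dict().items()}
    model_a.load_state_dict(mixed)
    return model_a

# Toy usage with two small identical networks.
a = nn.Linear(4, 2)
b = nn.Linear(4, 2)
mixed = mix_weights(a, b, alpha=0.7)
```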
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners [74.92558307689265]
We propose Mod-Squad, a new model that is modularized into groups of experts (a 'Squad').
We optimize the matching between experts and tasks during the training of a single model.
Experiments on the Taskonomy dataset with 13 vision tasks and the PASCAL-Context dataset with 5 vision tasks show the superiority of our approach.
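A simplified sketch of a top-1 routed mixture-of-experts layer; the expert count, routing rule, and sizes are illustrative assumptions rather than Mod-Squad's exact design.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 routed mixture of experts; sizes and routing are illustrative assumptions."""
    def __init__(self, d_model=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model * 2), nn.GELU(),
                          nn.Linear(d_model * 2, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # routing probabilities per token
        choice = gate.argmax(dim=-1)           # top-1 expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Scale by the gate value so the routing weights stay differentiable.
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 64))
```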
arXiv Detail & Related papers (2022-12-15T18:59:52Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
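A minimal sketch of encoding concatenated multi-page text with a Longformer backbone via Hugging Face Transformers; only the textual input and a global-attention first token are shown, and the paper's visual features and pre-training heads (topic modelling, shuffle prediction) are not reproduced.

```python
# Minimal sketch: encode concatenated multi-page text with a Longformer backbone.
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

pages = ["Text of page one ...", "Text of page two ..."]
inputs = tokenizer(" ".join(pages), return_tensors="pt",
                   truncation=True, max_length=4096)

# Global attention on the first token so it can attend to the whole document.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
doc_embedding = outputs.last_hidden_state[:, 0]  # document-level representation
```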
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.