Multimodal Transformer for Comics Text-Cloze
- URL: http://arxiv.org/abs/2403.03719v1
- Date: Wed, 6 Mar 2024 14:11:45 GMT
- Title: Multimodal Transformer for Comics Text-Cloze
- Authors: Emanuele Vivoli, Joan Lafuente Baeza, Ernest Valveny Llobet,
Dimosthenis Karatzas
- Abstract summary: Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels.
Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations.
We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants.
- Score: 8.616858272810084
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This work explores a closure task in comics, a medium where visual and
textual elements are intricately intertwined. Specifically, Text-cloze refers
to the task of selecting the correct text to use in a comic panel, given its
neighboring panels. Traditional methods based on recurrent neural networks have
struggled with this task due to limited OCR accuracy and inherent model
limitations. We introduce a novel Multimodal Large Language Model
(Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving
a 10% improvement over existing state-of-the-art models in both its easy and
hard variants. Central to our approach is a Domain-Adapted ResNet-50 based
visual encoder, fine-tuned to the comics domain in a self-supervised manner
using SimCLR. This encoder delivers comparable results to more complex models
with just one-fifth of the parameters. Additionally, we release new OCR
annotations for this dataset, enhancing model input quality and resulting in
another 1% improvement. Finally, we extend the task to a generative format,
establishing new baselines and expanding the research possibilities in the
field of comics analysis.
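As a rough illustration of the domain adaptation described above, the sketch below fine-tunes a ResNet-50 backbone with a SimCLR-style contrastive objective in PyTorch. The augmentations, projection-head size, and temperature are generic SimCLR defaults, not the paper's reported settings.
```python
# Minimal SimCLR-style contrastive fine-tuning of a ResNet-50 backbone.
# Temperature, projection size, and augmentations are generic SimCLR
# defaults, not the paper's exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms

class SimCLRModel(nn.Module):
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # keep the features only
        self.encoder = backbone
        self.projector = nn.Sequential(         # two-layer projection head
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        return self.projector(self.encoder(x))

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over two augmented views of the same batch."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                     float("-inf"))                     # drop self-pairs
    # the positive for view i is the other view of the same panel, i +/- n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# two random augmentations of the same comic panel form a positive pair
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])
```
A training step would minimize nt_xent_loss(model(view_a), model(view_b)) over two augmented views of each batch of panels; after pre-training, the projection head is discarded and the encoder serves as the comics-domain visual encoder.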
Related papers
- Multi-modal Generation via Cross-Modal In-Context Learning [50.45304937804883]
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences.
MGCC demonstrates a diverse range of multimodal capabilities, including novel image generation, multimodal dialogue, and text generation.
arXiv Detail & Related papers (2024-05-28T15:58:31Z)
- Dense Multitask Learning to Reconfigure Comics [63.367664789203936]
We develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
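A minimal sketch of the shared-encoder, multi-head pattern such dense multitask models typically follow; the backbone, decoder heads, and task pair (semantic segmentation plus depth) are assumptions for illustration, not the paper's actual architecture.
```python
# Hypothetical shared-encoder, multi-head model for dense predictions on
# comic panels; backbone, decoders, and tasks are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class DenseMTL(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        resnet = models.resnet34(weights=None)
        # drop avgpool/fc to keep a stride-32 feature map with 512 channels
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])

        def head(out_ch):
            # lightweight per-task decoder back to input resolution
            return nn.Sequential(
                nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=32, mode="bilinear",
                            align_corners=False),
                nn.Conv2d(256, out_ch, 1),
            )

        self.seg_head = head(n_classes)  # semantic units per pixel
        self.depth_head = head(1)        # the "notion of 3D" as depth

    def forward(self, x):
        feats = self.encoder(x)          # one shared representation
        return self.seg_head(feats), self.depth_head(feats)

seg, depth = DenseMTL()(torch.rand(1, 3, 224, 224))  # (1,10,224,224), (1,1,224,224)
```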
arXiv Detail & Related papers (2023-07-16T15:10:34Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
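One way to picture graph-based reading-order recovery, under the assumption that a pairwise precedence relation drives it: edges over text boxes feed a topological sort. The spatial rule here is a hypothetical stand-in for the learned, language-agnostic model.
```python
# Toy graph-based reading-order recovery: a pairwise "a precedes b"
# relation defines edges over text boxes and a topological sort yields
# the order. The spatial rule below is a stand-in for a learned predictor.
from graphlib import TopologicalSorter

def precedes(a, b):
    # boxes are (x, y, w, h); earlier row first, then left-to-right
    ax, ay, _, ah = a
    bx, by, _, _ = b
    if abs(ay - by) > ah / 2:   # clearly different rows
        return ay < by
    return ax < bx              # same row: left to right

def reading_order(boxes):
    # graph maps each box to the set of boxes that must come before it
    graph = {i: set() for i in range(len(boxes))}
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i != j and precedes(boxes[i], boxes[j]):
                graph[j].add(i)
    return list(TopologicalSorter(graph).static_order())

boxes = [(120, 10, 80, 20), (10, 12, 90, 20), (10, 40, 200, 20)]
print(reading_order(boxes))     # -> [1, 0, 2]
```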
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
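A schematic of the two-stage flow under stated assumptions: pre-computed discrete image codes (as a ViT-VQGAN tokenizer would produce) are modeled by an encoder-decoder transformer conditioned on text tokens. All modules are untrained stand-ins showing only shapes and data flow; vocabulary and sequence sizes are illustrative.
```python
# Schematic two-stage text-to-image flow: discrete image codes modeled by
# an encoder-decoder transformer conditioned on text tokens. Untrained
# stand-ins; vocabulary and sequence sizes are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, IMG_TOKENS = 32000, 8192, 256

class Seq2SeqT2I(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, d)
        self.img_emb = nn.Embedding(IMAGE_VOCAB, d)
        self.transformer = nn.Transformer(
            d_model=d, num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True)
        self.to_logits = nn.Linear(d, IMAGE_VOCAB)

    def forward(self, text_ids, img_ids):
        # causal mask: each image token attends only to earlier ones
        mask = self.transformer.generate_square_subsequent_mask(img_ids.size(1))
        h = self.transformer(self.text_emb(text_ids), self.img_emb(img_ids),
                             tgt_mask=mask)
        return self.to_logits(h)        # next-token logits per position

text = torch.randint(0, TEXT_VOCAB, (1, 16))            # tokenized prompt
codes = torch.randint(0, IMAGE_VOCAB, (1, IMG_TOKENS))  # image token ids
logits = Seq2SeqT2I()(text, codes)                      # (1, 256, 8192)
```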
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We propose a new recognition model that integrates two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) models.
The new proposed model provides competitive results with those obtained with other well-established methodologies.
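A compact sketch of the CNN plus seq2seq combination: a small CNN turns a handwritten line image into a horizontal feature sequence, and a recurrent encoder-decoder emits characters. Layer sizes and the GRU choice are illustrative, not the paper's configuration.
```python
# Sketch of the CNN + seq2seq pattern for handwritten line recognition;
# layer sizes and the GRU choice are illustrative assumptions.
import torch
import torch.nn as nn

class HTRModel(nn.Module):
    def __init__(self, n_chars=80, d=256):
        super().__init__()
        self.cnn = nn.Sequential(                 # (B,1,32,W) -> (B,64,4,W/8)
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.encoder = nn.GRU(64 * 4, d, batch_first=True)
        self.dec_emb = nn.Embedding(n_chars, d)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, n_chars)

    def forward(self, img, prev_chars):
        f = self.cnn(img)                         # (B, 64, 4, W')
        f = f.permute(0, 3, 1, 2).flatten(2)      # (B, W', 256): a sequence
        _, h = self.encoder(f)                    # summary state of the line
        y, _ = self.decoder(self.dec_emb(prev_chars), h)
        return self.out(y)                        # character logits per step

model = HTRModel()
logits = model(torch.rand(2, 1, 32, 128), torch.zeros(2, 20, dtype=torch.long))
```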
arXiv Detail & Related papers (2021-12-26T07:31:03Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding image features into text, which leads to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
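The contrast between a frozen initial embedding and an image-conditioned one can be sketched as follows; the module names and sizes are hypothetical, not TRIG's actual design.
```python
# Frozen start embedding vs. an image-conditioned one; module names and
# sizes are hypothetical illustrations, not TRIG's actual design.
import torch
import torch.nn as nn

class GuidedDecoderStart(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # baseline behaviour: a fixed, frozen initial embedding
        self.frozen_start = nn.Parameter(torch.zeros(1, 1, d),
                                         requires_grad=False)
        # guidance: predict a per-image offset from the visual features
        self.guide = nn.Sequential(nn.Linear(d, d), nn.Tanh())

    def forward(self, img_feats):                      # (B, T, d) features
        pooled = img_feats.mean(dim=1, keepdim=True)   # global image summary
        return self.frozen_start + self.guide(pooled)  # conditioned start

start = GuidedDecoderStart()(torch.rand(2, 49, 256))   # (2, 1, 256)
```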
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
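That optimization module can be pictured as a latent-optimization loop like the sketch below: starting from the inverted code, the latent is adjusted until the generated image matches the text in a joint embedding space. The generator, image encoder, and cosine objective are placeholders rather than the paper's trained components.
```python
# Generic text-guided latent optimization; G, image_encoder, and the
# cosine objective are placeholders, not the paper's trained components.
import torch

def optimize_latent(G, image_encoder, text_emb, z_init, steps=200, lr=0.05):
    """Adjust latent z so that G(z) matches the target text embedding."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = G(z)                                  # generate from latent
        img_emb = image_encoder(img)                # embed generated image
        sim = torch.cosine_similarity(img_emb, text_emb, dim=-1)
        # stay close to the inverted code so image identity is preserved
        loss = (1 - sim).mean() + 0.01 * (z - z_init).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()                               # edited latent code
```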
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)