Dense Multitask Learning to Reconfigure Comics
- URL: http://arxiv.org/abs/2307.08071v1
- Date: Sun, 16 Jul 2023 15:10:34 GMT
- Title: Dense Multitask Learning to Reconfigure Comics
- Authors: Deblina Bhattacharjee, Sabine Süsstrunk and Mathieu Salzmann
- Abstract summary: We develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
- Score: 63.367664789203936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we develop a MultiTask Learning (MTL) model to achieve dense
predictions for comics panels to, in turn, facilitate the transfer of comics
from one publication channel to another by assisting authors in the task of
reconfiguring their narratives. Our MTL method can successfully identify the
semantic units as well as the embedded notion of 3D in comic panels. This is a
significantly challenging problem because comics comprise disparate artistic
styles, illustrations, layouts, and object scales that depend on the authors'
creative process. Typically, dense image-based prediction techniques require a
large corpus of data. Finding an automated solution for dense prediction in the
comics domain, therefore, becomes more difficult with the lack of ground-truth
dense annotations for the comics images. To address these challenges, we
develop the following solutions: 1) we leverage a commonly-used strategy known
as unsupervised image-to-image translation, which allows us to utilize a large
corpus of real-world annotations; 2) we utilize the results of the translations
to develop our multitasking approach that is based on a vision transformer
backbone and a domain transferable attention module; 3) we study the
feasibility of integrating our MTL dense-prediction method with an existing
retargeting method, thereby reconfiguring comics.
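The abstract describes a shared-backbone multitask design: one feature extractor feeds separate dense-prediction heads (e.g. semantic segmentation and depth). As a rough illustration of that structure only — the names, shapes, and linear layers below are toy assumptions, not the paper's actual ViT backbone or attention module — a minimal per-pixel multitask forward pass can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense multitask model: a shared "backbone" feeds two task heads,
# one producing per-pixel semantic class logits and one regressing depth.
H, W, C = 8, 8, 4        # spatial size and input channels of a panel crop
D = 16                   # shared feature dimension
N_CLASSES = 5            # illustrative semantic classes (character, balloon, ...)

W_shared = rng.normal(size=(C, D)) * 0.1   # shared backbone weights
W_seg = rng.normal(size=(D, N_CLASSES)) * 0.1
W_depth = rng.normal(size=(D, 1)) * 0.1

def forward(x):
    """x: (H, W, C) panel crop -> per-pixel class logits and depth map."""
    feats = np.maximum(x @ W_shared, 0.0)   # shared representation (ReLU)
    seg_logits = feats @ W_seg              # (H, W, N_CLASSES)
    depth = feats @ W_depth                 # (H, W, 1)
    return seg_logits, depth

x = rng.normal(size=(H, W, C))
seg_logits, depth = forward(x)
print(seg_logits.shape, depth.shape)
```

Both heads read the same features, which is what lets annotations translated from real-world imagery supervise several dense tasks at once; the paper's domain transferable attention module (not modeled here) handles the comics-vs-real-world domain gap.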
Related papers
- One missing piece in Vision and Language: A Survey on Comics Understanding [13.766672321462435]
This survey is the first to propose a task-oriented framework for comics intelligence.
It aims to guide future research by addressing critical gaps in data availability and task definition.
arXiv Detail & Related papers (2024-09-14T18:26:26Z)
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model [76.02314305164595]
This work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users.
We take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image.
In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to further enhance the interaction between specific spatial regions of the image and corresponding parts of the text prompts.
arXiv Detail & Related papers (2024-06-03T07:14:19Z) - Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z) - Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion [35.25298023240529]
We propose a novel zero-shot approach to identify characters and predict speaker names based solely on unannotated comic images.
Our method requires no training data or annotations; it can be used as-is on any comic series.
arXiv Detail & Related papers (2024-04-22T08:59:35Z) - Multimodal Transformer for Comics Text-Cloze [8.616858272810084]
Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels.
Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations.
We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants.
arXiv Detail & Related papers (2024-03-06T14:11:45Z) - IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
- Fully Context-Aware Image Inpainting with a Learned Semantic Pyramid [102.24539566851809]
Restoring reasonable and realistic content for arbitrary missing regions in images is an important yet challenging task.
Recent image inpainting models have made significant progress in generating vivid visual details, but they can still lead to texture blurring or structural distortions.
We propose the Semantic Pyramid Network (SPN) motivated by the idea that learning multi-scale semantic priors can greatly benefit the recovery of locally missing content in images.
arXiv Detail & Related papers (2021-12-08T04:33:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of this list (including all information) and is not responsible for any consequences of its use.