Vision Transformer Based Model for Describing a Set of Images as a Story
- URL: http://arxiv.org/abs/2210.02762v3
- Date: Fri, 14 Jul 2023 08:42:54 GMT
- Title: Vision Transformer Based Model for Describing a Set of Images as a Story
- Authors: Zainy M. Malakan and Ghulam Mubashar Hassan and Ajmal Mian
- Abstract summary: We propose a novel Vision Transformer Based Model for describing a set of images as a story.
The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT).
The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST).
- Score: 26.717033245063092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Story-Telling is the process of forming a multi-sentence
story from a set of images. Appropriately including the visual variation and
contextual information captured in the input images is one of the most
challenging aspects of visual storytelling. Consequently, stories developed
from a set of images often lack cohesiveness, relevance, and semantic
relationships. In this paper, we propose a novel Vision Transformer Based
Model for describing a set of images as a story. The proposed method extracts
the distinct features of the input images using a Vision Transformer (ViT).
First, the input images are divided into 16x16 patches, which are flattened
and bundled into a linear projection. This transformation from a single image
to multiple image patches captures the visual variety of the input visual
patterns. These features are fed into a Bidirectional-LSTM, which forms part
of the sequence encoder and captures the past and future context of all image
patches. An attention mechanism is then applied to increase the discriminatory
capacity of the data fed into the language model, i.e., a Mogrifier-LSTM. The
performance of our proposed model is evaluated on the Visual Story-Telling
dataset (VIST), and the results show that our model outperforms the current
state-of-the-art models.
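As a concrete (and unofficial) illustration of the pipeline described above, the PyTorch sketch below wires the same four stages together: a ViT backbone produces one feature vector per image, a Bidirectional-LSTM encodes the image sequence, an attention layer re-weights the encoded context at each decoding step, and a simplified Mogrifier-LSTM cell emits the story tokens. The `timm` backbone, hidden sizes, and all module names are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency: supplies a pretrained ViT backbone


class MogrifierLSTMCell(nn.Module):
    """Simplified Mogrifier step (Melis et al.): the input and the hidden
    state gate each other for a few rounds before a standard LSTM update."""

    def __init__(self, input_size, hidden_size, rounds=4):
        super().__init__()
        self.rounds = rounds
        self.q = nn.ModuleList(
            [nn.Linear(hidden_size, input_size) for _ in range((rounds + 1) // 2)])
        self.r = nn.ModuleList(
            [nn.Linear(input_size, hidden_size) for _ in range(rounds // 2)])
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def forward(self, x, state):
        h, c = state
        for i in range(1, self.rounds + 1):
            if i % 2 == 1:   # odd round: the hidden state modulates the input
                x = 2 * torch.sigmoid(self.q[(i - 1) // 2](h)) * x
            else:            # even round: the input modulates the hidden state
                h = 2 * torch.sigmoid(self.r[i // 2 - 1](x)) * h
        return self.cell(x, (h, c))


class StorytellingModel(nn.Module):
    """Hypothetical sketch: ViT -> Bi-LSTM -> attention -> Mogrifier-LSTM."""

    def __init__(self, vocab_size, hidden=512):
        super().__init__()
        # ViT backbone: each image is cut into patches that are flattened and
        # linearly projected; num_classes=0 returns the pooled feature vector.
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        feat_dim = self.vit.num_features                     # 768 for ViT-Base
        # Bidirectional-LSTM sequence encoder over the image sequence.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Attention over the encoded image sequence at every decoding step.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=8, batch_first=True)
        self.embed = nn.Embedding(vocab_size, 2 * hidden)
        self.decoder = MogrifierLSTMCell(2 * hidden, 2 * hidden)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, images, captions):
        # images: (B, N, 3, 224, 224); captions: (B, N, T) token ids, N images per story
        B, N = images.shape[:2]
        feats = self.vit(images.flatten(0, 1)).view(B, N, -1)   # one vector per image
        ctx, _ = self.encoder(feats)                            # (B, N, 2*hidden)
        logits = []
        for n in range(N):                                      # one sentence per image
            h, c = ctx[:, n], torch.zeros_like(ctx[:, n])
            for t in range(captions.size(2) - 1):
                att, _ = self.attn(h.unsqueeze(1), ctx, ctx)    # attend to all images
                x = self.embed(captions[:, n, t]) + att.squeeze(1)
                h, c = self.decoder(x, (h, c))
                logits.append(self.out(h))
        return torch.stack(logits, dim=1)                       # (B, N*(T-1), vocab)
```

In the VIST setting each story pairs five images with five sentences, which is why the decoder loops over image positions and generates one sentence per position; training would minimize cross-entropy between the returned logits and the shifted caption tokens.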
Related papers
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
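As an illustration only, the snippet below shows one plausible way a dynamic partitioning module could choose a tile grid, and hence a number of visual tokens, from an image's size and aspect ratio. The 336-pixel base tile and the helper names are assumptions; only the 1008-pixel upper bound comes from the summary above, and this is not AdaptVision's actual code.

```python
# Hypothetical sketch of dynamic image partitioning: pick a grid of tiles
# (and therefore a number of visual tokens) from image size and aspect ratio.
# The 336-px tile size is an assumption; only the 1008-px cap comes from the
# summary above (1008 = 3 x 336 tiles per side).
from math import ceil

TILE = 336                          # assumed base resolution of one vision-encoder tile
MAX_TILES_PER_SIDE = 1008 // TILE   # 3, matching the 1008x1008 upper bound


def choose_grid(width: int, height: int) -> tuple[int, int]:
    """Return (cols, rows) of tiles for an image of the given size."""
    cols = min(MAX_TILES_PER_SIDE, max(1, ceil(width / TILE)))
    rows = min(MAX_TILES_PER_SIDE, max(1, ceil(height / TILE)))
    return cols, rows


# Example: a wide 1000x400 image gets a 3x2 grid, i.e. 6 tiles, so it spends
# more visual tokens along its wider axis than a square low-resolution crop.
print(choose_grid(1000, 400))   # (3, 2)
```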
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations [32.892042877725125]
Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image.
We show that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations.
We propose a new pretraining strategy to generate image variations using a large collection of image pairs.
arXiv Detail & Related papers (2024-05-23T17:58:03Z)
- Pix2Gif: Motion-Guided Diffusion for GIF Generation [70.64240654310754]
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation.
We propose a new motion-guided warping module to spatially transform the features of the source image conditioned on the two types of prompts.
In preparation for the model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset.
arXiv Detail & Related papers (2024-03-07T16:18:28Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is to preserve the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
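The dual learning framework mentioned in this last summary can be pictured as a round trip between the two directions of the task. The hypothetical sketch below adds such a consistency term: the frames generated for a sequence of captions are captioned back by a second model, and the mismatch with the input captions is added to the training loss. `generator`, `captioner`, their call signatures, and the loss weighting are placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a dual-learning consistency signal for story
# visualization: captions -> images (primal task), then images -> captions
# (dual task); the dual loss rewards generated frames that can be captioned
# back to the input text. `generator` and `captioner` are placeholder modules.
import torch
import torch.nn as nn


def training_step(generator: nn.Module, captioner: nn.Module,
                  caption_tokens: torch.Tensor, real_frames: torch.Tensor,
                  dual_weight: float = 0.5) -> torch.Tensor:
    """caption_tokens: (B, N, T) token ids; real_frames: (B, N, 3, H, W)."""
    # Primal task: generate one frame per caption and match the ground truth.
    fake_frames = generator(caption_tokens)                     # (B, N, 3, H, W)
    recon_loss = nn.functional.l1_loss(fake_frames, real_frames)

    # Dual task: caption the generated frames and score them against the
    # input captions (teacher forcing, cross-entropy over the vocabulary).
    logits = captioner(fake_frames, caption_tokens[..., :-1])   # (B, N, T-1, V)
    dual_loss = nn.functional.cross_entropy(
        logits.flatten(0, 2), caption_tokens[..., 1:].flatten())

    return recon_loss + dual_weight * dual_loss
```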