Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism
- URL: http://arxiv.org/abs/2504.16761v1
- Date: Wed, 23 Apr 2025 14:33:29 GMT
- Title: Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism
- Authors: Lakshita Agarwal, Bindu Verma
- Abstract summary: Tri-FusionNet is a novel image description generation model. It integrates a Vision Transformer (ViT) encoder module with a dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module. Results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.
- Score: 2.186901738997927
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image description generation is essential for accessibility and AI understanding of visual content. Recent advancements in deep learning have significantly improved natural language processing and computer vision. In this work, we propose Tri-FusionNet, a novel image description generation model that integrates three transformer modules: a Vision Transformer (ViT) encoder module with a dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module. The ViT encoder, enhanced with dual attention, focuses on relevant spatial regions and linguistic context, improving image feature extraction. The RoBERTa decoder is employed to generate precise textual descriptions. The CLIP integrating module aligns visual and textual data through contrastive learning, ensuring an effective combination of both modalities. This fusion of ViT, RoBERTa, and CLIP, along with dual attention, enables the model to produce more accurate, contextually rich, and flexible descriptions. The proposed framework demonstrated competitive performance on the Flickr30k and Flickr8k datasets, with BLEU scores ranging from 0.767 to 0.456 and from 0.784 to 0.479, CIDEr scores of 1.679 and 1.483, METEOR scores of 0.478 and 0.358, and ROUGE-L scores of 0.567 and 0.789, respectively. On MS-COCO, the framework obtained BLEU scores of 0.893 (B-1), 0.821 (B-2), 0.794 (B-3), and 0.725 (B-4). The results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.
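The abstract describes the architecture only at a high level. Below is a minimal, illustrative PyTorch sketch of how such a fusion could be wired, assuming the dual attention consists of spatial self-attention over ViT patch features plus cross-attention from those patches to RoBERTa token states, and assuming the CLIP integrating module is realised as a symmetric contrastive (InfoNCE) loss; the module names, dimensions, and hyperparameters are placeholders rather than the authors' implementation.

```python
# Illustrative sketch only: the exact Tri-FusionNet wiring is not given in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttentionFusion(nn.Module):
    """Fuses ViT patch features with RoBERTa token features (assumed design)."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # spatial attention over patches
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # attention to linguistic context
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vit_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Attend over spatial regions (image patches).
        v, _ = self.self_attn(vit_feats, vit_feats, vit_feats)
        v = self.norm1(vit_feats + v)
        # Attend from image patches to the decoder's token states.
        c, _ = self.cross_attn(v, text_feats, text_feats)
        return self.norm2(v + c)


def clip_style_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE) loss aligning image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))        # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, P, T, D = 4, 197, 20, 768                   # batch, ViT patches (+CLS), tokens, hidden dim
    vit_feats = torch.randn(B, P, D)               # stand-in for ViT encoder output
    text_feats = torch.randn(B, T, D)              # stand-in for RoBERTa decoder states
    fused = DualAttentionFusion()(vit_feats, text_feats)
    loss = clip_style_alignment_loss(fused.mean(dim=1), text_feats.mean(dim=1))
    print(fused.shape, loss.item())
```

In the full model, `vit_feats` and `text_feats` would come from pretrained ViT and RoBERTa networks, and the contrastive alignment term would be trained jointly with the caption generation objective.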
Related papers
- Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation [2.186901738997927]
The proposed work introduces a novel framework for generating natural language descriptions from video datasets.
The suggested architecture makes use of ResNet50 to extract visual features from video frames.
The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model.
arXiv Detail & Related papers (2025-04-23T15:03:37Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - An Ensemble Model with Attention Based Mechanism for Image Captioning [1.249418440326334]
In this paper, we examine transformer models, emphasizing the critical role that attention mechanisms play. The proposed model uses a transformer encoder-decoder architecture to create textual captions and a deep learning convolutional neural network to extract features from the images. To create the captions, we present a novel ensemble learning framework that improves the richness of the generated captions.
arXiv Detail & Related papers (2025-01-22T12:28:37Z) - STIV: Scalable Text and Image Conditioned Video Generation [84.2574247093223]
We present a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image conditioning into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning. STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation.
arXiv Detail & Related papers (2024-12-10T18:27:06Z) - Guided Score identity Distillation for Data-Free One-Step Text-to-Image Generation [62.30570286073223]
Diffusion-based text-to-image generation models have demonstrated the ability to produce images aligned with textual descriptions. We introduce a data-free guided distillation method that enables the efficient distillation of pretrained diffusion models without access to the real training data. By exclusively training with synthetic images generated by its one-step generator, our data-free distillation method rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score.
arXiv Detail & Related papers (2024-06-03T17:44:11Z) - Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models. Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation [8.26410341981427]
The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal.
We propose a single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples.
arXiv Detail & Related papers (2023-05-17T11:12:07Z) - CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network [0.5260346080244567]
We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 image encoder to extract features from images.
Experiments demonstrate that the language decoder in our technique captures fine-grained information in the caption and, paired with image features, produces accurate and diverse captions.
arXiv Detail & Related papers (2021-10-24T13:33:23Z) - Improved Bengali Image Captioning via deep convolutional neural network based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption and, combined with the image features, generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout, and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)