Related papers: Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

URL: http://arxiv.org/abs/2504.16788v1
Date: Wed, 23 Apr 2025 15:03:37 GMT
Title: Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
Authors: Lakshita Agarwal, Bindu Verma,
Abstract summary: The proposed work introduces a novel framework for generating natural language descriptions from video datasets.<n>The suggested architecture makes use of ResNet50 to extract visual features from video frames.<n>The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model.
Score: 2.186901738997927
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD), and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model's efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.

Related papers

Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization [8.688428251722911]
This paper proposes a behaviour-aware multimodal video summarization framework.<n>It integrates textual, audio, and visual cues to generate timestamp-aligned summaries.
arXiv Detail & Related papers (2025-06-30T10:41:33Z)
VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding.<n>It covers a wide range of real-world scenarios and supports multilingual content.<n>It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation [46.994391428519776]
We present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation.<n>We propose LOVE, a LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level.
arXiv Detail & Related papers (2025-05-17T17:49:26Z)
Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism [2.186901738997927]
Tri-FusionNet is a novel image description generation model.<n>It integrates a Vision Transformer (ViT) encoder module with dual-attention mechanism, a BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module.<n>Results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.
arXiv Detail & Related papers (2025-04-23T14:33:29Z)
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP [44.51452778561945]
Vision-language models bridge visual and linguistic understanding and have proven to be powerful for video recognition tasks.<n>Existing approaches rely primarily on parameter-efficient fine-tuning of image-text pre-trained models.<n>We propose a video-to-text discretization framework to address limited interpretability and poor generalization due to inadequate temporal modeling.
arXiv Detail & Related papers (2025-03-24T07:27:19Z)
InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios. Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.<n>We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI [1.1035305628305816]
This paper introduces AIGCBench, a pioneering comprehensive benchmark designed to evaluate a variety of video generation tasks. A varied and open-domain image-text dataset that evaluates different state-of-the-art algorithms under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models.
arXiv Detail & Related papers (2024-01-03T10:08:40Z)
Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. We propose a framework which combines pre-trained discrimi VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge. We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame. IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.
arXiv Detail & Related papers (2022-01-11T16:13:43Z)
Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
Cross-modal pair (CPD) framework captures correlation between video and its associated text. We train our CPD models on both standard video dataset (Kinetics-210k) and uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.