Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
 - URL: http://arxiv.org/abs/2504.16788v1
 - Date: Wed, 23 Apr 2025 15:03:37 GMT
 - Title: Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
 - Authors: Lakshita Agarwal, Bindu Verma
 - Abstract summary: The proposed work introduces a novel framework for generating natural language descriptions from video datasets. The suggested architecture makes use of ResNet50 to extract visual features from video frames. The extracted visual features are converted into patch embeddings and then passed through an encoder-decoder model.
 - Score: 2.186901738997927
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract: Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The architecture uses ResNet50 to extract visual features from video frames taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual features are converted into patch embeddings and then passed through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). To align textual and visual representations and ensure high-quality description generation, the system uses multi-head self-attention and cross-attention mechanisms. The model's efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The proposed framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.
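
As a concrete illustration of the pipeline described in the abstract, the sketch below shows how per-frame ResNet50 features can be projected into GPT-2's embedding space and consumed as cross-attention memory by a GPT-2 decoder. This is a minimal sketch using PyTorch and Hugging Face transformers under stated assumptions: the class name VideoCaptioner, the 16-frame sampling, the single linear projection, and the greedy decoding loop are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of a ResNet50 -> patch embeddings -> GPT-2 (with cross-attention) captioner.
# Assumed details: VideoCaptioner, FRAMES_PER_CLIP, the linear projection, and greedy decoding.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

FRAMES_PER_CLIP = 16  # assumed number of frames sampled per video clip

class VideoCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        # ResNet50 backbone: drop the classification head, keep 2048-d pooled features.
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # Project per-frame CNN features into GPT-2's 768-d embedding space ("patch embeddings").
        self.proj = nn.Linear(2048, 768)
        # GPT-2 decoder with cross-attention so text tokens can attend to the visual tokens.
        cfg = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=cfg)

    def forward(self, frames, input_ids, attention_mask=None, labels=None):
        # frames: (batch, num_frames, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)    # (b*t, 2048)
        visual_tokens = self.proj(feats).view(b, t, -1)       # (b, t, 768)
        # Cross-attention: the decoder attends over the sequence of frame embeddings.
        return self.decoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            encoder_hidden_states=visual_tokens,
            labels=labels,
        )

# Usage sketch: greedy decoding of a short caption for one (random stand-in) clip.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = VideoCaptioner().eval()
clip = torch.randn(1, FRAMES_PER_CLIP, 3, 224, 224)  # stand-in for preprocessed frames
ids = torch.tensor([[tok.bos_token_id]])
with torch.no_grad():
    for _ in range(20):
        logits = model(clip, ids).logits
        ids = torch.cat([ids, logits[:, -1:].argmax(-1)], dim=-1)
print(tok.decode(ids[0], skip_special_tokens=True))
```

The reported metrics (BLEU 1-4, METEOR, ROUGE-L, CIDEr) would typically be computed over the generated captions with standard captioning scorers such as the pycocoevalcap package; the abstract does not specify the evaluation tooling, so that choice is likewise an assumption here.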
 
       
      
        Related papers
- Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization [8.688428251722911] (arXiv, 2025-06-30)
  This paper proposes a behaviour-aware multimodal video summarization framework. It integrates textual, audio, and visual cues to generate timestamp-aligned summaries.

- VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558] (arXiv, 2025-05-28)
  VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.

- LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation [46.994391428519776] (arXiv, 2025-05-17)
  We present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation. We propose LOVE, an LMM-based metric for AIGV evaluation across multiple dimensions, including perceptual preference, text-video correspondence, and task-specific accuracy at both the instance and model level.

- Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism [2.186901738997927] (arXiv, 2025-04-23)
  Tri-FusionNet is a novel image description generation model. It integrates a Vision Transformer (ViT) encoder module with a dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integration module. Results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.

- VTD-CLIP: Video-to-Text Discretization via Prompting CLIP [44.51452778561945] (arXiv, 2025-03-24)
  Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing approaches rely primarily on parameter-efficient fine-tuning of image-text pre-trained models. We propose a video-to-text discretization framework to address the limited interpretability and poor generalization caused by inadequate temporal modeling.

- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246] (arXiv, 2024-11-25)
  InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios. Our results demonstrate improvements in both visual fidelity and quantitative performance.

- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation [55.57459883629706] (arXiv, 2024-07-19)
  We conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.

- AIGCBench: Comprehensive Evaluation of Image-to-Video Content Generated by AI [1.1035305628305816] (arXiv, 2024-01-03)
  This paper introduces AIGCBench, a pioneering comprehensive benchmark designed to evaluate a variety of video generation tasks. It provides a varied and open-domain image-text dataset that evaluates different state-of-the-art algorithms under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced text-to-image models.

- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228] (arXiv, 2023-10-23)
  Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class-label representations. We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.

- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922] (arXiv, 2023-10-08)
  Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment. Video-Teller boosts training efficiency by utilizing frozen pre-trained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.

- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298] (arXiv, 2022-12-31)
  We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge. We present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.

- Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736] (arXiv, 2022-01-11)
  This paper studies a two-step alternative that first condenses the video sequence into an informative "frame". A valid question is how to define "useful information" and then distill it from a sequence down to one synthetic frame. IFS consistently demonstrates evident improvements on image-based 2D networks and clip-based 3D networks.

- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973] (arXiv, 2020-01-16)
  The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     