MAMS: Model-Agnostic Module Selection Framework for Video Captioning
- URL: http://arxiv.org/abs/2501.18269v1
- Date: Thu, 30 Jan 2025 11:10:18 GMT
- Title: MAMS: Model-Agnostic Module Selection Framework for Video Captioning
- Authors: Sangho Lee, Il Yong Chun, Hogun Park
- Abstract summary: Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges.
This paper proposes the first model-agnostic module selection framework in video captioning.
Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
- Score: 11.442879458679144
- Abstract: Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When only a limited number of frames are extracted, important frames carrying essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes many consecutive frames, potentially causing redundancy among the visual tokens extracted from them. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework for video captioning, which has two main functions: (1) selecting a caption generation module of an appropriate size based on the visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
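The two functions described in the abstract, module selection and token-subset construction, plus the adaptive attention masking can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration only: the function name `select_caption_module`, the norm-based saliency heuristic, and the threshold/budget values are illustrative assumptions, not the paper's actual method.

```python
import torch

def select_caption_module(visual_tokens, small_module, large_module,
                          saliency_threshold=0.5, token_budget=64):
    """Choose a caption-generation module sized to the video's content and
    build the visual-token subset it will consume.

    visual_tokens: (num_tokens, dim) tensor of tokens from the sampled frames.
    """
    # Stand-in saliency: L2 norm of each token, min-max normalized to [0, 1].
    saliency = visual_tokens.norm(dim=-1)
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-6)

    # (1) Module selection: pick the larger captioner only when many tokens are salient.
    num_salient = int((saliency > saliency_threshold).sum())
    module = large_module if num_salient > token_budget else small_module

    # (2) Token-subset construction: keep the top-k salient tokens to reduce
    # redundancy coming from consecutive frames.
    k = max(min(num_salient, visual_tokens.size(0)), 1)
    keep = saliency.topk(k).indices.sort().values
    subset = visual_tokens[keep]

    # Adaptive attention bias: an additive (log-space) mask that boosts
    # attention on the more salient tokens.
    attention_bias = torch.log(saliency[keep].clamp_min(1e-6))
    return module, subset, attention_bias
```

In the paper the selection and masking are presumably driven by the captioning models themselves; the norm-based saliency above is only a placeholder to make the control flow concrete.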
Related papers
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z) - VidCtx: Context-aware Video Question Answering with Image Models [15.1350316858766]
We introduce VidCtx, a novel training-free VideoQA framework which integrates both visual information from input frames and textual descriptions of other frames.
Experiments show that VidCtx achieves competitive performance among approaches that rely on open models.
arXiv Detail & Related papers (2024-12-23T09:26:38Z) - Multimodal Contextualized Support for Enhancing Video Retrieval System [0.0]
We propose a system that integrates a novel retrieval pipeline which extracts multimodal data and incorporates information from multiple frames within a video.
The pipeline captures latent meanings, focusing on what can be inferred from the video clip rather than just on object detection in a single image.
arXiv Detail & Related papers (2024-12-10T15:20:23Z) - Exploring the Design Space of Visual Context Representation in Video MLLMs [102.11582556690388]
Video Multimodal Large Language Models (MLLMs) have shown remarkable capability in understanding video semantics across various downstream tasks.
Visual context representation refers to the scheme to select frames from a video and further select the tokens from a frame.
In this paper, we explore the design space for visual context representation, and aim to improve the performance of video MLLMs by finding more effective representation schemes.
arXiv Detail & Related papers (2024-10-17T15:59:52Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and dataset to train and evaluate the task.
The target task involves summarizing a given video into a number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z) - VideoBooth: Diffusion-based Video Generation with Image Prompts [130.47771531413375]
We propose a feed-forward framework for video generation with image prompts.
VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts.
arXiv Detail & Related papers (2023-12-01T18:55:40Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
We propose a Multimodal Frame-Scoring Transformer (MFST) framework that exploits visual, text, and audio features to score a video with respect to its frames.
The MFST framework first extracts the features of each modality (visual, text, and audio) using pretrained encoders.
It then trains a multimodal frame-scoring transformer that takes the video, text, and audio representations as inputs and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z) - SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)