Semi-Parametric Video-Grounded Text Generation
- URL: http://arxiv.org/abs/2301.11507v1
- Date: Fri, 27 Jan 2023 03:00:43 GMT
- Title: Semi-Parametric Video-Grounded Text Generation
- Authors: Sungdong Kim, Jin-Hwa Kim, Jiyoung Lee, Minjoon Seo
- Abstract summary: In this paper, we propose a semi-parametric video-grounded text generation model, SeViT.
Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames.
Experimental results demonstrate that our method has a significant advantage on longer videos and in causal video understanding.
- Score: 21.506377836451577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient video-language modeling should consider the computational cost, because the number of video frames is large and sometimes intractable. Parametric approaches such as the attention mechanism may not be ideal, since their computational cost grows quadratically with video length. Instead, previous studies have relied on offline feature extraction or frame sampling to represent videos efficiently, focusing on cross-modal modeling of short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, which offers a novel perspective on scaling video-language modeling to long, untrimmed videos. Treating a video as an external data store, SeViT combines a non-parametric frame retriever, which selects a few query-relevant frames from the data store, with a parametric generator that aggregates the retrieved frames and the query via late fusion. Experimental results demonstrate that our method has a significant advantage on longer videos and in causal video understanding. Moreover, our model achieves a new state of the art on four video-language datasets: iVQA (+4.8), NExT-QA (+6.9), and ActivityNet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
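The abstract describes two stages: a non-parametric retriever over frame embeddings, then a parametric generator whose per-frame outputs are aggregated by late fusion. Below is a minimal sketch of that pipeline, assuming precomputed frame and query embeddings (e.g., from a CLIP-style encoder), top-k cosine similarity for retrieval, and averaged answer distributions as one possible late-fusion rule; all names are illustrative assumptions, not SeViT's actual interface.

```python
# Minimal sketch of a retrieve-then-fuse pipeline in the spirit of SeViT.
# retrieve_frames / late_fusion and the averaging rule are illustrative
# assumptions, not the paper's actual API.
import numpy as np

def retrieve_frames(frame_embs: np.ndarray, query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Non-parametric retriever: cosine similarity, return top-k frame indices."""
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    scores = frames @ query                       # (num_frames,)
    return np.argsort(-scores)[:k]

def late_fusion(per_frame_logits: np.ndarray) -> np.ndarray:
    """One simple late-fusion rule: average the per-frame answer distributions."""
    shifted = per_frame_logits - per_frame_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax per retrieved frame
    return probs.mean(axis=0)                     # fused distribution over answers

# Toy usage: 300 frames, 512-dim embeddings, 1000-way answer vocabulary.
rng = np.random.default_rng(0)
frame_embs = rng.normal(size=(300, 512))
query_emb = rng.normal(size=512)
top_k = retrieve_frames(frame_embs, query_emb, k=5)
per_frame_logits = rng.normal(size=(len(top_k), 1000))  # stand-in for the generator
answer_id = int(late_fusion(per_frame_logits).argmax())
```

Because the retriever is non-parametric, its cost grows only linearly with video length, which is what makes the design attractive for long, untrimmed videos.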
Related papers
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
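Spatial and temporal self-attention layers, as mentioned in the DiT summary above, are commonly factorized: attend within each frame, then across frames per patch. A minimal PyTorch sketch of that general pattern, with toy dimensions; this is not xGen-VideoSyn-1's released code.

```python
# Factorized spatio-temporal self-attention block: a common video-transformer
# pattern, sketched here with assumed (toy) dimensions.
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention per patch."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape                      # (batch, frames, patches, dim)
        s = x.reshape(b * t, p, d)                # fold time into batch: attend within frames
        h = self.norm1(s)
        s = s + self.spatial(h, h, h, need_weights=False)[0]
        s = s.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm2(s)
        s = s + self.temporal(h, h, h, need_weights=False)[0]  # attend across frames per patch
        return s.reshape(b, p, t, d).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 16, 256)                    # toy clip: 8 frames of 16 patches
assert FactorizedSTBlock()(x).shape == x.shape
```

Factorizing the two axes keeps attention cost at O(t·p² + p·t²) rather than O((t·p)²), which is why the pattern scales to longer clips.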
- VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization [115.64739269488965]
VimTS enhances the generalization ability of the model by achieving better synergy among different tasks.
We propose a synthetic video text dataset (VTD-368k) by leveraging the Content Deformation Fields (CoDeF) algorithm.
For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2.
arXiv Detail & Related papers (2024-04-30T15:49:03Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, crossmodal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
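The three objectives named in the InternVideo2 summary are standard enough to sketch as one weighted loss. The shapes, weights, and the symmetric InfoNCE form below are assumptions for illustration, not the paper's published recipe.

```python
# Illustrative weighted combination of masked video modeling, video-text
# contrastive learning, and next-token prediction. All shapes and weights
# are assumptions, not InternVideo2's actual training configuration.
import torch
import torch.nn.functional as F

def combined_loss(pred_patches, target_patches, mask,
                  video_emb, text_emb, lm_logits, lm_targets,
                  w_mvm=1.0, w_con=1.0, w_ntp=1.0, temperature=0.07):
    # Masked video modeling: regress only the masked patches.
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(-1)
    mvm = (per_patch * mask).sum() / mask.sum().clamp(min=1)
    # Symmetric InfoNCE between pooled video and text embeddings.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(len(v), device=v.device)
    con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    # Next-token prediction over caption tokens.
    ntp = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    return w_mvm * mvm + w_con * con + w_ntp * ntp

# Toy usage: batch 4, 8 patches, 32-dim features, 6 caption tokens.
B, N, D, L, V = 4, 8, 32, 6, 100
loss = combined_loss(torch.randn(B, N, D), torch.randn(B, N, D),
                     torch.randint(0, 2, (B, N)).float(),
                     torch.randn(B, D), torch.randn(B, D),
                     torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```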
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
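"Shallow temporal fusion" in the entry above typically means encoding frames independently with a pretrained image-text tower and fusing them with one light temporal layer. A toy PyTorch sketch, with a linear stub standing in for the image encoder; all sizes are illustrative.

```python
# Shallow temporal fusion sketch: independent per-frame encoding plus a
# single temporal layer. The encoder stub and dimensions are assumptions.
import torch
import torch.nn as nn

class ShallowTemporalFusion(nn.Module):
    """Frames encoded independently, then fused by one temporal layer."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Stand-in for a frozen image-text tower (e.g., a CLIP image encoder).
        self.image_encoder = nn.Linear(3 * 64 * 64, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)  # the "shallow" fusion

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 64, 64). Per-frame cost is fixed, but
        # activations grow linearly with `time`: the memory bottleneck
        # the summary above points out.
        feats = self.image_encoder(frames.flatten(2))  # (batch, time, dim)
        return self.temporal(feats).mean(dim=1)        # pooled video embedding

video = torch.randn(2, 16, 3, 64, 64)                  # 16 frames per clip
assert ShallowTemporalFusion()(video).shape == (2, 512)
```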
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and the query.
Evaluated on an existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Existing video generation models can only produce clips that are short relative to the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video at the frame level.
MFST first extracts the features of each modality (visual, text, audio) using pretrained encoders.
It then trains a multimodal frame-scoring transformer that takes the video-text-audio representations as input and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
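The MFST description above (pretrained per-modality encoders, a transformer over fused per-frame features, frame-level scores) can be sketched directly. Dimensions and fusion-by-concatenation are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch of a multimodal frame-scoring transformer in the spirit of
# the MFST summary; feature sizes and concatenation fusion are assumptions.
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, d_vis=512, d_txt=256, d_aud=128, dim=512, heads=8):
        super().__init__()
        self.proj = nn.Linear(d_vis + d_txt + d_aud, dim)  # fuse the three modalities
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)                       # one score per frame

    def forward(self, vis, txt, aud):
        # Each input: (batch, frames, d_modality), from pretrained encoders.
        x = self.proj(torch.cat([vis, txt, aud], dim=-1))
        return self.head(self.encoder(x)).squeeze(-1)       # (batch, frames) scores

scores = FrameScorer()(torch.randn(2, 120, 512),
                       torch.randn(2, 120, 256),
                       torch.randn(2, 120, 128))
assert scores.shape == (2, 120)
```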
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.