T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
- URL: http://arxiv.org/abs/2411.19951v2
- Date: Mon, 02 Dec 2024 06:54:47 GMT
- Title: T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
- Authors: Shukang Yin, Chaoyou Fu, Sirui Zhao, Yunhang Shen, Chunjiang Ge, Yan Yang, Zuwei Long, Yuhan Dai, Tong Xu, Xing Sun, Ran He, Caifeng Shan, Enhong Chen
- Abstract summary: We develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus.
We find that the proposed scheme can boost the performance of long video understanding without training on long video samples.
- Score: 102.66246727371583
- Abstract: The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending this success to the video understanding realm. Apart from training from scratch, an efficient way is to utilize pre-trained image-LLMs, leading to two mainstream approaches, i.e., zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches yields an effective data augmentation method. We first make a deeper inspection of the zero-shot inference approach and identify two limitations, i.e., limited generalization and a lack of temporal understanding capabilities. We then investigate the fine-tuning approach and find low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples that enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets while training with just 15% of the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training on long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and the curation of high-quality data. The code is released at https://github.com/xjtupanda/T2Vid.
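The released repository is the authoritative implementation; as a rough illustration of the core idea (segmenting a long text so that each segment stands in for one "frame" of a video-like instruction sample), one might sketch the following. All function names, the segment size, and the use of raw text instead of rendered images are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical sketch of the T2Vid idea: a long text is split into segments,
# each segment treated as one pseudo-frame of a multi-image training sample.

def text_to_frames(long_text: str, words_per_frame: int = 50) -> list[str]:
    """Split a long text into fixed-size segments, one per pseudo-frame."""
    words = long_text.split()
    return [
        " ".join(words[i : i + words_per_frame])
        for i in range(0, len(words), words_per_frame)
    ]

def build_video_like_sample(long_text: str, question: str, answer: str) -> dict:
    """Package the segments as a multi-image instruction sample.

    In an actual pipeline each segment would be rendered to an image;
    here the raw text is kept as a stand-in.
    """
    return {
        "frames": text_to_frames(long_text),  # would be rendered images
        "instruction": question,
        "response": answer,
    }

sample = build_video_like_sample("word " * 120, "What is described?", "...")
print(len(sample["frames"]))  # 120 words at 50 per frame -> 3 pseudo-frames
```

Such synthesized samples would then be mixed with genuine video data to diversify the instruction corpus, per the abstract.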
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations.
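The coarse-recall-then-fine-rerank pattern described above can be sketched as follows. This is a generic illustration with plain dot-product similarities and random embeddings; the paper's actual text-gated interaction block (TIB) is more involved, and all names and dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=64)           # query (text) embedding
coarse = rng.normal(size=(1000, 64))     # coarse video representations
fine = rng.normal(size=(1000, 64))       # fine-grained video representations

# Stage 1: fast recall of the top-k candidates using coarse representations.
k = 20
coarse_scores = coarse @ text_emb
candidates = np.argpartition(-coarse_scores, k)[:k]

# Stage 2: rerank only those k candidates with fine-grained representations,
# so the expensive similarity is computed for k videos instead of all 1000.
fine_scores = fine[candidates] @ text_emb
ranked = candidates[np.argsort(-fine_scores)]
print(ranked[0])  # index of the best match after reranking
```

The design choice is the usual retrieval trade-off: the coarse pass bounds latency over the full corpus, while the fine pass recovers ranking quality on the short candidate list.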
arXiv Detail & Related papers (2024-01-01T08:54:18Z) - Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to some heavily pretrained video foundation models.
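The post-pretraining recipe summarized above (randomly dropping video patches and masking text tokens) can be sketched as below. The ratios, seeds, and function names are assumptions for illustration only.

```python
import random

def drop_patches(patches: list, keep_ratio: float = 0.5, seed: int = 0) -> list:
    """Randomly keep a fraction of input video patches, preserving order."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(patches)), int(len(patches) * keep_ratio)))
    return [patches[i] for i in kept]

def mask_text(tokens: list[str], mask_ratio: float = 0.15, seed: int = 0) -> list[str]:
    """Replace a random subset of text tokens with a [MASK] placeholder."""
    rng = random.Random(seed)
    return [t if rng.random() > mask_ratio else "[MASK]" for t in tokens]

patches = list(range(16))                    # stand-in for 16 video patches
print(len(drop_patches(patches)))            # 16 * 0.5 -> 8 patches kept
print(mask_text("a cat jumps over the fence".split()))
```

Dropping patches shrinks the visual token count fed to the model, which is what makes the post-pretraining cheap relative to training a video model from scratch.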
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.