Does Video Summarization Require Videos? Quantifying the Effectiveness
of Language in Video Summarization
- URL: http://arxiv.org/abs/2309.09405v1
- Date: Mon, 18 Sep 2023 00:08:49 GMT
- Title: Does Video Summarization Require Videos? Quantifying the Effectiveness
of Language in Video Summarization
- Authors: Yoonsoo Nam, Adam Lehavi, Daniel Yang, Digbalay Bose, Swabha
Swayamdipta, Shrikanth Narayanan
- Abstract summary: Video summarization remains a huge challenge in computer vision due to the size of the input videos to be summarized.
We propose an efficient, language-only video summarizer that achieves competitive accuracy with high data efficiency.
- Score: 37.09662541127891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video summarization remains a huge challenge in computer vision due to the
size of the input videos to be summarized. We propose an efficient,
language-only video summarizer that achieves competitive accuracy with high
data efficiency. Using only textual captions obtained via a zero-shot approach,
we train a language transformer model and forgo image representations. This
method allows us to perform filtration amongst the representative text vectors
and condense the sequence. With our approach, we gain explainability with
natural language that comes easily for human interpretation and textual
summaries of the videos. An ablation study that focuses on modality and data
compression shows that leveraging only the text modality effectively reduces
input data processing while retaining comparable results.
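The abstract mentions filtering the representative text vectors to condense the sequence but does not say how this is done. The following is only a minimal sketch of one possible approach, assuming each caption has already been embedded as a vector and that a caption is dropped when it is nearly identical (by cosine similarity) to the last caption kept. The names `cosine` and `condense`, the `threshold` value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def condense(vectors, threshold=0.9):
    """Return indices of the vectors to keep.

    Hypothetical rule: walk the sequence in order and keep a vector only if
    its similarity to the most recently kept vector is below `threshold`.
    """
    kept = []
    for i, vec in enumerate(vectors):
        if not kept or cosine(vectors[kept[-1]], vec) < threshold:
            kept.append(i)
    return kept


# Toy caption embeddings: frames 0-1 and 2-3 are near-duplicates.
caption_vectors = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
    [1.0, 0.1],
]

print(condense(caption_vectors))  # [0, 2, 4]
```

Comparing only against the last kept vector preserves temporal order and keeps the pass linear in sequence length; a real system might instead cluster the vectors or learn the filter.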
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video [22.60291297308379]
We investigate the feasibility of transforming the video summarization task into a Natural Language Processing (NLP) task.
Our method achieves state-of-the-art performance on the SumMe dataset in rank correlation coefficients.
arXiv Detail & Related papers (2024-05-14T18:07:04Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select frames that are not relevant to convey the information without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)