Beyond the Frame: Single and multiple video summarization method with user-defined length
- URL: http://arxiv.org/abs/2401.10254v1
- Date: Sat, 23 Dec 2023 04:32:07 GMT
- Title: Beyond the Frame: Single and multiple video summarization method with user-defined length
- Authors: Vahid Ahmadi Kalkhorani, Qingquan Zhang, Guanqun Song, Ting Zhu
- Abstract summary: Video summarization is a difficult but significant task, with substantial potential for further research and development.
In this paper, we combine a variety of NLP techniques (extractive and context-based summarizers) with video processing techniques to convert a long video into a single, relatively short video.
- Score: 4.424739166856966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video summarization is a crucial method for reducing the time needed to
watch or review a long video. This approach has become more important as the
amount of published video grows every day. A single video or multiple videos
can be summarized into a relatively short video using a variety of techniques,
from multimodal audio-visual methods to natural language processing approaches.
Audio-visual techniques can be used to recognize significant visual events and
pick the most important parts, while NLP techniques can be used to evaluate the
audio transcript and extract the main sentences (timestamps) and corresponding
video frames from the original video. Another approach is to use the best of
both domains, meaning that audio-visual cues as well as the video transcript
can be used to extract and summarize the video. In this paper, we combine a
variety of NLP techniques (extractive and context-based summarizers) with video
processing techniques to convert a long video into a single, relatively short
video. We design this tool so that the user can specify the relative length of
the summarized video. We have also explored ways of summarizing and
concatenating multiple videos into a single short video, which helps collect
the most important concepts on the same subject in one short video. Our
approach shows that video summarization is a difficult but significant task
with substantial potential for further research and development, and that it is
made possible by the development of NLP models.
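To make the transcript-driven part of this pipeline concrete, the following is a minimal, illustrative Python sketch, not the authors' released code. It assumes timestamped sentences have already been produced by a speech recognizer, scores them with a simple frequency-based extractive summarizer, and keeps the highest-scoring sentences until a user-defined fraction of the original duration is reached. All names here (TranscriptSegment, summarize_segments) are hypothetical; cutting and concatenating the selected (start, end) spans is left to a video tool such as ffmpeg.

```python
# Minimal sketch of transcript-driven video summarization with a
# user-defined length ratio. Illustrative only; the extractive scorer is a
# simple word-frequency stand-in for the NLP summarizers used in the paper.

from dataclasses import dataclass
from collections import Counter
import re


@dataclass
class TranscriptSegment:
    start: float   # seconds into the source video
    end: float
    text: str


def _score(segment: TranscriptSegment, word_freq: Counter) -> float:
    """Average corpus frequency of the sentence's words (extractive score)."""
    words = re.findall(r"\w+", segment.text.lower())
    if not words:
        return 0.0
    return sum(word_freq[w] for w in words) / len(words)


def summarize_segments(segments: list[TranscriptSegment],
                       length_ratio: float) -> list[TranscriptSegment]:
    """Keep the highest-scoring sentences until the retained duration
    reaches `length_ratio` (e.g. 0.2 = one fifth) of the original video."""
    word_freq = Counter(
        w for s in segments for w in re.findall(r"\w+", s.text.lower()))
    total = sum(s.end - s.start for s in segments)
    budget = length_ratio * total

    kept, used = [], 0.0
    for seg in sorted(segments, key=lambda s: _score(s, word_freq),
                      reverse=True):
        duration = seg.end - seg.start
        if used + duration > budget:
            continue
        kept.append(seg)
        used += duration

    # Restore chronological order so the cut clips play back coherently.
    return sorted(kept, key=lambda s: s.start)


if __name__ == "__main__":
    transcript = [
        TranscriptSegment(0.0, 4.0, "Today we introduce video summarization."),
        TranscriptSegment(4.0, 9.0, "Summarization reduces the time needed to review a long video."),
        TranscriptSegment(9.0, 15.0, "Unrelated small talk before the main content."),
        TranscriptSegment(15.0, 22.0, "We extract key sentences and map their timestamps back to video frames."),
    ]
    for seg in summarize_segments(transcript, length_ratio=0.5):
        # These (start, end) spans would then be cut from the source video
        # and concatenated, e.g. with ffmpeg.
        print(f"keep {seg.start:.1f}s-{seg.end:.1f}s: {seg.text}")
```

Under the same assumptions, the multi-video case described in the abstract amounts to running the same budgeted selection over the pooled transcripts of all input videos, then cutting each kept span from its own source file before concatenation.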
Related papers
- Step Differences in Instructional Video [34.551572600535565]
We propose an approach that generates visual instruction tuning data involving pairs of videos from HowTo100M.
We then train a video-conditioned language model to jointly reason across multiple raw videos.
Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos.
arXiv Detail & Related papers (2024-04-24T21:49:59Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Evaluated on an existing multi-modal video summarization dataset, experimental results show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - TL;DW? Summarizing Instructional Videos with Task Relevance &
Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z) - ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [103.28102473127748]
We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
arXiv Detail & Related papers (2022-04-06T14:43:42Z) - Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z) - Less is More: ClipBERT for Video-and-Language Learning via Sparse
Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates a neural model to learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z) - Straight to the Point: Fast-forwarding Videos via Reinforcement Learning
Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively remove frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)