Multi-modal Video Chapter Generation
- URL: http://arxiv.org/abs/2209.12694v1
- Date: Mon, 26 Sep 2022 13:44:48 GMT
- Title: Multi-modal Video Chapter Generation
- Authors: Xiao Cao, Zitan Chen, Canyu Le, Lei Meng
- Abstract summary: We introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information.
Our data collection procedure is fast, scalable and does not require any additional manual annotation.
Our experiments demonstrate that the proposed framework achieves superior results over existing methods.
- Score: 11.658507807110645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chapter generation has become a practical technique for online videos.
Chapter breakpoints enable users to quickly find the parts they want and get
summative annotations. However, there is no public method or dataset for this
task. To facilitate research in this direction, we introduce a new dataset
called Chapter-Gen, which consists of approximately 10k user-generated videos
with annotated chapter information. Our data collection procedure is fast,
scalable, and does not require any additional manual annotation. On top of this
dataset, we design an effective baseline specifically for the video chapter
generation task, which captures two aspects of a video: visual dynamics and
narration text. It disentangles local and global video features for
localization and title generation, respectively. To parse long videos
efficiently, a skip sliding window mechanism is designed to localize potential
chapters, and a cross-attention multi-modal fusion module is developed to
aggregate local features for title generation. Our experiments demonstrate that
the proposed framework achieves superior results over existing methods, which
illustrates that methods designed for similar tasks cannot be transferred
directly, even after fine-tuning. Code and dataset are available at
https://github.com/czt117/MVCG.
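To make the two components named in the abstract more concrete, the sketch below illustrates a skip sliding window that enumerates candidate chapter spans and a cross-attention fusion block that lets narration-text tokens attend to the visual features of one window. All names, dimensions, and shapes here are illustrative assumptions, not the authors' released implementation; refer to the repository above for the official code.

```python
# Minimal sketch (assumed shapes and names, NOT the official MVCG code):
# enumerate candidate chapter windows over a long video, then fuse narration
# text with the visual features of a window via cross attention.
import torch
import torch.nn as nn


def skip_sliding_windows(num_frames: int, window: int = 64, stride: int = 32):
    """Enumerate overlapping candidate windows over a long frame sequence;
    a boundary head would score each window for a potential chapter break."""
    return [(start, min(start + window, num_frames))
            for start in range(0, num_frames, stride)]


class CrossModalFusion(nn.Module):
    """Text tokens (queries) attend over visual frame features (keys/values)
    inside one candidate window, yielding fused features for title generation."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_feats, visual_feats):
        # text_feats: (B, T_text, dim); visual_feats: (B, T_frames, dim)
        attended, _ = self.cross_attn(text_feats, visual_feats, visual_feats)
        x = self.norm(text_feats + attended)
        return x + self.ffn(x)


if __name__ == "__main__":
    windows = skip_sliding_windows(num_frames=300)   # 10 candidate windows
    fusion = CrossModalFusion()
    text = torch.randn(1, 20, 512)                   # narration token features
    frames = torch.randn(1, 64, 512)                 # frame features of one window
    fused = fusion(text, frames)                     # (1, 20, 512)
    print(len(windows), fused.shape)
```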
Related papers
- PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation [15.9587266448337]
Video instance segmentation requires detecting, segmenting, and tracking objects in videos.
This paper introduces a method that eliminates video annotations by utilizing image datasets.
arXiv Detail & Related papers (2024-06-28T05:22:39Z)
- VidChapters-7M: Video Chapters at Scale [110.19323390486775]
We present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total.
VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters.
We show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings.
arXiv Detail & Related papers (2023-09-25T08:38:11Z)
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Visual Subtitle Feature Enhanced Video Outline Generation [23.831220964676973]
We introduce a novel video understanding task, namely video outline generation (VOG).
To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG.
We propose a Visual Subtitle feature Enhanced video outline generation model (VSENet).
arXiv Detail & Related papers (2022-08-24T05:26:26Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select frames that are not relevant to convey the information without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)