DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video
Summarization
- URL: http://arxiv.org/abs/2105.06441v1
- Date: Thu, 13 May 2021 17:33:26 GMT
- Title: DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video
Summarization
- Authors: Safa Messaoud, Ismini Lourentzou, Assma Boughoula, Mona Zehni, Zhizhen
Zhao, Chengxiang Zhai, Alexander G. Schwing
- Abstract summary: We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
- Score: 127.16984421969529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent growth of web video sharing platforms has increased the demand for
systems that can efficiently browse, retrieve and summarize video content.
Query-aware multi-video summarization is a promising technique that caters to
this demand. In this work, we introduce a novel Query-Aware Hierarchical
Pointer Network for Multi-Video Summarization, termed DeepQAMVS, that jointly
optimizes multiple criteria: (1) conciseness, (2) representativeness of
important query-relevant events and (3) chronological soundness. We design a
hierarchical attention model that factorizes over three distributions, each
collecting evidence from a different modality, followed by a pointer network
that selects frames to include in the summary. DeepQAMVS is trained with
reinforcement learning, incorporating rewards that capture representativeness,
diversity, query-adaptability and temporal coherence. We achieve
state-of-the-art results on the MVS1K dataset, with inference time scaling
linearly with the number of input video frames.
Related papers
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Multilevel Hierarchical Network with Multiscale Sampling for Video
Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR)
With a multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for
Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z) - CONQUER: Contextual Query-aware Ranking for Video Corpus Moment
Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
We propose a novel model for effective moment localization and ranking.
We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects translating into a new set of objects that encode relations of the inputs.
CRN is then applied for Video QA in two forms, short-form where answers are reasoned solely from the visual content, and long-form where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z) - Convolutional Hierarchical Attention Network for Query-Focused Video
Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: feature encoding network and query-relevance computing module.
In the encoding network, we employ a convolutional network with local self-attention mechanism and query-aware global attention mechanism to learns visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.