Hierarchical Multimodal Transformer to Summarize Videos
- URL: http://arxiv.org/abs/2109.10559v1
- Date: Wed, 22 Sep 2021 07:38:59 GMT
- Title: Hierarchical Multimodal Transformer to Summarize Videos
- Authors: Bin Zhao, Maoguo Gong, Xuelong Li
- Abstract summary: Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate audio and visual information, the two modalities are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
In practice, extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.
- Score: 103.47766795086206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although video summarization has achieved tremendous success benefiting from Recurrent Neural Networks (RNNs), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for this problem, and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization, which can capture the dependencies among frames and shots, and summarize the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as the Hierarchical Multimodal Transformer (HMT). In practice, extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.
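As a rough illustration of the architecture the abstract describes, the sketch below encodes frames within each shot and then shots within the video, in two streams (visual and audio) fused for per-frame scoring. All module names, dimensions, the mean-pooling step, and the fusion-by-concatenation choice are illustrative assumptions, not the HMT paper's exact design.

```python
# Minimal sketch of a hierarchical multimodal (audio + visual) transformer
# for video summarization. Layer counts and fusion are assumptions.
import torch
import torch.nn as nn


def _encoder(dim, heads, layers):
    layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)


class HierarchicalStream(nn.Module):
    """Frame-level transformer within each shot, then a shot-level transformer."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.frame_encoder = _encoder(dim, heads, layers)  # dependencies among frames
        self.shot_encoder = _encoder(dim, heads, layers)   # dependencies among shots

    def forward(self, shots):
        # shots: (num_shots, frames_per_shot, dim) for a single video
        frame_feats = self.frame_encoder(shots)            # per-shot frame context
        shot_tokens = frame_feats.mean(dim=1)              # pool frames -> shot token
        shot_feats = self.shot_encoder(shot_tokens.unsqueeze(0))  # scene-level context
        return frame_feats, shot_feats.squeeze(0)


class TwoStreamSummarizer(nn.Module):
    """Two-stream (visual + audio) encoding with a fusion head that predicts
    per-frame importance scores."""

    def __init__(self, dim=256):
        super().__init__()
        self.visual = HierarchicalStream(dim)
        self.audio = HierarchicalStream(dim)
        self.scorer = nn.Linear(2 * dim, 1)  # fuse streams by concatenation

    def forward(self, visual_shots, audio_shots):
        v_frames, _ = self.visual(visual_shots)
        a_frames, _ = self.audio(audio_shots)
        fused = torch.cat([v_frames, a_frames], dim=-1)
        return self.scorer(fused).squeeze(-1)  # (num_shots, frames_per_shot)


# Toy usage: 5 shots of 16 frames, 256-d visual and audio features per frame.
model = TwoStreamSummarizer()
scores = model(torch.randn(5, 16, 256), torch.randn(5, 16, 256))
print(scores.shape)  # torch.Size([5, 16])
```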
Related papers
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video at the frame level.
The MFST framework first extracts the features of each modality (visual, text, audio) using pretrained encoders.
It then trains a multimodal frame-scoring transformer that takes the video-text-audio representations as input and predicts frame-level scores.
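A minimal sketch of this frame-scoring pipeline, with the pretrained encoders stubbed out as random feature tensors; the dimensions, fusion by concatenation, and all names are assumptions rather than the MFST implementation.

```python
# Illustrative sketch of multimodal frame scoring: concatenate per-frame
# visual/text/audio features and let a transformer predict frame-level scores.
# Feature extractors are stubbed; MFST itself uses pretrained encoders.
import torch
import torch.nn as nn


class FrameScorer(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=384, aud_dim=128, dim=256):
        super().__init__()
        self.proj = nn.Linear(vis_dim + txt_dim + aud_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)  # frame-level importance score

    def forward(self, vis, txt, aud):
        # Each input: (batch, num_frames, modality_dim), assumed pre-aligned.
        x = self.proj(torch.cat([vis, txt, aud], dim=-1))
        return self.head(self.encoder(x)).squeeze(-1)  # (batch, num_frames)


scorer = FrameScorer()
scores = scorer(torch.randn(1, 120, 512), torch.randn(1, 120, 384),
                torch.randn(1, 120, 128))
print(scores.shape)  # torch.Size([1, 120])
```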
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of the Transformer to model long-range pixel correlations among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
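A toy sketch of such a hybrid block, with a convolutional branch supplying local features and a transformer branch capturing long-range dependencies; the addition-based aggregation and all hyperparameters are assumptions, not the paper's architecture.

```python
# Toy hybrid CNN-transformer block: convolutions for local texture, a
# transformer over pixel tokens for long-range context, summed together.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.local = nn.Sequential(  # local features via small convolutions
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        layer = nn.TransformerEncoderLayer(channels, heads, batch_first=True)
        self.globl = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (b, h*w, c) pixel tokens
        long_range = self.globl(tokens).transpose(1, 2).reshape(b, c, h, w)
        return x + self.local(x) + long_range          # aggregate both branches


block = HybridBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```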
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- Video Joint Modelling Based on Hierarchical Transformer for Co-summarization [0.0]
Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing.
Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos.
We propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization.
arXiv Detail & Related papers (2021-12-27T01:54:35Z)
- HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval [40.646628490887075]
We propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval.
HiT performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level to achieve multi-view, comprehensive retrieval results.
Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning, which enables large-scale negative interactions on the fly.
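A simplified sketch of the MoCo-style idea: a momentum-updated key encoder and a queue of past keys supply large numbers of negatives without large batches. The linear encoders, single-direction loss, and all names below are assumptions, not HiT's implementation.

```python
# Simplified MoCo-style cross-modal contrast: video queries are matched
# against momentum-encoded text keys plus a queue of past keys as negatives.
import copy
import torch
import torch.nn.functional as F

dim, queue_size, momentum, temp = 128, 1024, 0.999, 0.07
video_enc = torch.nn.Linear(512, dim)   # query encoder (video side), trained
text_enc = torch.nn.Linear(300, dim)    # text encoder; a symmetric
                                        # text-to-video loss (omitted) would train it
text_enc_m = copy.deepcopy(text_enc)    # momentum copy produces the keys
for p in text_enc_m.parameters():
    p.requires_grad = False
queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # past keys = negatives


def step(video_feats, text_feats):
    global queue
    q = F.normalize(video_enc(video_feats), dim=1)            # queries
    with torch.no_grad():
        # Momentum update: key params <- m * key params + (1 - m) * new params.
        for pk, pq in zip(text_enc_m.parameters(), text_enc.parameters()):
            pk.mul_(momentum).add_(pq, alpha=1 - momentum)
        k = F.normalize(text_enc_m(text_feats), dim=1)        # keys
    pos = (q * k).sum(dim=1, keepdim=True)                    # matched pairs
    neg = q @ queue.t()                                       # queued negatives
    logits = torch.cat([pos, neg], dim=1) / temp              # InfoNCE logits
    loss = F.cross_entropy(logits, torch.zeros(len(q), dtype=torch.long))
    queue = torch.cat([k, queue])[:queue_size]                # enqueue / dequeue
    return loss


print(step(torch.randn(8, 512), torch.randn(8, 300)))
```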
arXiv Detail & Related papers (2021-03-28T04:52:25Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
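A rough sketch of this hierarchical video+language encoding: a cross-modal transformer contextualizes each frame with its local subtitle tokens, and a temporal transformer then models the whole clip. Shapes, the frame-position pooling, and all names are assumptions, not HERO's design.

```python
# Sketch of hierarchical video+language encoding: local cross-modal fusion
# per frame, followed by a global temporal transformer over frame outputs.
import torch
import torch.nn as nn

dim = 256
cross_modal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)


def encode(frames, subtitles):
    # frames: (num_frames, dim); subtitles: (num_frames, tokens_per_frame, dim)
    local = torch.cat([frames.unsqueeze(1), subtitles], dim=1)  # frame + its text
    fused = cross_modal(local)[:, 0]      # keep the frame position's output
    return temporal(fused.unsqueeze(0))   # (1, num_frames, dim) global context


out = encode(torch.randn(30, dim), torch.randn(30, 8, dim))
print(out.shape)  # torch.Size([1, 30, 256])
```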
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.