Condensing a Sequence to One Informative Frame for Video Recognition
- URL: http://arxiv.org/abs/2201.04022v1
- Date: Tue, 11 Jan 2022 16:13:43 GMT
- Title: Condensing a Sequence to One Informative Frame for Video Recognition
- Authors: Zhaofan Qiu and Ting Yao and Yan Shu and Chong-Wah Ngo and Tao Mei
- Abstract summary: This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits an off-the-shelf image recognition system on the synthetic frame.
A valid question is how to define "useful information" and then distill it from a video sequence down to one synthetic frame.
IFS consistently demonstrates evident improvements on both image-based 2D networks and clip-based 3D networks.
- Score: 113.3056598548736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video is complex due to large variations in motion and rich content in
fine-grained visual details. Abstracting useful information from such
information-intensive media demands extensive computing resources. This paper
studies a two-step alternative that first condenses the video sequence to an
informative "frame" and then exploits off-the-shelf image recognition system on
the synthetic frame. A valid question is how to define "useful information" and
then distill it from a video sequence down to one synthetic frame. This paper
presents a novel Informative Frame Synthesis (IFS) architecture that
incorporates three objective tasks, i.e., appearance reconstruction, video
categorization, and motion estimation, and two regularizers, i.e., adversarial
learning and color consistency. Each task equips the synthetic frame with one
ability, while each regularizer enhances its visual quality. With these, by
jointly learning the frame synthesis in an end-to-end manner, the generated
frame is expected to encapsulate the required spatio-temporal information
useful for video analysis. Extensive experiments are conducted on the
large-scale Kinetics dataset. When compared to baseline methods that map a video
sequence to a single image, IFS shows superior performance. More remarkably,
IFS consistently demonstrates evident improvements on both image-based 2D networks
and clip-based 3D networks, and achieves performance comparable to
state-of-the-art methods at a lower computational cost.
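The abstract names three objective tasks and two regularizers that are jointly optimized end-to-end. As a rough illustration of how such a multi-task objective could be combined, here is a minimal PyTorch sketch; the network, tensor shapes, and loss weights are assumptions for illustration only, since the paper's exact architecture and hyper-parameters are not given in this listing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFrameSynthesizer(nn.Module):
    """Hypothetical condenser: maps a clip (B, T, 3, H, W) to one synthetic frame (B, 3, H, W)."""
    def __init__(self, num_frames=8):
        super().__init__()
        self.fuse = nn.Conv2d(3 * num_frames, 3, kernel_size=3, padding=1)

    def forward(self, clip):
        b, t, c, h, w = clip.shape
        return torch.sigmoid(self.fuse(clip.reshape(b, t * c, h, w)))

def ifs_objective(frame, clip, logits, labels, flow_pred, flow_gt, disc_score):
    """Sum of the three objective tasks and two regularizers listed in the abstract (illustrative form)."""
    # 1) appearance reconstruction: the synthetic frame should preserve the clip's appearance
    l_app = F.l1_loss(frame, clip.mean(dim=1))
    # 2) video categorization: a classifier on the synthetic frame should predict the action label
    l_cls = F.cross_entropy(logits, labels)
    # 3) motion estimation: motion decoded from the frame should match a reference flow
    l_mot = F.l1_loss(flow_pred, flow_gt)
    # regularizer A: adversarial learning (generator-side loss against a real/synthetic discriminator)
    l_adv = F.binary_cross_entropy_with_logits(disc_score, torch.ones_like(disc_score))
    # regularizer B: color consistency with the clip's per-channel color statistics
    l_col = F.mse_loss(frame.mean(dim=(2, 3)), clip.mean(dim=(1, 3, 4)))
    # illustrative weights only; the paper's actual weighting is not specified here
    return l_app + l_cls + 0.5 * l_mot + 0.1 * l_adv + 0.1 * l_col

# Toy usage: condense an 8-frame clip batch into one synthetic frame per clip.
clip = torch.rand(2, 8, 3, 112, 112)
frame = TinyFrameSynthesizer(num_frames=8)(clip)  # shape (2, 3, 112, 112)
```

At inference time only the condenser would be kept: a clip is mapped to one synthetic frame, which is then fed to an off-the-shelf 2D image recognition network, matching the two-step pipeline described above.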
Related papers
- FusionFrames: Efficient Architectural Aspects for Text-to-Video
Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z) - Three-Stage Cascade Framework for Blurry Video Frame Interpolation [23.38547327916875]
Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos.
BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance.
We propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos.
arXiv Detail & Related papers (2023-10-09T03:37:30Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for
Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z) - A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z) - Street-view Panoramic Video Synthesis from a Single Satellite Image [92.26826861266784]
We present a novel method for synthesizing both temporally and geometrically consistent street-view panoramic video.
Existing cross-view synthesis approaches focus more on images, while video synthesis in such a case has not yet received enough attention.
arXiv Detail & Related papers (2020-12-11T20:22:38Z) - Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)