TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial
Decoding
- URL: http://arxiv.org/abs/2110.08814v1
- Date: Sun, 17 Oct 2021 12:56:03 GMT
- Title: TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial
Decoding
- Authors: Zhengwei Wang, Qi She, Aljosa Smolic
- Abstract summary: Video compression reduces superfluous information by representing the raw video stream using the concept of a Group of Pictures (GOP).
In this work, we introduce GOP-level sampling of the network input from partially decoded videos.
We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only.
- Score: 22.12530692711095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing video action recognition models ingest raw RGB frames.
However, the raw video stream requires enormous storage and contains
significant temporal redundancy. Video compression (e.g., H.264, MPEG-4)
reduces this superfluous information by representing the raw video stream as
Groups of Pictures (GOPs). Each GOP is composed of a leading I-frame (i.e., an
RGB image) followed by a number of P-frames, represented by motion vectors and
residuals, which can be regarded and used as pre-extracted features. In this
work, we 1) introduce GOP-level sampling of the network input from partially
decoded videos, and 2) propose a plug-and-play mulTi-modal lEArning Module
(TEAM) for training the network with information from I-frames and P-frames in
an end-to-end manner. We demonstrate the superior performance of TEAM-Net
compared to a baseline using RGB only. TEAM-Net also achieves state-of-the-art
performance in video action recognition with partial decoding. Code is provided
at https://github.com/villawang/TEAM-Net.
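To make the GOP-level input concrete, below is a minimal PyTorch sketch of a multi-modal fusion module that ingests a decoded I-frame together with the motion vectors and residuals of its P-frames. The module name, layer sizes, and channel-attention fusion are illustrative assumptions, not the authors' TEAM implementation; the official code is at the repository linked above.

```python
# Minimal sketch (not the official TEAM-Net code) of fusing GOP-level inputs:
# a decoded I-frame plus the motion vectors and residuals of its P-frames.
# All names, channel counts, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class GOPFusionSketch(nn.Module):
    """Fuse I-frame, motion-vector, and residual features from one GOP."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Separate lightweight stems for each modality of a partially decoded GOP:
        # RGB I-frame (3 ch), motion vectors (2 ch), residuals (3 ch).
        self.i_stem = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.mv_stem = nn.Conv2d(2, feat_dim, kernel_size=3, padding=1)
        self.res_stem = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # Channel attention over the concatenated modalities (an assumed fusion
        # choice; the paper's TEAM module may differ).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3 * feat_dim, 3 * feat_dim, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(3 * feat_dim, feat_dim, kernel_size=1)

    def forward(self, i_frame, motion_vec, residual):
        feats = torch.cat(
            [self.i_stem(i_frame), self.mv_stem(motion_vec), self.res_stem(residual)],
            dim=1,
        )
        fused = feats * self.attn(feats)   # re-weight channels per modality
        return self.proj(fused)            # shared representation for the backbone


# One GOP-level sample: a single decoded I-frame and P-frame side information.
i_frame = torch.randn(1, 3, 224, 224)      # decoded I-frame (RGB)
motion_vec = torch.randn(1, 2, 224, 224)   # motion vectors (dx, dy)
residual = torch.randn(1, 3, 224, 224)     # residuals
out = GOPFusionSketch()(i_frame, motion_vec, residual)
print(out.shape)  # torch.Size([1, 64, 224, 224])
```

A GOP-level sampler would feed such a module one partially decoded GOP at a time (the I-frame plus P-frame side information), avoiding the cost of fully decoding every RGB frame.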
Related papers
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach that explores multi-modality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS).
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset (ViDSOD-100), which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - Local Compressed Video Stream Learning for Generic Event Boundary
Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before being fed into the network.
We propose a novel event boundary detection method that is fully end-to-end, leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z) - VNVC: A Versatile Neural Video Coding Framework for Efficient
Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have a high computational bottleneck, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we introduce two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - INR-V: A Continuous Representation Space for Video-based Generative
Tasks [43.245717657048296]
We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks.
The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works.
arXiv Detail & Related papers (2022-10-29T11:54:58Z) - Multi-Attention Network for Compressed Video Referring Object
Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression.
Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation.
This may hamper their application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z) - End-to-End Compressed Video Representation Learning for Generic Event
Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning method for event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs.
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences (see the sketch below).
arXiv Detail & Related papers (2022-03-29T08:27:48Z)
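The last entry above outlines a concrete compressed-domain pipeline: a ConvNet for I-frame features, a light-weight encoder for P-frame side information, and a temporal contrastive module that locates event boundaries. Below is a minimal sketch of that general shape; the layer sizes, the 5-channel P-frame input (motion vectors plus residuals), and the cosine-similarity boundary score are assumptions for illustration, not the paper's actual design.

```python
# Illustrative sketch of a compressed-domain event boundary pipeline:
# I-frame ConvNet + lightweight P-frame encoder + contrastive-style boundary
# scoring. All shapes and encoders are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 128
i_frame_encoder = nn.Sequential(            # heavier ConvNet for decoded I-frames
    nn.Conv2d(3, feat_dim, 7, stride=4, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
p_frame_encoder = nn.Sequential(            # light encoder for MV (2ch) + residual (3ch)
    nn.Conv2d(5, feat_dim, 3, stride=4, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# One GOP: 1 I-frame and 11 P-frames (motion vectors and residuals stacked).
i_frame = torch.randn(1, 3, 224, 224)
p_frames = torch.randn(11, 5, 224, 224)

# Per-frame features along the GOP, ordered in time.
feats = torch.cat([i_frame_encoder(i_frame), p_frame_encoder(p_frames)], dim=0)

# A simple contrastive-style boundary cue: low similarity between adjacent
# frame features suggests a candidate event boundary.
sim = F.cosine_similarity(feats[:-1], feats[1:], dim=1)
boundary_score = 1.0 - sim
print(boundary_score.shape)  # torch.Size([11])
```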