Faster and Accurate Compressed Video Action Recognition Straight from
the Frequency Domain
- URL: http://arxiv.org/abs/2012.13726v1
- Date: Sat, 26 Dec 2020 12:43:53 GMT
- Title: Faster and Accurate Compressed Video Action Recognition Straight from
the Frequency Domain
- Authors: Samuel Felipe dos Santos and Jurandy Almeida
- Abstract summary: Deep learning has been successfully used to learn powerful and interpretable features for recognizing human actions in videos.
Most of the existing deep learning approaches have been designed for processing video information as RGB image sequences.
We propose a deep neural network capable of learning straight from compressed video.
- Score: 1.9214041945441434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition has become one of the most active fields of
research in computer vision due to its wide range of applications, such as
surveillance, healthcare, industrial environments, and smart homes. Recently, deep
learning has been successfully used to learn powerful and interpretable
features for recognizing human actions in videos. Most of the existing deep
learning approaches have been designed for processing video information as RGB
image sequences. For this reason, a preliminary decoding step is required,
since video data are often stored in a compressed format. However, decoding a
video demands a high computational load and memory usage. To
overcome this problem, we propose a deep neural network capable of learning
straight from compressed video. Our approach was evaluated on two public
benchmarks, the UCF-101 and HMDB-51 datasets, demonstrating recognition
performance comparable to state-of-the-art methods, with the advantage of
running up to 2 times faster in terms of inference speed.
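The abstract does not detail the architecture, but the title points to frequency-domain inputs. As a rough illustration of what "learning straight from compressed video" can look like, the sketch below converts frames into block-wise DCT coefficients, the representation codecs such as MPEG already store, so a network can consume frequency data without full RGB decoding. All names and shapes are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of frequency-domain input preparation (not the authors'
# exact pipeline): split each frame into 8x8 blocks and take their 2-D DCT,
# mimicking the coefficients a codec already stores.
import numpy as np
from scipy.fft import dctn

def frame_to_dct_blocks(frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Return per-block DCT coefficients for a single grayscale frame.

    frame: (H, W) array with H and W divisible by `block`.
    Output: (H // block, W // block, block * block) coefficient volume.
    """
    h, w = frame.shape
    out = np.empty((h // block, w // block, block * block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = dctn(frame[i:i + block, j:j + block], norm="ortho")
            out[i // block, j // block] = coeffs.ravel()
    return out

# Example: a stack of DCT volumes, one per frame, ready for a 3-D CNN.
frames = np.random.rand(16, 224, 224).astype(np.float32)  # stand-in clip
clip_coeffs = np.stack([frame_to_dct_blocks(f) for f in frames])
print(clip_coeffs.shape)  # (16, 28, 28, 64)
```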
Related papers
- DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy [22.871591373774802]
We introduce a cutting-edge video compression framework tailored for the age of ubiquitous video data.
Unlike traditional compression methods that prioritize human visual perception, our innovative approach focuses on preserving semantic information critical for deep learning accuracy.
Built on purpose-designed deep learning algorithms, it adeptly separates essential information from redundancy, ensuring that machine learning tasks are fed data of the highest relevance.
arXiv Detail & Related papers (2024-10-24T03:29:57Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
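As a toy illustration of the single-frame replay idea above, the sketch below keeps a fixed-size memory of individual frames. Reservoir sampling is an assumed replacement policy, not one stated in the paper, and FrameReplayBuffer is a hypothetical name.

```python
# Hypothetical single-frame replay buffer for video continual learning:
# store one frame at a time under a hard memory cap, using reservoir
# sampling so the buffer stays a uniform sample over everything seen.
import random

class FrameReplayBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.frames = []   # (frame, label) pairs
        self.seen = 0      # total frames offered so far

    def add(self, frame, label):
        self.seen += 1
        if len(self.frames) < self.capacity:
            self.frames.append((frame, label))
        else:
            # Replace an existing slot with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.frames[j] = (frame, label)

    def sample(self, k: int):
        return random.sample(self.frames, min(k, len(self.frames)))
```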
- Deep Neural Networks in Video Human Action Recognition: A Review [21.00217656391331]
Video behavior recognition is one of the most fundamental tasks in computer vision.
Deep neural networks are built for recognizing pixel-level information such as images with RGB, RGB-D, or optical flow formats.
In our review, deep neural networks surpass most other techniques in feature learning and extraction tasks.
arXiv Detail & Related papers (2023-05-25T03:54:41Z)
- High Speed Human Action Recognition using a Photonic Reservoir Computer [1.7403133838762443]
We introduce a new training method for the reservoir computer, based on "Timesteps Of Interest".
We solve the task with high accuracy and speed, to the point of processing multiple video streams in real time.
arXiv Detail & Related papers (2023-05-24T16:04:42Z)
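The photonic hardware above cannot be reproduced in a few lines, but the "Timesteps Of Interest" idea can be mimicked in software: run a fixed random reservoir (an echo-state-network stand-in) and fit the linear readout only on states at selected timesteps. Everything below, including the selection rule, is an assumption for illustration.

```python
# Software stand-in for "Timesteps Of Interest": a fixed random reservoir
# processes the whole sequence, but the readout is trained only on the
# reservoir states at selected timesteps.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 10, 200
W_in = rng.normal(scale=0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def reservoir_states(seq: np.ndarray) -> np.ndarray:
    """seq: (T, n_in) -> states: (T, n_res)."""
    x = np.zeros(n_res)
    states = []
    for u in seq:
        x = np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)

seq = rng.normal(size=(50, n_in))
target = rng.normal(size=(50,))       # stand-in per-timestep labels
S = reservoir_states(seq)
toi = np.arange(40, 50)               # placeholder "timesteps of interest"
readout, *_ = np.linalg.lstsq(S[toi], target[toi], rcond=None)
pred = S[toi] @ readout               # readout applied only at those steps
```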
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
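A rough sketch of what a parameter-free Temporal Concept Spotting step could look like, assuming CLIP-style frame and text embeddings: frames are weighted by their softmaxed similarity to the text feature. The temperature and shapes are guesses, not BIKE's actual values.

```python
# Parameter-free temporal saliency in the spirit of Temporal Concept
# Spotting: score each frame embedding against a text embedding and turn
# the scores into frame weights with a softmax. Random stand-in features.
import torch
import torch.nn.functional as F

frame_feats = F.normalize(torch.randn(16, 512), dim=-1)  # (frames, dim)
text_feat = F.normalize(torch.randn(512), dim=-1)        # class-name text

saliency = frame_feats @ text_feat              # cosine similarity per frame
weights = F.softmax(saliency / 0.07, dim=0)     # temperature is a guess
video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=0)
```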
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
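To make the "feed compressed videos to regular networks" idea above concrete, here is an invented minimal pairing of a learned encoder and a classifier that consumes only the latents; neither module reflects the paper's actual architecture.

```python
# Toy sketch of the "compressed vision" idea: a learned encoder turns
# frames into small latents once, and the recognition network consumes
# latents instead of pixels.
import torch
import torch.nn as nn

encoder = nn.Sequential(              # pixels -> compact latent grid
    nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
    nn.Conv2d(32, 8, 4, stride=4),    # 16x spatial reduction, 8 channels
)
classifier = nn.Sequential(           # operates on latents only
    nn.Conv2d(8, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 101),
)

frames = torch.randn(16, 3, 224, 224)        # one clip, frame-major
with torch.no_grad():
    latents = encoder(frames)                # (16, 8, 14, 14), stored once
logits = classifier(latents).mean(dim=0)     # average frame logits -> clip
```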
- Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are the two most common problems faced by video processing algorithms.
This paper presents an approach for using residual data, available directly in compressed videos, which can be obtained by a lightweight partial decoding procedure.
Applying neural networks exclusively to accumulated residuals in the compressed domain accelerates processing, while the classification results remain highly competitive with raw-video approaches.
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
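A small numpy sketch of what dynamic accumulation of residuals could look like: sum consecutive residual frames from a partial decode and emit the accumulation once enough motion energy has built up. The energy-threshold rule is an assumption, not the paper's exact criterion.

```python
# Illustrative accumulation of residual frames from a compressed stream:
# keep summing residuals and emit the accumulation when its mean absolute
# magnitude (a stand-in for "motion energy") crosses a threshold.
import numpy as np

def accumulate_residuals(residuals, energy_thresh=2.0):
    """residuals: iterable of (H, W) arrays from a partial decode."""
    acc, outputs = None, []
    for r in residuals:
        acc = r.copy() if acc is None else acc + r
        if np.abs(acc).mean() >= energy_thresh:  # enough accumulated motion
            outputs.append(acc)                  # feed this to the network
            acc = None
    return outputs

stream = [np.random.randn(224, 224) for _ in range(30)]  # stand-in residuals
chunks = accumulate_residuals(stream)
```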
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
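A minimal sketch of the separate-encoder contrastive setup described above, assuming precomputed text and video embeddings; the symmetric InfoNCE loss is a standard choice and may differ from ReLoCLNet's exact objective.

```python
# Separate-encoder retrieval sketch: text and video are embedded
# independently, and a symmetric InfoNCE loss pulls matched pairs together.
# Encoders are elided; the embeddings below are random stand-ins.
import torch
import torch.nn.functional as F

video_emb = F.normalize(torch.randn(32, 256), dim=-1)  # batch of videos
text_emb = F.normalize(torch.randn(32, 256), dim=-1)   # matching queries

logits = text_emb @ video_emb.t() / 0.07               # similarity matrix
labels = torch.arange(32)                              # i-th text <-> i-th video
loss = (F.cross_entropy(logits, labels) +
        F.cross_entropy(logits.t(), labels)) / 2
```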
- Video Action Recognition Using Spatio-temporal Optical Flow Video Frames [0.0]
There are many problems associated with recognizing human actions in videos.
This paper focuses on spatial and temporal pattern recognition for the classification of videos using deep neural networks.
The final recognition accuracy was about 94%.
arXiv Detail & Related papers (2021-02-05T19:46:49Z)
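As an illustration of building spatio-temporal optical flow inputs, the sketch below computes dense Farneback flow between consecutive frames with OpenCV and stacks the (dx, dy) fields; parameter values are common defaults, not the paper's settings.

```python
# Build an optical-flow input stack: dense flow between consecutive
# grayscale frames, stacked so the (dx, dy) fields can serve as channels
# for a spatio-temporal classifier.
import cv2
import numpy as np

def flow_stack(gray_frames):
    """gray_frames: list of (H, W) uint8 frames -> (N-1, H, W, 2) flow."""
    flows = []
    for prev, nxt in zip(gray_frames, gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)            # (H, W, 2): per-pixel (dx, dy)
    return np.stack(flows)

frames = [np.random.randint(0, 256, (120, 160), np.uint8) for _ in range(8)]
stack = flow_stack(frames)            # ready to feed a spatio-temporal CNN
```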
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2 times faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)