Temporal Stochastic Softmax for 3D CNNs: An Application in Facial
Expression Recognition
- URL: http://arxiv.org/abs/2011.05227v1
- Date: Tue, 10 Nov 2020 16:40:00 GMT
- Title: Temporal Stochastic Softmax for 3D CNNs: An Application in Facial
Expression Recognition
- Authors: Théo Ayral, Marco Pedersoli, Simon Bacon and Eric Granger
- Abstract summary: We present a strategy for efficient video-based training of 3D CNNs.
It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips.
- Score: 11.517316695930596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep learning models for accurate spatiotemporal recognition of
facial expressions in videos requires significant computational resources. For
practical reasons, 3D Convolutional Neural Networks (3D CNNs) are usually
trained with relatively short clips randomly extracted from videos. However,
such uniform sampling is generally sub-optimal because equal importance is
assigned to each temporal clip. In this paper, we present a strategy for
efficient video-based training of 3D CNNs. It relies on softmax temporal
pooling and a weighted sampling mechanism to select the most relevant training
clips. The proposed softmax strategy provides several advantages: a reduced
computational complexity due to efficient clip sampling, and an improved
accuracy since temporal weighting focuses on more relevant clips during both
training and inference. Experimental results obtained with the proposed method
on several facial expression recognition benchmarks show the benefits of
focusing on more informative clips in training videos. In particular, our
approach improves accuracy and reduces computational cost by mitigating the
impact of inaccurate trimming, coarse temporal annotation of videos, and the
heterogeneous distribution of visual information across time.
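A minimal PyTorch sketch of the idea described in the abstract: per-clip scores are pooled with a temporal softmax, and the resulting weights drive the sampling of training clips. The relevance measure, temperature, and sampling scheme below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def softmax_temporal_pooling(clip_logits, temperature=1.0):
    """Aggregate per-clip scores of one video with softmax temporal pooling.

    clip_logits: (num_clips, num_classes) tensor of per-clip class scores.
    Returns video-level scores and the per-clip weights used for pooling.
    Assumption: each clip's relevance is its maximum class score.
    """
    clip_relevance = clip_logits.max(dim=1).values             # (num_clips,)
    weights = F.softmax(clip_relevance / temperature, dim=0)   # (num_clips,)
    video_logits = (weights.unsqueeze(1) * clip_logits).sum(dim=0)
    return video_logits, weights

def sample_training_clips(weights, num_samples=2):
    """Weighted sampling of clip indices: clips with higher softmax weight
    are more likely to be selected for the next training iterations."""
    return torch.multinomial(weights, num_samples, replacement=False)

# Toy usage: 8 candidate clips from one video, 7 expression classes.
clip_logits = torch.randn(8, 7)
video_logits, weights = softmax_temporal_pooling(clip_logits)
selected = sample_training_clips(weights)
print(video_logits.shape, selected)
```

With a low temperature the pooling approaches max-pooling over clips; with a high temperature it approaches uniform averaging, so the temperature controls how strongly training focuses on the most relevant clips.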
Related papers
- PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding [19.006364251731753]
PESFormer is a model based on the vision transformer architecture to achieve point-to-interval expression spotting.
PESFormer employs a direct timestamp encoding (DTE) approach to replace anchors, enabling binary classification of each timestamp.
We implement a strategy that involves zero-padding the untrimmed training videos to create uniform, longer videos of a predetermined duration.
arXiv Detail & Related papers (2024-10-24T12:45:25Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Spatio-Temporal Crop Aggregation for Video Representation Learning [33.296154476701055]
Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
arXiv Detail & Related papers (2022-11-30T14:43:35Z) - Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z) - Optimization Planning for 3D ConvNets [123.43419144051703]
It is not trivial to optimally train 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the many options of the training scheme.
We decompose the training path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and input clip length, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., optimization path.
arXiv Detail & Related papers (2022-01-11T16:13:31Z) - Video Summarization through Reinforcement Learning with a 3D
Spatio-Temporal U-Net [15.032516344808526]
We introduce 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed video summarization has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis.
arXiv Detail & Related papers (2021-06-19T16:27:19Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization
Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method improves over state-of-the-art real-time methods on the UCF101 action recognition benchmark by 5.4% in accuracy, with 2 times faster inference and a model of less than 5MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - An Information-rich Sampling Technique over Spatio-Temporal CNN for
Classification of Human Actions in Videos [5.414308305392762]
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier.
In this paper, a 3D CNN architecture is proposed to extract features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions (a minimal sketch of such a pipeline appears after this list).
Experiments on the KTH and WEIZMANN human action datasets show results comparable to state-of-the-art techniques.
arXiv Detail & Related papers (2020-02-06T05:07:41Z)
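As referenced in the last item above, here is a minimal sketch of a 3D-CNN feature extractor followed by an LSTM classifier. The layer sizes, segment layout, and the 6-class output (matching KTH's action classes) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class C3DLSTMClassifier(nn.Module):
    """Illustrative pipeline: a small 3D CNN extracts per-segment
    spatio-temporal features, and an LSTM aggregates them over time."""
    def __init__(self, num_classes=6, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                  # -> (B, 64, 1, 1, 1)
        )
        self.proj = nn.Linear(64, feat_dim)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, segments):
        # segments: (batch, num_segments, 3, T, H, W)
        b, s = segments.shape[:2]
        x = segments.flatten(0, 1)                    # (b*s, 3, T, H, W)
        feats = self.backbone(x).flatten(1)           # (b*s, 64)
        feats = self.proj(feats).view(b, s, -1)       # (b, s, feat_dim)
        out, _ = self.lstm(feats)                     # (b, s, feat_dim)
        return self.head(out[:, -1])                  # video-level class scores

# Toy usage: 2 videos, 4 segments of 8 frames at 32x32 resolution.
model = C3DLSTMClassifier()
logits = model(torch.randn(2, 4, 3, 8, 32, 32))
print(logits.shape)   # torch.Size([2, 6])
```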