An optimized Capsule-LSTM model for facial expression recognition with
video sequences
- URL: http://arxiv.org/abs/2106.07564v1
- Date: Thu, 27 May 2021 10:08:05 GMT
- Title: An optimized Capsule-LSTM model for facial expression recognition with
video sequences
- Authors: Siwei Liu (1), Yuanpeng Long (2), Gao Xu (1), Lijia Yang (1), Shimei
Xu (3), Xiaoming Yao (1,3), Kunxian Shu (1) ((1) School of Computer Science
and Technology, Chongqing Key Laboratory on Big Data for Bio Intelligence,
Chongqing University of Posts and Telecommunications, Chongqing, China, (2)
School of Economic Information Engineering, Southwestern University of
Finance and Economics, Chengdu, China, (3) 51yunjian.com, Hetie International
Square, Chengdu, Sichuan, China)
- Abstract summary: The model is composed of three networks: a capsule encoder, a capsule decoder, and an LSTM network.
The experimental results from the MMI dataset show that the Capsule-LSTM model can effectively improve the accuracy of video expression recognition.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To overcome the limitations of convolutional neural networks in facial
expression recognition, a facial expression recognition model based on video
frame sequences, Capsule-LSTM, is proposed. The model is composed of three
networks: a capsule encoder, a capsule decoder, and an LSTM network. The
capsule encoder extracts the spatial information of facial expressions in
video frames. The capsule decoder reconstructs the images to optimize the
network. The LSTM network extracts the temporal information between video
frames and analyzes the differences in expression changes between frames.
Experimental results on the MMI dataset show that the proposed Capsule-LSTM
model can effectively improve the accuracy of video expression recognition.
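The abstract describes the architecture but the listing carries no code, so the following is a minimal PyTorch sketch of the three-part design. The layer sizes, the squash nonlinearity, the 48x48 grayscale input, and the six expression classes are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    # Capsule squashing nonlinearity (Sabour et al., 2017).
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class CapsuleEncoder(nn.Module):
    """Extracts per-frame spatial information as capsule vectors."""
    def __init__(self, caps_dim=8, n_caps=32):
        super().__init__()
        self.conv = nn.Conv2d(1, 256, kernel_size=9)            # grayscale frames assumed
        self.primary = nn.Conv2d(256, n_caps * caps_dim, kernel_size=9, stride=2)
        self.pool = nn.AdaptiveAvgPool2d(4)                     # fixed feature size for any input
        self.caps_dim = caps_dim

    def forward(self, x):                                       # x: (B, 1, H, W)
        h = self.pool(self.primary(torch.relu(self.conv(x))))
        caps = h.view(h.size(0), -1, self.caps_dim)             # one capsule per spatial unit
        return squash(caps).flatten(1)                          # (B, 4096) with these sizes

class CapsuleLSTM(nn.Module):
    """Capsule encoder per frame -> LSTM over time -> expression logits.
    The capsule decoder serves only as an auxiliary reconstruction branch."""
    def __init__(self, feat_dim=4096, n_classes=6, hidden=256, img_pixels=48 * 48):
        super().__init__()
        self.encoder = CapsuleEncoder()
        self.decoder = nn.Sequential(                           # reconstructs frames to optimize the network
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, img_pixels), nn.Sigmoid())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, frames):                                  # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))              # encode every frame
        recon = self.decoder(feats)                             # reconstruction of the inputs
        out, _ = self.lstm(feats.view(b, t, -1))                # temporal modeling across frames
        return self.cls(out[:, -1]), recon                      # logits + reconstructions
```

A weighted sum of classification cross-entropy on the logits and a reconstruction MSE term would train all three parts jointly, which is the role the abstract assigns to the decoder.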
Related papers
- xLSTM-FER: Enhancing Student Expression Recognition with Extended Vision Long Short-Term Memory Network [0.8287206589886881]
This paper introduces xLSTM-FER, a novel architecture derived from the Extended Long Short-Term Memory (xLSTM).
xLSTM-FER processes input images by segmenting them into a series of patches and leveraging a stack of xLSTM blocks to handle these patches.
Experiments on CK+, RAF-DB, and FERplus demonstrate the potential of xLSTM-FER in expression recognition tasks.
arXiv Detail & Related papers (2024-10-07T14:29:24Z)
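The patch-as-sequence idea in the xLSTM-FER summary above can be sketched as follows. PyTorch ships no xLSTM layer, so a plain nn.LSTM stands in for the xLSTM block stack; the patch size, width, and class count are assumptions.

```python
import torch
import torch.nn as nn

class PatchSequenceFER(nn.Module):
    """Image -> patch sequence -> recurrent block stack -> expression logits.
    nn.LSTM is a stand-in for the xLSTM blocks the paper describes."""
    def __init__(self, patch=8, dim=128, n_classes=7):
        super().__init__()
        # Non-overlapping patch embedding, as in ViT-style tokenization.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, x):                                  # x: (B, 1, 48, 48)
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, 36, dim)
        out, _ = self.blocks(tokens)                       # scan patches as a sequence
        return self.cls(out.mean(dim=1))                   # pool over the patch sequence
```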
- Exploring Effective Mask Sampling Modeling for Neural Image Compression [171.35596121939238]
Most existing neural image compression methods rely on side information from hyperprior or context models to eliminate spatial redundancy.
Inspired by the mask sampling modeling in recent self-supervised learning methods for natural language processing and high-level vision, we propose a novel pretraining strategy for neural image compression.
Our method achieves competitive performance with lower computational complexity compared to state-of-the-art image compression methods.
arXiv Detail & Related papers (2023-06-09T06:50:20Z)
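The mask-sampling pretraining described above can be illustrated generically: mask random patches of the input and train any image-to-image network to reconstruct the original. The patch size, mask ratio, and plain MSE objective here are assumptions, not the paper's actual scheme.

```python
import torch
import torch.nn.functional as F

def masked_patch_pretrain_step(model, images, patch=16, mask_ratio=0.5):
    """One illustrative pretraining step: randomly zero out image patches
    and train `model` to reconstruct the full original image."""
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, 1, gh, gw, device=images.device) > mask_ratio
    mask = keep.float().repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    recon = model(images * mask)             # model sees only unmasked patches
    return F.mse_loss(recon, images)         # reconstruction objective
```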
- An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder-decoder framework for image captioning.
It takes both the semantic context and the time complexity into consideration.
The suggested model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z)
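The CNN+GRU pipeline above follows the standard encoder-decoder captioning pattern; here is a hedged sketch with a toy CNN backbone and a teacher-forced GRU decoder. All sizes and the vocabulary are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CNNGRUCaptioner(nn.Module):
    """CNN encoder -> GRU decoder for image captioning (generic sketch)."""
    def __init__(self, vocab_size, embed=256, hidden=512, feat=512):
        super().__init__()
        self.backbone = nn.Sequential(        # tiny CNN stand-in for the encoder
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.init_h = nn.Linear(feat, hidden)  # image feature seeds the GRU state
        self.emb = nn.Embedding(vocab_size, embed)
        self.gru = nn.GRU(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):       # teacher forcing during training
        h0 = torch.tanh(self.init_h(self.backbone(images))).unsqueeze(0)
        out, _ = self.gru(self.emb(captions), h0)
        return self.out(out)                   # (B, T, vocab_size) logits
```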
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
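The frozen-backbone idea above amounts to keeping a pretrained image encoder fixed and training only a small temporal head. A sketch under that assumption, where `image_encoder` is any module returning (B, feat_dim) features (for example a CLIP image tower); the two-layer transformer head is an assumption.

```python
import torch
import torch.nn as nn

class FrozenBackboneVideoLearner(nn.Module):
    """Frozen per-frame image encoder + small trainable temporal head."""
    def __init__(self, image_encoder, feat_dim, n_classes, heads=8):
        super().__init__()
        self.encoder = image_encoder.eval()
        for p in self.encoder.parameters():    # keep the backbone frozen
            p.requires_grad_(False)
        layer = nn.TransformerEncoderLayer(feat_dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(feat_dim, n_classes)

    def forward(self, video):                  # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        with torch.no_grad():                  # no gradients through the backbone
            feats = self.encoder(video.flatten(0, 1)).view(b, t, -1)
        return self.cls(self.temporal(feats).mean(dim=1))
```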
- Multi-Modal Zero-Shot Sign Language Recognition [51.07720650677784]
We propose a multi-modal Zero-Shot Sign Language Recognition model.
A Transformer-based model along with a C3D model is used for hand detection and deep features extraction.
A semantic space is used to map the visual features to the lingual embedding of the class labels.
arXiv Detail & Related papers (2021-09-02T09:10:39Z)
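The semantic-space mapping above is the classic zero-shot recipe: project visual features into the label-embedding space and score classes by similarity. A generic sketch; the linear projection and the source of the label embeddings are placeholders, not the paper's pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticSpaceClassifier(nn.Module):
    """Maps visual features to the lingual embedding of class labels and
    scores classes by cosine similarity (zero-shot recognition sketch)."""
    def __init__(self, visual_dim, label_embeddings):   # (n_classes, emb_dim)
        super().__init__()
        self.register_buffer("labels", F.normalize(label_embeddings, dim=-1))
        self.project = nn.Linear(visual_dim, label_embeddings.size(1))

    def forward(self, visual_feats):                    # (B, visual_dim)
        z = F.normalize(self.project(visual_feats), dim=-1)
        return z @ self.labels.t()                      # similarity per class

# Unseen classes are handled by swapping in their label embeddings at test time.
```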
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation [29.00635219317848]
This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner.
We also present a simple yet effective training strategy, which replaces a frame in video sequence with noises.
arXiv Detail & Related papers (2020-10-19T13:08:15Z)
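The Noisy-LSTM training strategy above, replacing a frame in the video sequence with noise, reduces to a few lines; the noise distribution and the replacement probability are assumptions.

```python
import torch

def replace_frame_with_noise(clip, p=0.5):
    """With probability p, replace one randomly chosen frame of the clip with
    noise so the temporal model cannot over-rely on any single frame."""
    clip = clip.clone()                        # clip: (B, T, C, H, W)
    if torch.rand(()) < p:
        t = torch.randint(clip.size(1), ())    # pick a frame index
        clip[:, t] = torch.randn_like(clip[:, t])
    return clip
```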
- Beyond Single Stage Encoder-Decoder Networks: Deep Decoders for Semantic Image Segmentation [56.44853893149365]
Single encoder-decoder methodologies for semantic segmentation are reaching their peak in terms of segmentation quality and efficiency per number of layers.
We propose a new architecture based on a decoder which uses a set of shallow networks for capturing more information content.
In order to further improve the architecture we introduce a weight function which aims to re-balance classes to increase the attention of the networks to under-represented objects.
arXiv Detail & Related papers (2020-07-19T18:44:34Z)
- NITS-VC System for VATEX Video Captioning Challenge 2020 [16.628598778804403]
We employ an encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D).
Our model is able to achieve BLEU scores of 0.20 and 0.22 on public and private test data sets respectively.
arXiv Detail & Related papers (2020-06-07T06:39:56Z)
- Dual Convolutional LSTM Network for Referring Image Segmentation [18.181286443737417]
Referring image segmentation is a problem at the intersection of computer vision and natural language understanding.
We propose a dual convolutional LSTM (ConvLSTM) network to tackle this problem.
arXiv Detail & Related papers (2020-01-30T20:40:18Z)
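A dual ConvLSTM network builds on the standard convolutional LSTM cell; the generic cell below is the building block such a network would apply to its visual and language feature streams. This is the textbook ConvLSTM update, not the paper's exact dual arrangement.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: LSTM gating with conv layers so the hidden
    state stays a spatial feature map instead of a flat vector."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One conv produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):               # x: (B, in_ch, H, W)
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)          # update the cell state
        h = o * torch.tanh(c)                  # new hidden feature map
        return h, (h, c)
```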
- An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generation network to reconstruct video frames with the guidance of learned motion pattern.
By learning to extract sparse motion pattern via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.