Related papers: FaceRNET: a Facial Expression Intensity Estimation Network

Related papers

MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment [5.922172844641853]
This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding.<n>MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection.<n> Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance.
arXiv Detail & Related papers (2025-06-12T07:32:51Z)
Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders. We employ a cascade of decoders all feature levels across all feature levels to optimally exploit the derived features.
arXiv Detail & Related papers (2025-01-14T03:15:46Z)
UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos [83.48170683672427]
UniLearn is a unified learning paradigm that integrates static facial expression recognition data to enhance DFER task. UniLearn consistently state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks. We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z)
Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM) This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images. We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy. Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z)
Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks. We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling. Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency. EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
MINTIME: Multi-Identity Size-Invariant Video Deepfake Detection [17.74528571088335]
We introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes. It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people.
arXiv Detail & Related papers (2022-11-20T15:17:24Z)
AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation [24.81870045216019]
Temporal action proposal generation (TAPG) is a challenging task, which requires localizing action intervals in an untrimmed video. We propose to model these interactions with a multi-modal representation network, namely, Actors-Objects-Environment Interaction Network (AOE-Net) Our AOE-Net consists of two modules, i.e., perception-based multi-modal representation (PMR) and boundary-matching module (BMM)
arXiv Detail & Related papers (2022-10-05T21:57:25Z)
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects. We tackle this problem from two different angles: algorithm and dataset. We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
Video-based Facial Expression Recognition using Graph Convolutional Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition. We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
arXiv Detail & Related papers (2020-10-26T07:31:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.