Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation
- URL: http://arxiv.org/abs/2506.21855v1
- Date: Fri, 27 Jun 2025 02:18:10 GMT
- Title: Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation
- Authors: Jiho Choi, Sang Jun Lee
- Abstract summary: We propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. We evaluate the proposed method on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations.
- Score: 6.32655874508904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking in terms of video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at https://github.com/ziiho08/Periodic-MAE.
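The physiological bandlimit constraint described in the abstract can be sketched as a penalty on spectral power outside the pulse band. This is an illustrative reconstruction, not the paper's implementation: the function name, the band edges (0.66-3.0 Hz, roughly 40-180 bpm), and the 30 fps frame rate are assumptions.

```python
import numpy as np

def bandlimit_penalty(signal, fs, low_hz=0.66, high_hz=3.0):
    """Fraction of spectral power outside the physiological pulse band.

    Penalising out-of-band energy encourages a model to produce sparse,
    pulse-like spectra. Band edges are common defaults for heart rate,
    not necessarily the paper's exact values.
    """
    signal = signal - np.mean(signal)          # remove DC before the FFT
    psd = np.abs(np.fft.rfft(signal)) ** 2     # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    total = psd.sum() + 1e-8
    return float(psd[~in_band].sum() / total)  # 0.0 = all power in band

fs = 30.0                                      # typical video frame rate
t = np.arange(0, 10, 1.0 / fs)
pulse = np.sin(2 * np.pi * 1.5 * t)            # clean 90 bpm "pulse"
noise = np.random.default_rng(0).standard_normal(t.shape)
print(bandlimit_penalty(pulse, fs) < 0.05)     # True: power stays in band
print(bandlimit_penalty(noise, fs) > 0.5)      # True: noise spreads power
```

A clean in-band sinusoid scores near zero, while white noise, whose energy is spread across the whole spectrum, is heavily penalised; a training loss built this way steers predictions toward quasi-periodic signals.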
Related papers
- PSDNorm: Test-Time Temporal Normalization for Deep Learning in Sleep Staging [63.05435596565677]
We propose PSDNorm, which leverages Monge mapping and temporal context to normalize feature maps in deep learning models for signals. PSDNorm achieves state-of-the-art performance on unseen left-out datasets while being four times more data-efficient than BatchNorm.
arXiv Detail & Related papers (2025-03-06T16:20:25Z) - CodePhys: Robust Video-based Remote Physiological Measurement through Latent Codebook Querying [26.97093819822487]
Remote photoplethysmography (rPPG) aims to measure non-contact physiological signals from facial videos. Most existing methods directly extract video-based rPPG features by designing neural networks for heart rate estimation. Recent methods are easily affected by interference and degradation, resulting in noisy rPPG signals. We propose a novel method named CodePhys, which innovatively treats rPPG measurement as a code query task in a noise-free proxy space.
arXiv Detail & Related papers (2025-02-11T13:05:42Z) - SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement [26.480515954528848]
We propose a novel framework that successfully integrates popular vision-language models into a remote physiological measurement task. We develop a series of generative and contrastive learning mechanisms to optimize the framework. Our method is the first to adapt VLMs to digest and align the frequency-related knowledge in vision and text modalities.
arXiv Detail & Related papers (2024-07-11T13:45:50Z) - SiNC+: Adaptive Camera-Based Vitals with Unsupervised Learning of Periodic Signals [6.458510829614774]
We present the first non-contrastive unsupervised learning framework for signal regression.
We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning periodic signals.
arXiv Detail & Related papers (2024-04-20T19:17:40Z) - Non-Contrastive Unsupervised Learning of Physiological Signals from Video [4.8327232174895745]
We present the first non-contrastive unsupervised learning framework for signal regression to break free from labelled video data.
With minimal assumptions of periodicity and finite bandwidth, our approach is capable of discovering blood volume pulse directly from unlabelled videos.
arXiv Detail & Related papers (2023-03-14T14:34:51Z) - SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z) - Facial Video-based Remote Physiological Measurement via Self-supervised Learning [9.99375728024877]
We introduce a novel framework that learns to estimate rPPG signals from facial videos without the need for ground-truth signals.
Negative samples are generated via a learnable frequency module, which performs nonlinear signal frequency transformation.
Next, we introduce a local rPPG expert aggregation module to estimate rPPG signals from augmented samples.
It encodes complementary pulsation information from different face regions and aggregates it into one rPPG prediction.
arXiv Detail & Related papers (2022-10-27T13:03:23Z) - Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on frequency domain find that the GAN forged images have obvious grid-like visual artifacts in the frequency spectrum compared to the real images.
This paper proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z) - Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
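The frequency-masking step MFM describes can be sketched as follows; this is a minimal illustration, assuming a random per-bin mask over the 2-D spectrum (the function name and masking strategy are not taken from the paper):

```python
import numpy as np

def mask_frequencies(image, mask_ratio=0.5, seed=0):
    """Zero out a random portion of an image's frequency components.

    Sketch of the corruption step: a model would then be trained to
    predict the missing frequencies from the corrupted input.
    """
    spectrum = np.fft.fft2(image)              # 2-D frequency spectrum
    rng = np.random.default_rng(seed)
    keep = rng.random(spectrum.shape) >= mask_ratio
    masked_spectrum = spectrum * keep          # drop ~mask_ratio of bins
    corrupted = np.fft.ifft2(masked_spectrum).real
    return corrupted, masked_spectrum

img = np.arange(64.0).reshape(8, 8)            # toy "image"
corrupted, spec = mask_frequencies(img, mask_ratio=0.5)
print(corrupted.shape)                          # same shape as the input
```

With `mask_ratio=0.0` the inverse FFT recovers the input exactly; as the ratio grows, the reconstruction target becomes harder, which is the knob such a pre-training scheme would tune.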
arXiv Detail & Related papers (2022-06-15T17:58:30Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer [55.936527926778695]
Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited temporal receptive fields.
In this paper, we propose PhysFormer, an end-to-end video-transformer-based architecture.
arXiv Detail & Related papers (2021-11-23T18:57:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.