MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution
- URL: http://arxiv.org/abs/2506.14511v1
- Date: Tue, 17 Jun 2025 13:35:06 GMT
- Title: MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution
- Authors: Zhiwen Shao, Yifan Cheng, Feiran Li, Yong Zhou, Xuequan Lu, Yuan Xie, Lizhuang Ma
- Abstract summary: Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. We propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. Our framework outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks.
- Score: 46.600316142855334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs. The code is available at https://github.com/CYF-cuber/MOL.
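The graph-style channel correspondence convolution is concrete enough to sketch from the abstract alone: treat each channel's feature map as a graph node and pass messages along channel-to-channel correlations. The layer shapes, the softmax-normalized correlation adjacency, and the residual connection below are illustrative assumptions, not the official layer; the authors' exact design is in the linked repository.

```python
import torch
import torch.nn as nn

class ChannelCorrespondenceConv(nn.Module):
    """Graph-style convolution over channels: each channel map is a node,
    and the adjacency is the correlation between channel feature vectors.
    A sketch of the idea described in the abstract, not the official layer."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        nodes = x.flatten(2)                              # (B, C, H*W)
        # Channel-to-channel correlation, normalized into an adjacency matrix.
        adj = torch.softmax(nodes @ nodes.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        out = adj @ nodes                                 # message passing among channels
        out = self.proj(out.transpose(1, 2)).transpose(1, 2)
        return x + out.view(b, c, h, w)                   # residual connection

# e.g. ChannelCorrespondenceConv(64)(torch.randn(2, 64, 28, 28)) keeps shape (2, 64, 28, 28)
```

The three task heads (MER, optical flow estimation, landmark detection) would then consume the same shared local-global features, which is what lets the two auxiliary tasks regularize MER under scarce training data.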
Related papers
- MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning [8.021031339658492]
Compositional Zero-Shot Learning aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies largely rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features. We propose a Multi-Stage Cross-modal Interaction model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder.
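The recurring idea of tapping intermediate-layer information from a visual encoder, as MSCI does with CLIP, can be prototyped with standard PyTorch forward hooks; the toy encoder below is a placeholder, not CLIP's actual module layout.

```python
import torch
import torch.nn as nn

def collect_intermediate_features(encoder: nn.Module, layers: list, x: torch.Tensor):
    """Run `encoder` on `x` and also return the activations of `layers`."""
    feats = []
    hooks = [l.register_forward_hook(lambda m, inp, out: feats.append(out)) for l in layers]
    try:
        final = encoder(x)
    finally:
        for h in hooks:       # always detach the hooks, even on error
            h.remove()
    return final, feats

# Example with a toy stand-in for a visual encoder:
encoder = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 16, 3))
final, feats = collect_intermediate_features(encoder, [encoder[1]], torch.randn(1, 3, 32, 32))
```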
arXiv Detail & Related papers (2025-05-15T13:36:42Z)
- MambaMIC: An Efficient Baseline for Microscopic Image Classification with State Space Models [12.182070604073585]
We propose a vision backbone for Microscopic Image Classification (MIC) tasks, named MambaMIC. Specifically, we introduce a Local-Global dual-branch aggregation module: the MambaMIC Block. In the local branch, local convolutions capture pixel similarity, mitigating local pixel forgetting and enhancing perception. In the global branch, an SSM extracts global dependencies, while a Locally Aware Enhanced Filter reduces channel redundancy and local pixel forgetting.
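A generic version of such a Local-Global dual-branch block might look as follows; note this is a sketch in which plain self-attention stands in for the Mamba-style SSM the paper actually uses in its global branch.

```python
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Local branch: depthwise conv captures neighborhood pixel similarity.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global branch: self-attention as a stand-in for the paper's SSM.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)  # aggregate both branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).view(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1)) + x
```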
arXiv Detail & Related papers (2024-09-12T10:01:33Z)
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- Micro-Expression Recognition by Motion Feature Extraction based on Pre-training [6.015288149235598]
We propose a novel motion extraction strategy (MoExt) for the micro-expression recognition task.
In MoExt, shape features and texture features are first extracted separately from the onset and apex frames, and motion features related to MEs are then extracted based on the shape features of both frames.
The effectiveness of the proposed method is validated on three commonly used datasets.
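Taken literally, the pipeline above suggests a skeleton like this: shared encoders yield per-frame shape and texture features, and a motion head consumes the shape features of both frames. All module choices here are illustrative guesses, not MoExt's architecture.

```python
import torch
import torch.nn as nn

class MotionExtractor(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        def backbone():  # hypothetical tiny encoder, stands in for the real one
            return nn.Sequential(nn.Conv2d(3, dim, 7, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.shape_enc, self.texture_enc = backbone(), backbone()
        self.motion_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, onset, apex):        # two (B, 3, H, W) frames
        s_on, s_ap = self.shape_enc(onset), self.shape_enc(apex)
        # Motion features come from the shape features of both frames.
        motion = self.motion_head(torch.cat([s_on, s_ap], dim=-1))
        texture = self.texture_enc(apex)   # texture taken from the apex frame here
        return motion, texture
```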
arXiv Detail & Related papers (2024-07-10T03:51:34Z)
- Mesh Denoising Transformer [104.5404564075393]
Mesh denoising is aimed at removing noise from input meshes while preserving their feature structures.
SurfaceFormer is a pioneering Transformer-based mesh denoising framework.
A new representation known as the Local Surface Descriptor captures local geometric intricacies.
A Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation.
arXiv Detail & Related papers (2024-05-10T15:27:43Z)
- Precise Knowledge Transfer via Flow Matching [24.772381404849174]
We name this framework Knowledge Transfer with Flow Matching (FM-KT).
FM-KT can be integrated with a metric-based distillation method of any form (e.g., vanilla KD, DKD, PKD, and DIST).
We empirically validate the scalability and state-of-the-art performance of our proposed methods among relevant comparison approaches.
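Of the metric-based objectives FM-KT is said to wrap, vanilla KD is the one with a standard closed form, reproduced here for reference:

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, T: float = 4.0):
    """Hinton-style KD: KL divergence between temperature-softened distributions."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales gradients back to the magnitude of the unsoftened loss.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```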
arXiv Detail & Related papers (2024-02-03T03:59:51Z)
- GAFlow: Incorporating Gaussian Attention into Optical Flow [62.646389181507764]
We push Gaussian Attention (GA) into the optical flow models to accentuate local properties during representation learning.
We introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks.
For reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM).
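A common way to realize Gaussian attention, and plausibly what a Gaussian-Constrained Layer amounts to, is to bias the attention logits with a Gaussian of the spatial distance between query and key positions; whether GCL/GGAM use exactly this form is an assumption.

```python
import torch

def gaussian_attention_bias(h: int, w: int, sigma: float = 2.0) -> torch.Tensor:
    """(H*W, H*W) bias: near zero for nearby positions, very negative far away."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (H*W, 2)
    d2 = torch.cdist(pos, pos).pow(2)          # pairwise squared distances
    return -d2 / (2 * sigma**2)                # add to logits before softmax

# usage: logits = q @ k.transpose(-2, -1) / d**0.5 + gaussian_attention_bias(h, w)
```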
arXiv Detail & Related papers (2023-09-28T07:46:01Z)
- Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition [18.6490971645882]
Micro-expressions are spontaneous, rapid and subtle facial movements that can neither be forged nor suppressed.
We propose a novel framework, Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT).
Experiments under the solid leave-one-subject-out (LOSO) evaluation protocol demonstrate the superiority of the proposed FRL-DGT over state-of-the-art methods.
arXiv Detail & Related papers (2023-04-10T07:03:36Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
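The adaptive sampling and recovery pipeline reduces to a few lines: a learned block-wise measurement operator compresses the image, and a network (CNN plus Transformer in CSformer) reconstructs from the measurements. The block size, sampling ratio, and the linear initial reconstruction below are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveSampling(nn.Module):
    """Learned block-wise compressive sampling y = Phi x at a given ratio."""
    def __init__(self, block: int = 32, ratio: float = 0.1):
        super().__init__()
        n = block * block
        m = max(1, int(n * ratio))
        # Sampling as a stride-`block` conv == learned Phi applied per block.
        self.sample = nn.Conv2d(1, m, kernel_size=block, stride=block, bias=False)
        self.init_recon = nn.ConvTranspose2d(m, 1, kernel_size=block, stride=block, bias=False)

    def forward(self, x):             # x: (B, 1, H, W), H and W multiples of `block`
        y = self.sample(x)            # measurements
        return y, self.init_recon(y)  # measurements + coarse linear reconstruction
```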
arXiv Detail & Related papers (2021-12-31T04:37:11Z)