AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion
- URL: http://arxiv.org/abs/2507.21778v1
- Date: Tue, 29 Jul 2025 13:01:59 GMT
- Title: AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion
- Authors: Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, Zitong Yu,
- Abstract summary: This paper introduces textbfAU-LLM, a novel framework that uses Large Language Models to detect micro-expression Action Units (AUs) in micro-expression datasets with subtle intensities and the scarcity of data.<n>We specifically address the critical vision-language semantic gap, the textbfEnhanced Fusion Projector (EFP). The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token.
- Score: 26.058143518505805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing \textbf{AU-LLM}, a novel framework that for the first time uses LLM to detect AUs in micro-expression datasets with subtle intensities and the scarcity of data. We specifically address the critical vision-language semantic gap, the \textbf{Enhanced Fusion Projector (EFP)}. The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements.Through extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at https://github.com/ZS-liu-JLU/AU-LLMs.
Related papers
- Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions [3.489421154004978]
We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions.<n>Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs.<n>Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning
arXiv Detail & Related papers (2025-06-27T23:06:24Z) - MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception [47.80768014770871]
We propose a novel Micro-Expression Large Language Model (MELLM)<n>It incorporates a subtle facial motion perception strategy with the strong inference capabilities of MLLMs.<n>Our model exhibits superior robustness and generalization capabilities in micro-expression understanding (MEU)
arXiv Detail & Related papers (2025-05-11T15:08:23Z) - MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution [36.79921476565535]
VLF-FFD is a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection.<n> EFF++ is a frame-level, explainability-driven extension of the widely used FaceForensics++ dataset.<n>VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations.
arXiv Detail & Related papers (2025-05-04T06:58:21Z) - X2-DFD: A framework for eXplainable and eXtendable Deepfake Detection [55.77552681618732]
X2-DFD is an eXplainable and eXtendable framework based on multimodal large-language models (MLLMs) for deepfake detection.<n>The first stage, Model Feature Assessment, systematically evaluates the detectability of forgery-related features for the MLLM.<n>The second stage, Explainable dataset Construction, consists of two key modules: Strong Feature Strengthening and Weak Feature Supplementing.<n>The third stage, Fine-tuning and Inference, involves fine-tuning the MLLM on the constructed dataset and deploying it for final detection and explanation.
arXiv Detail & Related papers (2024-10-08T15:28:33Z) - MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector [32.15773300068426]
Existing methods have designed various sophisticated MIA score functions to achieve considerable detection performance.
We propose MIA-Tuner, a novel instruction-based MIA method, which instructs LLMs themselves to serve as a more precise pre-training data detector.
We design two instruction-based safeguards to respectively mitigate the privacy risks brought by the existing methods and MIA-Tuner.
arXiv Detail & Related papers (2024-08-16T11:09:56Z) - Found in the Middle: How Language Models Use Long Contexts Better via
Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional.
(Ms-PoE) which is a simple yet effective plug-and-play approach to enhance the capacity of.
LLMs to handle relevant information located in the middle of the context.
arXiv Detail & Related papers (2024-03-05T04:58:37Z) - It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF)
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs)
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.