Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks
- URL: http://arxiv.org/abs/2602.08057v1
- Date: Sun, 08 Feb 2026 17:02:55 GMT
- Title: Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks
- Authors: Yufei Wang, Haixu Liu, Tianxiang Xu, Chuancheng Shi, Hongsheng Xing,
- Abstract summary: This paper proposes a multimodal weak-supervision framework to tackle the automatic recognition of "concealed emotions" in videos. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69.
- Score: 4.888851550406879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match or even surpass GCN-based counterparts in this task.
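The abstract mentions key-point sequences "augmented with inter-frame offset features" but does not say how the offsets are computed. The following is a minimal sketch of one plausible reading, not the authors' implementation: each frame's flattened key-point vector is extended with its difference from the previous frame. The function name `add_offset_features`, the zero offsets for the first frame, and the treatment of each frame as a flat list of 137 values are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]  # one frame of flattened key-point values (assumed layout)


def add_offset_features(frames: List[Frame]) -> List[Frame]:
    """Append inter-frame offsets (frame[t] - frame[t-1]) to every frame.

    Assumption: the first frame has no predecessor, so its offsets are zero.
    """
    if not frames:
        return []
    dim = len(frames[0])
    augmented: List[Frame] = []
    prev = frames[0]
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimensionality")
        # Element-wise difference from the previous frame.
        offsets = [cur - old for cur, old in zip(frame, prev)]
        augmented.append(list(frame) + offsets)
        prev = frame
    return augmented


# Tiny usage example with 2 values per frame instead of 137.
demo = add_offset_features([[1.0, 2.0], [1.5, 1.0], [1.5, 4.0]])
# demo[1] is [1.5, 1.0, 0.5, -1.0]: the raw values followed by their offsets.
```

Under this reading, a 137-value frame becomes a 274-value frame, which a per-frame MLP could consume directly before the sequence is passed to a temporal encoder.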
Related papers
- Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance.
On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z)
- UniVid: The Open-Source Unified Video Model [41.15980565061684]
We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter.
Experiments on standard benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-09-29T02:31:36Z)
- A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features [8.419663258260671]
We introduce an end-to-end network that performs early fusion of offline images and online stroke data.
Our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%.
arXiv Detail & Related papers (2025-06-25T08:58:47Z)
- GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation [13.071227081328288]
Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues.
In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction.
For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs).
To capture temporal dynamics, frame-level features are processed by a BiG
arXiv Detail & Related papers (2025-05-05T13:48:09Z)
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs).
We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset.
We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
- Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging [11.70130626541926]
We propose a novel framework for learning cross-modality features to enhance matching and registration across multi-modality retinal images.
Our model draws on the success of previous learning-based feature detection and description methods.
It is trained in a self-supervised manner by enforcing segmentation consistency between different augmentations of the same image.
arXiv Detail & Related papers (2024-07-25T19:51:27Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that there is an inherent data-hungry matter in learning semantic correspondences.
We demonstrate a simple machine annotator reliably enriches paired key points via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
- BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization [135.73436686653315]
We are dedicated to leveraging the BERT pre-training success and modeling the domain-specific statistics to fertilize the sign language recognition (SLR) model.
Considering the dominance of hand and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone.
Pre-training is performed via reconstructing the masked triplet unit from the corrupted input sequence.
It adaptively extracts the discrete pseudo label from the pose triplet unit, which represents the semantic gesture/body state.
arXiv Detail & Related papers (2023-02-10T06:23:44Z)
- Holistically-Attracted Wireframe Parsing: From Supervised to Self-Supervised Learning [112.54086514317021]
This article presents the Holistically-Attracted Wireframe Parsing 2 method for geometric analysis using line segments and junctions.
The proposed HAWP consists of three components empowered by end-to-form 4D labels.
arXiv Detail & Related papers (2022-10-24T06:39:32Z)
- MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.