Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource
- URL: http://arxiv.org/abs/2508.07233v1
- Date: Sun, 10 Aug 2025 08:26:55 GMT
- Title: Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource
- Authors: Lei Yang, Junshan Jin, Mingyuan Zhang, Yi He, Bofan Chen, Shilin Wang
- Abstract summary: Visual speech recognition is a technique to identify spoken content in silent speech videos. Deep learning methods can be affected by visual disturbances, such as lighting conditions. This paper proposes a landmark guided visual feature extractor.
- Score: 24.004478804309763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual speech recognition is a technique to identify spoken content in silent speech videos, and it has attracted significant attention in recent years. Advancements in data-driven deep learning methods have significantly improved both the speed and accuracy of recognition. However, these deep learning methods can be affected by visual disturbances, such as lighting conditions, skin texture and other user-specific features. Data-driven approaches can reduce the performance degradation caused by these visual disturbances by using models pretrained on large-scale datasets, but they often require large amounts of training data and computational resources, making them costly. To reduce the influence of user-specific features and enhance performance with limited data, this paper proposes a landmark guided visual feature extractor. Facial landmarks are used as auxiliary information to aid in training the visual feature extractor. A spatio-temporal multi-graph convolutional network is designed to fully exploit the spatial locations and spatio-temporal features of facial landmarks. Additionally, a multi-level lip dynamic fusion framework is introduced to combine the spatio-temporal features of the landmarks with the visual features extracted from the raw video frames. Experimental results show that this approach performs well with limited data and also improves the model's accuracy on unseen speakers.
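The abstract describes the architecture only at a high level, but the core idea of convolving over landmark coordinates along a facial graph and then mixing information across time can be sketched. The following PyTorch snippet is a minimal single-graph illustration under assumed layer sizes and a simple lip-contour chain adjacency; the paper's multi-graph design and multi-level lip dynamic fusion framework are not reproduced here.

```python
# Illustrative sketch only: one spatio-temporal graph convolution over facial
# landmarks. The adjacency, layer sizes, and omitted fusion stages are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    """A spatial graph convolution followed by a temporal convolution."""

    def __init__(self, in_ch, out_ch, adj, t_kernel=9):
        super().__init__()
        # Normalized adjacency encodes which landmarks are connected
        # (e.g. neighboring points on the lip contour).
        self.register_buffer("adj", adj)            # (V, V)
        self.spatial = nn.Linear(in_ch, out_ch)     # per-node feature transform
        self.temporal = nn.Conv2d(
            out_ch, out_ch,
            kernel_size=(t_kernel, 1),
            padding=(t_kernel // 2, 0),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, time, num_landmarks, channels), e.g. channels = 2 for (x, y)
        x = self.spatial(x)                              # transform node features
        x = torch.einsum("btvc,vw->btwc", x, self.adj)   # aggregate over neighbors
        x = x.permute(0, 3, 1, 2)                        # -> (batch, ch, time, V)
        x = self.relu(self.temporal(x))                  # mix features across time
        return x.permute(0, 2, 3, 1)                     # -> (batch, time, V, ch)

# Toy usage: 20 lip landmarks tracked over 25 frames, 2D coordinates.
V = 20
adj = torch.eye(V) + torch.diag(torch.ones(V - 1), 1) + torch.diag(torch.ones(V - 1), -1)
adj = adj / adj.sum(dim=1, keepdim=True)    # simple row normalization
layer = STGraphConv(in_ch=2, out_ch=64, adj=adj)
feats = layer(torch.randn(8, 25, V, 2))     # -> (8, 25, 20, 64)
```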
Related papers
- Learning Robust Spatial Representations from Binaural Audio through Feature Distillation [64.36563387033921]
We investigate the use of a pretraining stage based on feature distillation to learn a robust spatial representation of speech without the need for data labels.
Our experiments demonstrate that the pretrained models show improved performance in noisy and reverberant environments.
arXiv Detail & Related papers (2025-08-28T15:43:15Z)
- Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events [25.348660233701708]
Event cameras record data with high temporal resolution and wide dynamic range.
Event data is inherently sparse and noisy, mainly reflecting brightness changes.
We propose a self-supervised pre-training framework to fully reveal latent information in event data.
arXiv Detail & Related papers (2025-08-07T15:38:36Z)
- Adaptive Masking Enhances Visual Grounding [12.793586888511978]
We propose IMAGE, Interpretative MAsking with Gaussian radiation modEling, to enhance vocabulary grounding in low-shot learning scenarios.
We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks.
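As a toy illustration of Gaussian-weighted masking in the spirit of the IMAGE summary above, the snippet below builds a soft radial mask and attenuates an image region with it; the center, spread, and blending rule are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of Gaussian-style soft masking; all parameters here are
# illustrative guesses, not the paper's actual "Gaussian radiation modeling".
import numpy as np

def gaussian_mask(h, w, center, sigma):
    """Soft 2D mask that decays radially from `center` (row, col)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))   # values in (0, 1]

# Attenuate an image region softly instead of hard-cropping it.
img = np.random.rand(224, 224, 3)
mask = gaussian_mask(224, 224, center=(112, 80), sigma=30.0)
masked = img * (1.0 - mask[..., None])        # suppress the masked region
```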
arXiv Detail & Related papers (2024-10-04T05:48:02Z)
- Attribute-Aware Representation Rectification for Generalized Zero-Shot Learning [19.65026043141699]
Generalized Zero-Shot Learning (GZSL) has achieved remarkable performance by designing a series of unbiased visual-semantic mappings.
We propose a simple yet effective Attribute-Aware Representation Rectification framework for GZSL, dubbed $\mathbf{(AR)^2}$.
arXiv Detail & Related papers (2023-11-23T11:30:32Z)
- Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling [14.88236554564287]
In this work, we build upon advances in unsupervised learning by incorporating information about the structure of a scene into the training process.
We achieve this by (1) learning depth-feature correlation, spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene, and (2) implementing farthest-point sampling to select relevant features more effectively, utilizing 3D sampling techniques on the depth information of the scene.
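Farthest-point sampling itself is a standard greedy routine; a minimal NumPy sketch applied to points lifted from a depth map is shown below. The depth-to-3D lifting and parameter choices are illustrative assumptions, not the paper's exact procedure.

```python
# Generic farthest-point sampling over 3D points lifted from a depth map;
# the lifting and constants below are illustrative assumptions.
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k points, each maximizing distance to those chosen so far."""
    n = points.shape[0]
    chosen = [np.random.randint(n)]            # arbitrary starting point
    dists = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))            # farthest from the current set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

# Toy usage: sample 32 well-spread locations from a flattened depth map.
depth = np.random.rand(60, 80)
ys, xs = np.mgrid[0:60, 0:80]
pts = np.stack([xs.ravel(), ys.ravel(), depth.ravel() * 50], axis=1)
keep = farthest_point_sampling(pts, k=32)      # indices of selected pixels
```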
arXiv Detail & Related papers (2023-09-21T11:47:01Z)
- DeepVisualInsight: Time-Travelling Visualization for Spatio-Temporal Causality of Deep Classification Training [7.4940788786485095]
We propose a time-travelling visual solution, DeepVisualInsight, aiming to manifest the causality while training a deep learning image classifier.
We show how gradient-descent sampling techniques can influence and reshape the layout of the learnt input representation and the boundaries in consecutive epochs.
Our experiments show that, compared to baseline approaches, we achieve the best visualization performance regarding spatial/temporal properties and visualization efficiency.
arXiv Detail & Related papers (2021-12-31T07:05:31Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
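Spectrogram augmentation in this spirit is commonly implemented as SpecAugment-style time and frequency masking; the sketch below assumes generic mask sizes and counts rather than the paper's settings.

```python
# Minimal SpecAugment-style augmentation (time and frequency masking);
# mask widths and counts are illustrative guesses, not the paper's settings.
import numpy as np

def spec_augment(spec, max_f=8, max_t=20, n_masks=2):
    """Zero out random frequency bands and time spans of a (freq, time) spectrogram."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_masks):
        f = np.random.randint(1, max_f + 1)    # band width
        f0 = np.random.randint(0, n_freq - f)
        spec[f0:f0 + f, :] = 0.0               # frequency mask
        t = np.random.randint(1, max_t + 1)    # span length
        t0 = np.random.randint(0, n_time - t)
        spec[:, t0:t0 + t] = 0.0               # time mask
    return spec

augmented = spec_augment(np.random.rand(80, 300))  # 80 mel bins, 300 frames
```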
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Progressive Spatio-Temporal Bilinear Network with Monte Carlo Dropout for Landmark-based Facial Expression Recognition with Uncertainty Estimation [93.73198973454944]
The performance of our method is evaluated on three widely used datasets; it is comparable to that of video-based state-of-the-art methods while having much lower complexity.
arXiv Detail & Related papers (2021-06-08T13:40:30Z)
- Data Augmentation for Object Detection via Differentiable Neural Rendering [71.00447761415388]
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to tackle this problem include semi-supervised learning that interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z)
- Collaborative Distillation in the Parameter and Spectrum Domains for Video Action Recognition [79.60708268515293]
This paper explores how to train small and efficient networks for action recognition.
We propose two distillation strategies in the frequency domain, namely feature spectrum distillation and parameter distribution distillation.
Our method can achieve higher performance than state-of-the-art methods with the same backbone.
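A feature spectrum distillation loss can be sketched as matching FFT magnitudes of teacher and student feature maps; the transform choice and plain MSE below are assumptions, not necessarily the paper's exact loss.

```python
# Hedged sketch of frequency-domain feature distillation: compare teacher and
# student feature maps by their FFT magnitudes. Loss weighting is assumed.
import torch

def spectrum_distill_loss(student_feat, teacher_feat):
    """MSE between FFT magnitudes of (batch, channels, h, w) feature maps."""
    s_mag = torch.fft.rfft2(student_feat).abs()
    t_mag = torch.fft.rfft2(teacher_feat).abs()
    return torch.mean((s_mag - t_mag) ** 2)

# Toy usage with random feature maps standing in for network activations.
loss = spectrum_distill_loss(torch.randn(4, 64, 14, 14), torch.randn(4, 64, 14, 14))
```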
arXiv Detail & Related papers (2020-09-15T07:29:57Z)
- Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition [9.860323576151897]
In image recognition, learning spatially invariant features through data augmentation is a key factor in improving recognition performance.
In this study, we extend these strategies to the temporal dimension for videos to learn temporally invariant or temporally local features.
Based on our novel temporal data augmentation algorithms, video recognition performance is improved using only a limited amount of training data.
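Two generic temporal augmentations, random temporal cropping and frame-rate jitter, give a flavor of augmenting along the time axis; these specific operations are stand-ins, not the paper's proposed algorithms.

```python
# Illustrative temporal augmentations for video clips; generic stand-ins,
# not the paper's algorithms.
import numpy as np

def random_temporal_crop(frames, out_len):
    """Keep a contiguous window of `out_len` frames at a random offset."""
    start = np.random.randint(0, frames.shape[0] - out_len + 1)
    return frames[start:start + out_len]

def frame_rate_jitter(frames, min_stride=1, max_stride=3):
    """Subsample frames with a random stride to vary apparent speed."""
    stride = np.random.randint(min_stride, max_stride + 1)
    return frames[::stride]

clip = np.random.rand(64, 112, 112, 3)         # (time, h, w, channels)
aug = frame_rate_jitter(random_temporal_crop(clip, 48))
```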
arXiv Detail & Related papers (2020-08-13T06:56:52Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)