SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
- URL: http://arxiv.org/abs/2307.04552v1
- Date: Mon, 10 Jul 2023 13:34:13 GMT
- Title: SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
- Authors: Adriana Fernandez-Lopez, Honglie Chen, Pingchuan Ma, Alexandros Haliassos, Stavros Petridis and Maja Pantic
- Abstract summary: We generate a lightweight model that achieves higher performance than its dense model equivalent.
Our results confirm that sparse networks are more resistant to noise than dense networks.
- Score: 100.43280310123784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in deep neural networks have achieved unprecedented success
in visual speech recognition. However, there remains substantial disparity
between current methods and their deployment in resource-constrained devices.
In this work, we explore different magnitude-based pruning techniques to
generate a lightweight model that achieves higher performance than its dense
model equivalent, especially under the presence of visual noise. Our sparse
models achieve state-of-the-art results at 10% sparsity on the LRS3 dataset and
outperform the dense equivalent up to 70% sparsity. We evaluate our 50% sparse
model on 7 different visual noise types and achieve an overall absolute
improvement of more than 2% WER compared to the dense equivalent. Our results
confirm that sparse networks are more resistant to noise than dense networks.
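The abstract does not say which magnitude-based pruning variants are compared, so the following is only a generic sketch of the basic idea: rank weights by absolute value and zero out the smallest fraction. It is not the authors' implementation. The function names `magnitude_prune` and `sparsity_of` and the toy weight matrix are hypothetical, and the pruning shown is one-shot and unstructured, which may differ from the schedules used in the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    `weights` is a small dense matrix given as a list of rows. Returns the
    pruned matrix and the binary keep-mask. Illustrative only (hypothetical
    helper, not taken from the paper).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    # Collect (magnitude, row, col) for every weight.
    entries = [
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    ]
    n_prune = int(sparsity * len(entries))
    # Smallest magnitudes first; the first n_prune entries get masked out.
    entries.sort(key=lambda t: t[0])
    mask = [[1] * len(row) for row in weights]
    for _, r, c in entries[:n_prune]:
        mask[r][c] = 0
    pruned = [
        [w if m else 0.0 for w, m in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]
    return pruned, mask


def sparsity_of(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for w in row if w == 0)
    return zeros / total


# Toy example: prune half of an 8-weight layer.
weights = [
    [0.80, -0.05, 0.30, -0.60],
    [0.02, 0.45, -0.10, 0.90],
]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print(mask)
print(sparsity_of(pruned))
```

In practice the mask is usually kept fixed (or updated on a schedule) while the surviving weights continue to train, so that the network can recover from the removed connections.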
Related papers
- Robust Network Learning via Inverse Scale Variational Sparsification [55.64935887249435]
We introduce an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation.
Unlike frequency-based methods, our approach removes noise by smoothing small-scale features.
We show the efficacy of our approach through enhanced robustness against various noise types.
arXiv Detail & Related papers (2024-09-27T03:17:35Z)
- A Real-Time Voice Activity Detection Based On Lightweight Neural [4.589472292598182]
Voice activity detection (VAD) is the task of detecting speech in an audio stream.
Recent neural network-based VADs have alleviated the degradation of performance to some extent.
We propose a lightweight, real-time neural network called MagicNet, which uses causal and depth-separable 1-D convolutions and a GRU.
arXiv Detail & Related papers (2024-05-27T03:31:16Z)
- Improved Generalization of Weight Space Networks via Augmentations [56.571475005291035]
Learning in deep weight spaces (DWS) is an emerging research direction, with applications to 2D and 3D neural fields (INRs, NeRFs).
We empirically analyze why weight space models overfit and find that a key reason is the lack of diversity in DWS datasets.
To address this, we explore strategies for data augmentation in weight spaces and propose a MixUp method adapted for weight spaces.
arXiv Detail & Related papers (2024-02-06T15:34:44Z)
- Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings [7.42741711946564]
We introduce the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks.
Compared with standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines.
arXiv Detail & Related papers (2023-06-01T14:00:47Z)
- WeightMom: Learning Sparse Networks using Iterative Momentum-based pruning [0.0]
We propose a weight-based pruning approach in which the weights are pruned gradually based on their momentum over previous iterations.
We evaluate our approach on networks such as AlexNet, VGG16 and ResNet50 with image classification datasets such as CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-08-11T07:13:59Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking-based augmentation technique on the log-mel spectrograms can significantly improve recognition performance.
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- Coresets for Robust Training of Neural Networks against Noisy Labels [78.03027938765746]
We propose a novel approach with strong theoretical guarantees for robust training of deep networks trained with noisy labels.
We select weighted subsets (coresets) of clean data points that provide an approximately low-rank Jacobian matrix.
Our experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve significantly better performance than the state of the art.
arXiv Detail & Related papers (2020-11-15T04:58:11Z)
- HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity-inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called Hierarchical Adaptive Lasso which learns to adaptively sparsify weights of a given network via trainable parameters.
arXiv Detail & Related papers (2020-08-24T04:08:48Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)