The GUA-Speech System Description for CNVSRC Challenge 2023
- URL: http://arxiv.org/abs/2312.07254v1
- Date: Tue, 12 Dec 2023 13:35:33 GMT
- Title: The GUA-Speech System Description for CNVSRC Challenge 2023
- Authors: Shengqiang Li, Chao Lei, Baozhong Ma, Binbin Zhang, Fuping Pan
- Abstract summary: This study describes our system for Task 1 Single-speaker Visual Speech Recognition (VSR) fixed track in the Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
We use intermediate connectionist temporal classification (Inter CTC) residual modules to relax the conditional independence assumption of CTC in our model.
We also use a bi-transformer decoder to enable the model to capture both past and future contextual information.
- Score: 8.5257557043542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study describes our system for Task 1 Single-speaker Visual Speech
Recognition (VSR) fixed track in the Chinese Continuous Visual Speech
Recognition Challenge (CNVSRC) 2023. Specifically, we use intermediate
connectionist temporal classification (Inter CTC) residual modules to relax the
conditional independence assumption of CTC in our model. Then we use a
bi-transformer decoder to enable the model to capture both past and future
contextual information. In addition, we use Chinese characters as the modeling
units to improve the recognition accuracy of our model. Finally, we use a
recurrent neural network language model (RNNLM) for shallow fusion in the
inference stage. Experiments show that our system achieves a character error
rate (CER) of 38.09% on the Eval set, a relative CER reduction of 21.63% over
the official baseline, and obtains second place in the challenge.
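To make the Inter CTC idea concrete, below is a minimal PyTorch sketch of an encoder whose intermediate layers emit CTC posteriors that are fed back into the stack (self-conditioning), which is what relaxes CTC's conditional independence assumption. The layer indices, dimensions, shared projection, and loss weight are illustrative assumptions, not values taken from the paper.

```python
import torch.nn as nn

class InterCTCEncoder(nn.Module):
    """Encoder stack with intermediate CTC branches (illustrative sketch)."""

    def __init__(self, num_layers=12, d_model=256, nhead=4,
                 vocab_size=4000, inter_layers=(3, 6, 9)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.ctc_proj = nn.Linear(d_model, vocab_size)   # shared CTC projection
        self.back_proj = nn.Linear(vocab_size, d_model)  # posterior -> features

        self.inter_layers = set(inter_layers)

    def forward(self, x):
        # x: (batch, frames, d_model) visual feature sequence
        inter_logits = []
        for i, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if i in self.inter_layers:
                logits = self.ctc_proj(x)      # intermediate CTC logits
                inter_logits.append(logits)
                # residual self-conditioning: feed the posterior back into the stack
                x = x + self.back_proj(logits.softmax(dim=-1))
        return self.ctc_proj(x), inter_logits

# Training combines the losses, e.g.
#   loss = (1 - w) * ctc(final_logits) + w * mean(ctc(l) for l in inter_logits)
# with w around 0.3 (an illustrative value, not one reported in the paper).
```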
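The bi-transformer decoder pairs a left-to-right decoder with a right-to-left decoder over the same encoder output, so the training signal covers both past and future context. The sketch below assumes standard PyTorch transformer decoders; the sizes, the whole-sequence flip (which ignores padding), and the omission of start-token shifting are simplifications for illustration.

```python
import torch
import torch.nn as nn

class BiTransformerDecoder(nn.Module):
    """Left-to-right and right-to-left attention decoders over a shared encoder."""

    def __init__(self, vocab_size=4000, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.l2r = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.r2l = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, enc_out, targets):
        # enc_out: (batch, frames, d_model) encoder output
        # targets: (batch, length) character ids; teacher forcing with a start
        # token and shifted targets is omitted for brevity
        length = targets.size(1)
        causal = torch.triu(
            torch.full((length, length), float("-inf"), device=targets.device),
            diagonal=1)
        fwd = self.out(self.l2r(self.embed(targets), enc_out, tgt_mask=causal))
        bwd = self.out(self.r2l(self.embed(targets.flip([1])), enc_out, tgt_mask=causal))
        # Train fwd against the targets and bwd against the flipped targets with
        # cross-entropy, then add both (optionally weighted) to the CTC loss.
        return fwd, bwd
```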
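During inference, the RNNLM contributes through shallow fusion: at every beam-search step its log-probability for each candidate character is added, with a tunable weight, to the VSR model's score. A minimal sketch; the fusion weight and tensor shapes are assumptions for illustration.

```python
import torch

def shallow_fusion_step(model_log_probs: torch.Tensor,
                        lm_log_probs: torch.Tensor,
                        lm_weight: float = 0.3) -> torch.Tensor:
    """Fuse VSR-model and RNNLM scores for one beam-search step.

    model_log_probs: (beam, vocab) log-probabilities from the VSR decoder.
    lm_log_probs:    (beam, vocab) log-probabilities from the RNNLM, conditioned
                     on each hypothesis prefix.
    lm_weight:       interpolation weight (illustrative; tuned on a dev set).
    """
    return model_log_probs + lm_weight * lm_log_probs

# Beam candidates are then ranked by the fused score; the RNNLM only affects
# inference and adds no parameters to the VSR model itself.
```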
Related papers
- The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in
CNVSRC 2023 [67.11294606070278]
This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023.
In terms of data processing, we leverage the lip motion extractor from the baseline to produce multi-scale video data.
Various augmentation techniques are applied during training, encompassing speed perturbation, random rotation, horizontal flipping, and color transformation.
arXiv Detail & Related papers (2024-01-07T14:20:52Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks [8.651248939672769]
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
arXiv Detail & Related papers (2022-05-04T10:36:57Z)
- Two-Stream Consensus Network: Submission to HACS Challenge 2021 Weakly-Supervised Learning Track [78.64815984927425]
The goal of weakly-supervised temporal action localization is to temporally locate and classify action of interest in untrimmed videos.
We adopt the two-stream consensus network (TSCN) as the main framework in this challenge.
Our solution ranked 2nd in this challenge, and we hope our method can serve as a baseline for future academic research.
arXiv Detail & Related papers (2021-06-21T03:36:36Z)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network [32.59704287230343]
The proposed approach classifies the input into three classes: non-speech, single-speaker speech, and overlapped speech.
A convolutional recurrent neural network architecture is explored to benefit from both the convolutional layers' capability to model local patterns and the recurrent layers' ability to model sequential information.
The proposed overlapped speech detection model establishes a state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set.
arXiv Detail & Related papers (2021-04-07T03:01:34Z)
- Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge [26.114011076658237]
We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC).
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
arXiv Detail & Related papers (2020-05-19T13:06:17Z)
- Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use the language identities to bias the model to predict code-switching (CS) points.
This encourages the model to learn language identity information directly from the transcription, so no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.