End-to-end model for named entity recognition from speech without paired
training data
- URL: http://arxiv.org/abs/2204.00803v1
- Date: Sat, 2 Apr 2022 08:14:27 GMT
- Title: End-to-end model for named entity recognition from speech without paired
training data
- Authors: Salima Mdhaffar, Jarod Duret, Titouan Parcollet, Yannick Estève
- Abstract summary: We propose an approach to build an end-to-end neural model to extract semantic information.
Our approach is based on the use of an external model trained to generate a sequence of vectorial representations from text.
Experiments on named entity recognition, carried out on the QUAERO corpus, show that this approach is very promising.
- Score: 12.66131972249388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end neural approaches have recently become very popular for
spoken language understanding (SLU). The term end-to-end refers to the use of a
single model optimized to extract semantic information directly from the speech
signal. A major issue for such models is the lack of paired audio and textual
data with semantic annotation. In this paper, we propose an approach to build
an end-to-end neural model that extracts semantic information in a scenario in
which no paired audio data is available. Our approach relies on an external
model trained to generate a sequence of vectorial representations from text.
These representations mimic the hidden representations that could be generated
inside an end-to-end automatic speech recognition (ASR) model by processing a
speech signal. An SLU neural module is then trained using these representations
as input and the annotated text as output. Finally, the SLU module replaces the
top layers of the ASR model to complete the construction of the end-to-end
model. Our experiments on named entity recognition, carried out on the QUAERO
corpus, show that this approach is very promising, yielding better results than
a comparable cascade approach or than the use of synthetic voices.
Related papers
- BrewCLIP: A Bifurcated Representation Learning Framework for Audio-Visual Retrieval [3.347768376390811]
We investigate whether non-textual information, which is overlooked by pipeline-based models, can be leveraged to improve speech-image matching performance.
Our approach achieves a substantial performance gain over the previous state-of-the-art by leveraging strong pretrained models, a prompting mechanism and a bifurcated design.
arXiv Detail & Related papers (2024-08-19T19:56:10Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent performance in terms of separation, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model (a hypothetical fusion sketch appears after this list).
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- End-to-End Neural Transformer Based Spoken Language Understanding [14.736425160859284]
Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals.
We introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots embedded in an audio signal.
Our end-to-end transformer SLU predicts the domains, intents and slots in the Fluent Speech Commands dataset with accuracy equal to 98.1 %, 99.6 %, and 99.6 %, respectively.
arXiv Detail & Related papers (2020-08-12T22:58:20Z)
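The "Efficiently Fusing Pretrained Acoustic and Linguistic Encoders" entry above names wav2vec2.0 and BERT but does not describe how the two encoders are combined. The sketch below is one plausible, hypothetical fusion scheme (cross-attention from acoustic frames over BERT token representations, followed by a CTC output head); the checkpoint names and vocabulary size are assumptions for illustration, not that paper's actual architecture.

```python
# Hypothetical fusion of a pre-trained acoustic encoder and a pre-trained
# linguistic encoder.  The cross-attention scheme, checkpoint names, and
# vocabulary size are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, Wav2Vec2Model

wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

cross_attention = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
ctc_head = nn.Linear(768, 32)                       # assumed character vocabulary

with torch.no_grad():
    speech = torch.randn(1, 16000)                  # one second of 16 kHz audio
    acoustic = wav2vec(speech).last_hidden_state    # (1, frames, 768)
    text = tokenizer("hello world", return_tensors="pt")
    linguistic = bert(**text).last_hidden_state     # (1, tokens, 768)

# Acoustic frames attend over the linguistic token representations; the fused
# sequence would be fed to a CTC loss during fine-tuning.
fused, _ = cross_attention(query=acoustic, key=linguistic, value=linguistic)
logits = ctc_head(fused)                            # (1, frames, vocab)
```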
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.