Efficient Ensemble for Multimodal Punctuation Restoration using
Time-Delay Neural Network
- URL: http://arxiv.org/abs/2302.13376v2
- Date: Sat, 24 Feb 2024 07:02:38 GMT
- Title: Efficient Ensemble for Multimodal Punctuation Restoration using
Time-Delay Neural Network
- Authors: Xing Yi Liu and Homayoon Beigi
- Abstract summary: Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition.
We present EfficientPunct, an ensemble method with a multimodal time-delay neural network.
It outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters.
- Score: 1.006218778776515
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Punctuation restoration plays an essential role in the post-processing
procedure of automatic speech recognition, but model efficiency is a key
requirement for this task. To that end, we present EfficientPunct, an ensemble
method with a multimodal time-delay neural network that outperforms the current
best model by 1.0 F1 points, using less than a tenth of its inference network
parameters. We streamline a speech recognizer to efficiently output hidden
layer acoustic embeddings for punctuation restoration, as well as BERT to
extract meaningful text embeddings. By using forced alignment and temporal
convolutions, we eliminate the need for attention-based fusion, greatly
increasing computational efficiency and raising performance. EfficientPunct
sets a new state of the art with an ensemble that weights BERT's purely
language-based predictions slightly more than the multimodal network's
predictions. Our code is available at
https://github.com/lxy-peter/EfficientPunct.
Related papers
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural
Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT)
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z) - Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z) - Time-Domain Mapping Based Single-Channel Speech Separation With
Hierarchical Constraint Training [10.883458728718047]
Single-channel speech separation is required for multi-speaker speech recognition.
Recent deep learning-based approaches focused on time-domain audio separation net (TasNet)
We introduce attention augmented DPRNN (AttnAugDPRNN) which directly approximates the clean sources from the mixture for speech separation.
arXiv Detail & Related papers (2021-10-20T14:42:50Z) - Broadcasted Residual Learning for Efficient Keyword Spotting [7.335747584353902]
We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load.
We also propose a novel network architecture, Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning.
BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google speech command datasets v1 and v2, respectively.
arXiv Detail & Related papers (2021-06-08T06:55:39Z) - On Addressing Practical Challenges for RNN-Transduce [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - An Effective Contextual Language Modeling Framework for Speech
Summarization with Augmented Features [13.97006782398121]
Bidirectional Representations from Transformers (BERT) model was proposed and has achieved record-breaking success on many natural language processing tasks.
We explore the incorporation of confidence scores into sentence representations to see if such an attempt could help alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z) - WaveCRN: An Efficient Convolutional Recurrent Neural Network for
End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU)
In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.