EURO: ESPnet Unsupervised ASR Open-source Toolkit
- URL: http://arxiv.org/abs/2211.17196v3
- Date: Sun, 21 May 2023 00:56:05 GMT
- Title: EURO: ESPnet Unsupervised ASR Open-source Toolkit
- Authors: Dongji Gao and Jiatong Shi and Shun-Po Chuang and Leibny Paola Garcia
and Hung-yi Lee and Shinji Watanabe and Sanjeev Khudanpur
- Abstract summary: This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR)
EURO adopts the state-of-the-art UASR learning method introduced by the Wav2vec-U, which leverages self-supervised speech representations and adversarial training.
Three mainstream self-supervised models demonstrate the toolkit's effectiveness and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech datasets.
- Score: 92.57256779851095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO),
an end-to-end open-source toolkit for unsupervised automatic speech recognition
(UASR). EURO adopts the state-of-the-art UASR learning method introduced by the
Wav2vec-U, originally implemented at FAIRSEQ, which leverages self-supervised
speech representations and adversarial training. In addition to wav2vec2, EURO
extends the functionality and promotes reproducibility for UASR tasks by
integrating S3PRL and k2, resulting in flexible frontends from 27
self-supervised models and various graph-based decoding strategies. EURO is
implemented in ESPnet and follows its unified pipeline to provide UASR recipes
with a complete setup. This improves the pipeline's efficiency and allows EURO
to be easily applied to existing datasets in ESPnet. Extensive experiments on
three mainstream self-supervised models demonstrate the toolkit's effectiveness
and achieve state-of-the-art UASR performance on TIMIT and LibriSpeech
datasets. EURO will be publicly available at https://github.com/espnet/espnet,
aiming to promote this exciting and emerging research area based on UASR
through open-source activity.
Related papers
- Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning [4.051777802443125]
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations.
We introduce Gradient SAEs, which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function.
We find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts.
arXiv Detail & Related papers (2024-11-15T18:03:52Z) - Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - The Devil is in the Task: Exploiting Reciprocal Appearance-Localization
Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors in the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z) - Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS)
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z) - Multilingual Speech Recognition using Knowledge Transfer across Learning
Processes [15.927513451432946]
Experimental results reveal the best pre-training strategy resulting in 3.55% relative reduction in overall WER.
A combination of LEAP and SSL yields 3.51% relative reduction in overall WER when using language ID.
arXiv Detail & Related papers (2021-10-15T07:50:27Z) - Self-play Learning Strategies for Resource Assignment in Open-RAN
Networks [3.763743638851161]
Open Radio Access Network (ORAN) is being developed with an aim to democratise access and lower the cost of future mobile data networks.
In ORAN, network functionality is dis-aggregated into remote units (RUs), distributed units (DUs) and central units (CUs)
arXiv Detail & Related papers (2021-03-03T19:31:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.