Unsupervised Pattern Discovery from Thematic Speech Archives Based on
Multilingual Bottleneck Features
- URL: http://arxiv.org/abs/2011.01986v1
- Date: Tue, 3 Nov 2020 20:06:48 GMT
- Title: Unsupervised Pattern Discovery from Thematic Speech Archives Based on
Multilingual Bottleneck Features
- Authors: Man-Ling Sung and Siyuan Feng and Tan Lee
- Abstract summary: We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
- Score: 41.951988293049205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The present study tackles the problem of automatically discovering spoken
keywords from untranscribed audio archives without requiring word-by-word
speech transcription by automatic speech recognition (ASR) technology. The
problem is of practical significance in many applications of speech analytics,
including those concerning low-resource languages and large amounts of
multilingual and multi-genre data. We propose a two-stage approach, which
comprises unsupervised acoustic modeling and decoding, followed by pattern
mining in acoustic unit sequences. The whole process starts by deriving and
modeling a set of subword-level speech units with untranscribed data. With the
acoustic models trained in this unsupervised manner, a given audio archive is represented by
a pseudo transcription, from which spoken keywords can be discovered by string
mining algorithms. For unsupervised acoustic modeling, a deep neural network
trained by multilingual speech corpora is used to generate speech segmentation
and compute bottleneck features for segment clustering. Experimental results
show that the proposed system is able to effectively extract topic-related
words and phrases from the lecture recordings on MIT OpenCourseWare.
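To make the two-stage idea concrete, here is a minimal sketch assuming segment-level bottleneck features have already been extracted. The use of k-means for segment clustering and plain n-gram counting for pattern mining are simplifying assumptions for illustration, not the authors' exact algorithms (the paper derives bottleneck features from a multilingual DNN and applies string mining to the decoded unit sequences).
```python
# Sketch of the two-stage pipeline: (1) cluster segment-level bottleneck
# features into pseudo acoustic units, (2) mine recurring unit n-grams as
# keyword candidates. K-means and n-gram counting are illustrative
# assumptions, not the authors' exact system.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def pseudo_transcribe(segment_features, n_units=10, seed=0):
    """Map each speech segment (one bottleneck-feature vector per segment)
    to a pseudo acoustic-unit label via k-means clustering."""
    features = np.vstack(segment_features)            # (num_segments, feat_dim)
    km = KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(features)
    return km.labels_.tolist()                         # e.g. [3, 3, 7, 1, ...]


def mine_patterns(unit_sequence, min_len=3, max_len=6, min_count=2):
    """Count recurring unit n-grams; frequent ones are keyword candidates."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(unit_sequence) - n + 1):
            counts[tuple(unit_sequence[i:i + n])] += 1
    return [(ngram, c) for ngram, c in counts.most_common() if c >= min_count]


# Synthetic stand-in for real bottleneck features: 500 segments, 40-dim each.
rng = np.random.default_rng(0)
segments = [rng.normal(size=40) for _ in range(500)]
units = pseudo_transcribe(segments)
for pattern, count in mine_patterns(units)[:5]:
    print(pattern, count)
```
In the actual system, the recurring patterns would be matched back to their time stamps in the audio archive to recover the spoken keywords.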
Related papers
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- Multilingual acoustic word embeddings for zero-resource languages [1.5229257192293204]
It specifically uses acoustic word embeddings (AWEs) -- fixed-dimensional representations of variable-duration speech segments.
The study introduces a new neural network that outperforms existing AWE models on zero-resource languages.
AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts.
arXiv Detail & Related papers (2024-01-19T08:02:37Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length (a minimal sketch of de-duplication appears after this list).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, which categorizes the representations into a small number of phoneme-like units, is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
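Several of the entries above build on discrete speech units. As a small illustration of the de-duplication step mentioned in the comparative-study summary, the sketch below collapses runs of repeated unit labels; it is an assumption-level example, not code from any of the listed papers.
```python
# Illustrative only: collapse consecutive repeats of discrete speech unit
# labels ("de-duplication"); subword modeling (e.g. BPE-style merging of
# frequent unit pairs) could shorten the sequence further.
from itertools import groupby


def deduplicate(units):
    """[7, 7, 7, 2, 2, 9, 9] -> [7, 2, 9]: keep one label per run."""
    return [label for label, _ in groupby(units)]


print(deduplicate([7, 7, 7, 2, 2, 9, 9, 9, 7]))  # [7, 2, 9, 7]
```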
This list is automatically generated from the titles and abstracts of the papers on this site.