Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
- URL: http://arxiv.org/abs/2005.11777v1
- Date: Sun, 24 May 2020 15:27:56 GMT
- Title: Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
- Authors: Murong Ma, Haiwei Wu, Xuyang Wang, Lin Yang, Junjie Wang and Ming Li
- Abstract summary: We propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection.
We combine audio data in two languages for training instead of using only a single language.
- Score: 17.54377669932433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a deep convolutional neural network-based acoustic
word embedding system for code-switching query-by-example spoken term detection.
Unlike previous configurations, we combine audio data in two languages for
training instead of using only a single language. We transform the
acoustic features of keyword templates and searching content to
fixed-dimensional vectors and calculate the distances between keyword segments
and searching content segments obtained in a sliding manner. An auxiliary
variability-invariant loss is also applied to training data within the same
word but different speakers. This strategy is used to prevent the extractor
from encoding undesired speaker- or accent-related information into the
acoustic word embeddings. Experimental results show that our proposed system
produces promising search results in the code-switching test scenario. With an
increased number of templates and the variability-invariant loss, the search
performance is further enhanced.
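The sketch below (not the authors' released code) illustrates the two mechanisms described in the abstract under stated assumptions: a sliding-window search that compares a fixed-dimensional keyword-template embedding against embeddings of search-content segments, and a variability-invariant penalty that pulls together embeddings of the same word spoken by different speakers. The embedding extractor is abstracted as an `embed` callable, and the cosine distance, the mean-squared form of the loss, and all names are illustrative assumptions rather than details taken from the paper.

import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two fixed-dimensional embedding vectors.
    return 1.0 - float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def sliding_search(embed, query_feats, content_feats, win, hop):
    # embed: callable mapping a (frames x dims) acoustic feature matrix to a
    #        fixed-dimensional vector (the CNN extractor, assumed here).
    # query_feats: acoustic features of one keyword template.
    # content_feats: acoustic features of the utterance being searched.
    # win, hop: sliding-window length and shift, in frames.
    q = embed(query_feats)
    best_dist, best_start = np.inf, 0
    for start in range(0, max(1, content_feats.shape[0] - win + 1), hop):
        seg = content_feats[start:start + win]
        d = cosine_distance(q, embed(seg))
        if d < best_dist:
            best_dist, best_start = d, start
    return best_dist, best_start

def variability_invariant_loss(emb_a, emb_b):
    # Assumed form of the auxiliary loss: mean-squared distance between
    # embeddings of the same word produced by two different speakers.
    return float(np.mean((emb_a - emb_b) ** 2))

A lower best_dist indicates a likely occurrence of the keyword at that position; in practice a detection threshold would be tuned on development data.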
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique for learning audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting [23.627625026135505]
We propose a novel end-to-end user-defined keyword spotting method.
Our method compares input queries with an enrolled text keyword sequence.
We introduce the LibriPhrase dataset for efficiently training keyword spotting models.
arXiv Detail & Related papers (2022-06-30T16:40:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Acoustic Neighbor Embeddings [2.842794675894731]
This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings.
The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences.
The recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.
arXiv Detail & Related papers (2020-07-20T05:33:07Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.