Plug-and-Play Multilingual Few-shot Spoken Words Recognition
- URL: http://arxiv.org/abs/2305.03058v1
- Date: Wed, 3 May 2023 18:58:14 GMT
- Title: Plug-and-Play Multilingual Few-shot Spoken Words Recognition
- Authors: Aaqib Saeed and Vasileios Tsouvalas
- Abstract summary: We propose PLiX, a multilingual and plug-and-play keyword spotting system.
Our few-shot deep models are learned with millions of one-second audio clips across 20 languages.
We show that PLiX can generalize to novel spoken words given as few as just one support example.
- Score: 3.591566487849146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As technology advances and digital devices become prevalent, seamless
human-machine communication is increasingly gaining significance. The growing
adoption of mobile, wearable, and other Internet of Things (IoT) devices has
changed how we interact with these smart devices, making accurate spoken words
recognition a crucial component for effective interaction. However, building a
robust spoken words detection system that can handle novel keywords remains
challenging, especially for low-resource languages with limited training data.
Here, we propose PLiX, a multilingual and plug-and-play keyword spotting system
that leverages few-shot learning to harness massive real-world data and enable
the recognition of unseen spoken words at test-time. Our few-shot deep models
are learned with millions of one-second audio clips across 20 languages,
achieving state-of-the-art performance while being highly efficient. Extensive
evaluations show that PLiX can generalize to novel spoken words given as few as
just one support example and performs well on unseen languages out of the box.
We release models and inference code to serve as a foundation for future
research and voice-enabled user interface development for emerging devices.
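The plug-and-play workflow the abstract describes (enrolling a new keyword from one or a few support clips, then recognizing it at test time without retraining) is essentially metric-based few-shot classification. Below is a minimal sketch of that recipe, not the authors' exact method: the `encoder` function is a hypothetical stand-in for a pretrained few-shot model such as the released PLiX checkpoints (whose real API may differ). Support clips are embedded and averaged into per-keyword prototypes, and a query clip is assigned to the nearest prototype by cosine similarity.

```python
import numpy as np

def encoder(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a pretrained few-shot keyword encoder.
    Maps a one-second, 16 kHz waveform (shape [16000]) to an embedding."""
    rng = np.random.default_rng(abs(hash(waveform.tobytes())) % (2**32))
    return rng.standard_normal(256)  # placeholder embedding

def build_prototypes(support: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    """Average the L2-normalized embeddings of each keyword's support clips."""
    prototypes = {}
    for keyword, clips in support.items():
        embs = np.stack([encoder(c) for c in clips])
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)
        prototypes[keyword] = embs.mean(axis=0)
    return prototypes

def classify(query: np.ndarray, prototypes: dict[str, np.ndarray]) -> str:
    """Return the keyword whose prototype is most cosine-similar to the query."""
    q = encoder(query)
    q /= np.linalg.norm(q)
    scores = {k: float(q @ (p / np.linalg.norm(p))) for k, p in prototypes.items()}
    return max(scores, key=scores.get)

# Enroll two novel keywords from a single support example each (1-shot),
# then classify an unseen query clip.
support_set = {
    "hola": [np.random.randn(16000)],
    "merci": [np.random.randn(16000)],
}
prototypes = build_prototypes(support_set)
print(classify(np.random.randn(16000), prototypes))
```

Prototype matching keeps enrollment plug-and-play: adding a new keyword only requires embedding its support clips, with no gradient updates, which is consistent with the abstract's claim of recognizing unseen words from as few as one example.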
Related papers
- A Transformer-Based Multi-Stream Approach for Isolated Iranian Sign Language Recognition [0.0]
This research aims to recognize Iranian Sign Language words with the help of the latest deep learning tools such as transformers.
The dataset used includes 101 Iranian Sign Language words frequently used in academic environments such as universities.
arXiv Detail & Related papers (2024-06-27T06:54:25Z) - Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model- SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech
Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech
Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - Real-time low-resource phoneme recognition on edge devices [0.0]
This paper shows how to create and train models for speech recognition in any language.
It allows training models to recognize any language and deploying them on edge devices such as mobile phones or car displays for fast real-time speech recognition.
arXiv Detail & Related papers (2021-03-25T17:34:59Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize the individual languages and transfers this ability to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)