Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded
Language from Percepts and Raw Speech
- URL: http://arxiv.org/abs/2112.13758v1
- Date: Mon, 27 Dec 2021 16:12:30 GMT
- Title: Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded
Language from Percepts and Raw Speech
- Authors: Gaoussou Youssouf Kebe, Luke E. Richards, Edward Raff, Francis
Ferraro, Cynthia Matuszek
- Abstract summary: Learning to understand grounded language, which connects natural language to percepts, is a critical research area.
In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs.
- Score: 26.076534338576234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to understand grounded language, which connects natural language to
percepts, is a critical research area. Prior work in grounded language
acquisition has focused primarily on textual inputs. In this work we
demonstrate the feasibility of performing grounded language acquisition on
paired visual percepts and raw speech inputs. This will allow interactions in
which language about novel tasks and environments is learned from end users,
reducing dependence on textual inputs and potentially mitigating the effects of
demographic bias found in widely available speech recognition systems. We
leverage recent work in self-supervised speech representation models and show
that learned representations of speech can make language grounding systems more
inclusive towards specific groups while maintaining or even increasing general
performance.
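
To make the described pipeline concrete, below is a minimal sketch of how a self-supervised speech encoder can be paired with visual percepts for grounded language learning. The wav2vec 2.0 checkpoint, projection sizes, mean-pooling, and the hinge-style grounding loss are illustrative assumptions, not the authors' exact architecture or objective.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class SpeechGrounder(nn.Module):
    """Projects raw-speech and visual percepts into a shared grounding space (illustrative sketch)."""

    def __init__(self, visual_dim=2048, embed_dim=256):
        super().__init__()
        # Assumed encoder: a pre-trained self-supervised wav2vec 2.0 model.
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.speech_proj = nn.Linear(self.speech_encoder.config.hidden_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)

    def forward(self, input_values, visual_features):
        # input_values: (batch, samples) raw waveform; visual_features: (batch, visual_dim).
        frames = self.speech_encoder(input_values).last_hidden_state   # (batch, T, hidden)
        speech_emb = self.speech_proj(frames.mean(dim=1))              # utterance-level vector
        visual_emb = self.visual_proj(visual_features)
        return speech_emb, visual_emb


def grounding_loss(speech_emb, visual_emb, margin=0.5):
    """Hinge loss: matched speech/percept pairs should score higher than mismatched ones (assumed objective)."""
    pos = torch.cosine_similarity(speech_emb, visual_emb)
    neg = torch.cosine_similarity(speech_emb, visual_emb.roll(shifts=1, dims=0))
    return torch.clamp(margin - pos + neg, min=0).mean()
```

At inference time, a heard description could be embedded with the speech branch and compared by cosine similarity against candidate object percepts; the actual model and training details are those reported in the paper.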
Related papers
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
- Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
- Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units.
We show that this embedding can be used to predict the pitch and duration of speech units in a target language.
We evaluate our approach on English and French speech signals and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation [2.28438857884398]
We study the so-called latent language hypothesis (LLH).
LLH connects linguistic representation learning to general predictive processing within and across sensory modalities.
We explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning.
arXiv Detail & Related papers (2021-09-29T05:49:46Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Neural Variational Learning for Grounded Language Acquisition [14.567067583556714]
We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms.
We show that this generative approach exhibits promising results in language grounding without pre-specifying visual categories, even in low-resource settings.
arXiv Detail & Related papers (2021-07-20T20:55:02Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been leveraged to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)