How Generative Spoken Language Modeling Encodes Noisy Speech:
Investigation from Phonetics to Syntactics
- URL: http://arxiv.org/abs/2306.00697v1
- Date: Thu, 1 Jun 2023 14:07:19 GMT
- Title: How Generative Spoken Language Modeling Encodes Noisy Speech:
Investigation from Phonetics to Syntactics
- Authors: Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki,
Detai Xin, Hiroshi Saruwatari
- Abstract summary: Generative spoken language modeling (GSLM) uses symbols learned from data, rather than phonemes, for speech analysis and synthesis.
This paper reports findings on GSLM's encoding and decoding effectiveness at the spoken-language and speech levels.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We examine the speech modeling potential of generative spoken language
modeling (GSLM), which involves using learned symbols derived from data rather
than phonemes for speech analysis and synthesis. Since GSLM facilitates
textless spoken language processing, exploring its effectiveness is critical
for paving the way for novel paradigms in spoken-language processing. This
paper presents the findings of GSLM's encoding and decoding effectiveness at
the spoken-language and speech levels. Through speech resynthesis experiments,
we found that resynthesis errors occur at levels ranging from phonology
to syntactics, and that GSLM frequently resynthesizes natural but content-altered
speech.
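The resynthesis experiment described above can be pictured with a small sketch. The paper's actual pipeline (discrete unit discovery, a unit language model, and a unit-to-waveform decoder) is not reproduced here: `encode_to_units`, `units_to_speech`, and `transcribe` are hypothetical placeholders, and word error rate between transcripts of the original and resynthesized speech is used only as one plausible proxy for the "natural but content-altered" resynthesis the authors report.

```python
# Minimal sketch of a GSLM-style resynthesis check (illustrative only).
# `encode_to_units`, `units_to_speech`, and `transcribe` are hypothetical
# stand-ins for a discrete speech encoder, a unit-to-waveform decoder, and
# an ASR system; they are not part of the paper's released code.

def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)


def resynthesis_content_error(wav, transcript, encode_to_units, units_to_speech, transcribe):
    """Resynthesize speech through discrete units and measure how much the
    linguistic content changed, using ASR transcripts and WER as a proxy."""
    units = encode_to_units(wav)        # waveform -> discrete unit sequence
    resynth = units_to_speech(units)    # discrete units -> waveform
    hyp = transcribe(resynth).lower().split()
    ref = transcript.lower().split()
    return word_error_rate(ref, hyp)
```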
Related papers
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens (arXiv 2024-07-07)
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
Encoding of lexical tone in self-supervised models of spoken language (arXiv 2024-03-25)
This paper aims to analyze the tone encoding capabilities of Spoken Language Models (SLMs).
We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages.
We find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies.
An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis (arXiv 2024-03-19)
Speech language models (LMs) are promising for high-quality speech synthesis through in-context learning.
We study how the synthesized audio is controlled by the prompt and content.
Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation (arXiv 2024-02-08)
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning (arXiv 2023-11-07)
This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech.
The approach involves fine-tuning a multi-speaker TTS model to work with child speech.
We conducted an objective assessment that showed a significant correlation between real and synthetic child voices.
Homophone Disambiguation Reveals Patterns of Context Mixing in Speech Transformers (arXiv 2023-10-15)
We investigate how measures of 'context-mixing' developed for text models can be adapted and applied to models of spoken language.
We identify a linguistic phenomenon that is ideal for such a case study: homophony in French.
Our findings reveal that representations in encoder-only models effectively incorporate contextual cues to identify the correct transcription, whereas encoders in encoder-decoder models mainly relegate the task of capturing contextual dependencies to decoder modules.
Toward Joint Language Modeling for Speech Units and Text (arXiv 2023-10-12)
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on SLU tasks.
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis (arXiv 2023-08-10)
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
On decoder-only architecture for speech-to-text and large language model integration (arXiv 2023-07-08)
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units (arXiv 2021-10-31)
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding (arXiv 2020-10-05)
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.