On the Role of Style in Parsing Speech with Neural Models
- URL: http://arxiv.org/abs/2010.04288v1
- Date: Thu, 8 Oct 2020 22:44:19 GMT
- Title: On the Role of Style in Parsing Speech with Neural Models
- Authors: Trang Tran, Jiahong Yuan, Yang Liu, Mari Ostendorf
- Abstract summary: We show that neural approaches facilitate using written text to improve parsing of spontaneous speech.
We find an asymmetric degradation from the read vs. spontaneous style mismatch.
- Score: 25.442727974788255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The differences between written text and conversational speech are substantial;
previous parsers trained on treebanked text have given very poor results on
spontaneous speech. For spoken language, the mismatch in style also extends to
prosodic cues, though it is less well understood. This paper re-examines the
use of written text in parsing speech in the context of recent advances in
neural language processing. We show that neural approaches facilitate using
written text to improve parsing of spontaneous speech, and that prosody further
improves over this state-of-the-art result. Further, we find an asymmetric
degradation from the read vs. spontaneous style mismatch, with spontaneous speech
more generally useful for training parsers.
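A concrete way to picture the prosody integration studied here: frame-level prosodic features are concatenated with word embeddings before the parser's sentence encoder. A minimal sketch under that assumption (the feature set, dimensions, and BiLSTM encoder are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ProsodyAwareEncoder(nn.Module):
    """Fuse word embeddings with word-aligned prosodic features
    (hypothetical: pause duration, f0, energy) before a BiLSTM encoder."""
    def __init__(self, vocab_size=10_000, word_dim=300, prosody_dim=3, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.encoder = nn.LSTM(word_dim + prosody_dim, hidden,
                               batch_first=True, bidirectional=True)

    def forward(self, word_ids, prosody):
        # word_ids: (batch, seq);  prosody: (batch, seq, prosody_dim)
        x = torch.cat([self.embed(word_ids), prosody], dim=-1)
        out, _ = self.encoder(x)   # (batch, seq, 2 * hidden)
        return out                 # a span/chart parser head would consume this

# toy usage: 2 sentences of 5 words, 3 prosodic features per word
enc = ProsodyAwareEncoder()
h = enc(torch.randint(0, 10_000, (2, 5)), torch.randn(2, 5, 3))
print(h.shape)  # torch.Size([2, 5, 512])
```

The point is only that prosody enters as extra input dimensions; the parsing head on top is unchanged, which is what lets written treebank text and spoken data share the same model.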
Related papers
- Continuous Speech Tokenizer in Text To Speech [27.057221389827735]
We propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens.
Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS).
This enhancement is attributed to the continuous speech tokenizer's better preservation of information across both low and high frequencies.
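The contrast with discrete tokenizers is easiest to see in code: a conventional pipeline would vector-quantize the downsampled frames against a codebook, while a continuous tokenizer skips that lossy rounding step. A minimal sketch with illustrative dimensions, not the paper's model:

```python
import torch
import torch.nn as nn

class ContinuousTokenizer(nn.Module):
    """Downsample audio features into 'tokens' but keep them continuous:
    no codebook lookup, so fine spectral detail is not rounded away."""
    def __init__(self, in_dim=80, token_dim=256, stride=4):
        super().__init__()
        self.down = nn.Conv1d(in_dim, token_dim, kernel_size=stride, stride=stride)

    def forward(self, feats):                 # feats: (batch, time, in_dim)
        x = self.down(feats.transpose(1, 2))  # (batch, token_dim, time // stride)
        return x.transpose(1, 2)              # continuous tokens for the speech LM

tok = ContinuousTokenizer()
print(tok(torch.randn(2, 100, 80)).shape)  # torch.Size([2, 25, 256])
```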
arXiv Detail & Related papers (2024-10-22T15:02:37Z)
- Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach [14.5696754689252]
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible.
We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations.
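The recipe amounts to attaching a frame-level phoneme classifier to a pretrained speech encoder and fine-tuning with cross-entropy. A minimal sketch, with a stand-in LSTM where the pretrained encoder would go (the phone inventory size and dimensions are hypothetical):

```python
import torch
import torch.nn as nn

N_PHONEMES = 40  # hypothetical phone inventory size

encoder = nn.LSTM(80, 256, batch_first=True)  # stand-in for a pretrained speech encoder
head = nn.Linear(256, N_PHONEMES)             # phoneme classification head
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

feats = torch.randn(4, 100, 80)                   # (batch, frames, feat_dim)
labels = torch.randint(0, N_PHONEMES, (4, 100))   # frame-level phone labels

opt.zero_grad()
hidden, _ = encoder(feats)
loss = nn.functional.cross_entropy(head(hidden).reshape(-1, N_PHONEMES),
                                   labels.reshape(-1))
loss.backward()
opt.step()
```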
arXiv Detail & Related papers (2024-09-16T10:29:15Z) - Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author.
In this paper, we explore the attribution of transcribed speech, which poses novel challenges.
We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
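At its simplest, verification scores the similarity of two style representations against a threshold. A toy illustration using character n-gram counts as a crude stand-in for a learned authorship embedding (the threshold is arbitrary):

```python
from collections import Counter
from math import sqrt

def style_vector(text, n=3):
    """Character n-gram counts as a crude stand-in for a style embedding."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def same_author(a, b, threshold=0.3):
    va, vb = style_vector(a), style_vector(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm >= threshold  # cosine similarity vs. a tuned threshold

print(same_author("well i mean you know", "you know i mean well"))
```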
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
- Audio-Visual Neural Syntax Acquisition [91.14892278795892]
We study phrase structure induction from visually-grounded speech.
We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text.
arXiv Detail & Related papers (2023-10-11T16:54:57Z)
- Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis [53.511443791260206]
We propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels.
In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech.
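The loop follows the standard self-training pattern: a detector trained on the small labeled set pseudo-labels unlabeled speech, and confident predictions are folded back into training. A schematic sketch of that pattern on toy features (the classifier, features, and threshold are placeholders, not the paper's models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_lab = np.random.randn(50, 8)   # toy acoustic/text features (hypothetical)
y_lab = np.repeat([0, 1], 25)    # binary spontaneous-behavior labels
X_unl = np.random.randn(200, 8)  # unlabeled pool

clf = LogisticRegression().fit(X_lab, y_lab)
for _ in range(3):                         # a few self-training rounds
    if len(X_unl) == 0:
        break
    proba = clf.predict_proba(X_unl)
    confident = proba.max(axis=1) >= 0.9   # keep only confident pseudo-labels
    X_lab = np.vstack([X_lab, X_unl[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unl = X_unl[~confident]
    clf = LogisticRegression().fit(X_lab, y_lab)
print(len(X_lab), "training examples after pseudo-labeling")
```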
arXiv Detail & Related papers (2023-08-31T09:50:33Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
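Memory-cached recurrence here is in the spirit of Transformer-XL segment caching: hidden states from already-encoded sentences are kept (detached from the gradient) and attended over while encoding the current sentence. A minimal sketch of that cache step, not ContextSpeech's exact mechanism:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
cache = None  # hidden states of previously encoded sentences

def encode_sentence(x):
    """Attend over [cached context; current sentence] so every sentence sees
    paragraph-level history; the cache itself carries no gradient."""
    global cache
    ctx = x if cache is None else torch.cat([cache, x], dim=1)
    out, _ = attn(x, ctx, ctx)     # queries come from the current sentence only
    cache = ctx.detach()[:, -64:]  # keep a bounded window of past states
    return out

for _ in range(3):                 # three consecutive sentences of a paragraph
    y = encode_sentence(torch.randn(1, 20, 256))
print(y.shape)                     # torch.Size([1, 20, 256])
```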
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
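The enabling trick for unpaired pre-training is a shared id space: discretized speech units are offset past the text subword range so one embedding table serves both streams. A schematic sketch with hypothetical vocabulary sizes:

```python
TEXT_VOCAB = 10_000   # hypothetical text subword vocabulary size
SPEECH_UNITS = 500    # hypothetical number of clustered speech units

def speech_unit_to_id(unit):
    """Map speech units into the id range just past the text subwords."""
    return TEXT_VOCAB + unit

def to_joint_ids(tokens, modality):
    """One id space lets unpaired speech and text share an embedding table."""
    if modality == "text":
        return tokens  # already subword ids in [0, TEXT_VOCAB)
    return [speech_unit_to_id(u) for u in tokens]

print(to_joint_ids([3, 41, 7], "speech"))  # [10003, 10041, 10007]
```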
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- Transcribing Natural Languages for The Deaf via Neural Editing Programs [84.0592111546958]
We study the task of glossification, whose aim is to transcribe natural spoken language sentences into ordered sign language glosses for the Deaf (hard-of-hearing) community.
Previous sequence-to-sequence language models often fail to capture the rich connections between the two distinct languages, leading to unsatisfactory transcriptions.
We observe that, despite their different grammars, glosses effectively simplify sentences for ease of deaf communication while sharing a large portion of their vocabulary with the source sentences.
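An editing-program formulation exploits exactly that shared vocabulary: instead of generating glosses from scratch, the model predicts a short program of edits over the spoken sentence. A toy interpreter for such programs (the COPY/DEL/SUB operation set and the example are illustrative, not the paper's):

```python
def apply_edit_program(words, program):
    """Execute a left-to-right edit program over the source sentence."""
    out, i = [], 0
    for op, *arg in program:
        if op == "COPY":      # keep the next source word
            out.append(words[i]); i += 1
        elif op == "DEL":     # drop the next source word
            i += 1
        elif op == "SUB":     # replace the next source word with a gloss
            out.append(arg[0]); i += 1
    return out

sentence = "do you want to go to the store".split()
program = [("COPY",), ("COPY",), ("COPY",), ("DEL",), ("SUB", "GO"),
           ("DEL",), ("DEL",), ("SUB", "STORE")]
print(apply_edit_program(sentence, program))  # ['do', 'you', 'want', 'GO', 'STORE']
```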
arXiv Detail & Related papers (2021-12-17T16:21:49Z)
- Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training [40.71155396456831]
Simultaneous speech-to-speech translation is widely useful but extremely challenging.
It needs to generate target-language speech concurrently with the source-language speech, with only a few seconds' delay.
Current approaches accumulate latencies progressively when the speaker talks faster, and introduce unnatural pauses when the speaker talks slower.
We propose Self-Adaptive Translation (SAT), which flexibly adjusts the length of translations to accommodate different source speech rates.
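The heart of the idea is tying the target length budget to the observed source speech rate, so fast speakers get more compressed translations and slow speakers do not force unnatural padding. A toy sketch of such a budget rule (the constants and compression heuristic are illustrative, not the paper's SAT training objective):

```python
def target_length_budget(src_tokens, src_seconds, tgt_rate=3.0, max_lag=2.0):
    """Words the target side can speak before falling more than `max_lag`
    seconds behind, given how fast the source speaker is talking."""
    src_rate = src_tokens / src_seconds          # observed source speech rate
    budget = tgt_rate * (src_seconds + max_lag)  # words speakable in that window
    # faster sources leave less room: compress the translation accordingly
    compression = min(1.0, tgt_rate / src_rate)
    return int(budget * compression)

print(target_length_budget(src_tokens=16, src_seconds=4.0))  # fast speaker: 13 words
print(target_length_budget(src_tokens=8, src_seconds=4.0))   # slower speaker: 18 words
```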
arXiv Detail & Related papers (2020-10-20T06:02:15Z)