Prosodic segmentation for parsing spoken dialogue
- URL: http://arxiv.org/abs/2105.12667v1
- Date: Wed, 26 May 2021 16:30:16 GMT
- Title: Prosodic segmentation for parsing spoken dialogue
- Authors: Elizabeth Nielsen, Mark Steedman, Sharon Goldwater
- Abstract summary: Parsing spoken dialogue poses unique difficulties, including disfluencies and unmarked boundaries.
Previous work has shown that prosody can help with parsing disfluent speech.
We show that prosody can effectively replace gold standard SU boundaries.
- Score: 29.68201160277817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parsing spoken dialogue poses unique difficulties, including disfluencies and
unmarked boundaries between sentence-like units. Previous work has shown that
prosody can help with parsing disfluent speech (Tran et al. 2018), but has
assumed that the input to the parser is already segmented into sentence-like
units (SUs), which isn't true in existing speech applications. We investigate
how prosody affects a parser that receives an entire dialogue turn as input (a
turn-based model), instead of gold standard pre-segmented SUs (an SU-based
model). In experiments on the English Switchboard corpus, we find that when
using transcripts alone, the turn-based model has trouble segmenting SUs,
leading to worse parse performance than the SU-based model. However, prosody
can effectively replace gold standard SU boundaries: with prosody, the
turn-based model performs as well as the SU-based model (90.79 vs. 90.65 F1
score, respectively), despite performing two tasks (SU segmentation and
parsing) rather than one (parsing alone). Analysis shows that pitch and
intensity features are the most important for this corpus, since they allow the
model to correctly distinguish an SU boundary from a speech disfluency -- a
distinction that the model otherwise struggles to make.
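To make the two input settings in the abstract concrete, here is a minimal Python sketch. It is not the authors' code: the function name `turn_from_sus` and the toy utterances are hypothetical and chosen only for illustration. It joins gold SUs into a single turn and derives per-token boundary labels, which is the segmentation information a turn-based parser would have to recover itself.

```python
from typing import List, Tuple


def turn_from_sus(sus: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Concatenate SUs into one turn.

    Returns the turn's tokens and a parallel list of labels, where
    labels[i] == 1 if token i is the last token of an SU, else 0.
    (Hypothetical helper; the labelling scheme is an assumption.)
    """
    tokens: List[str] = []
    labels: List[int] = []
    for su in sus:
        if not su:  # skip empty SUs so the labels stay aligned with tokens
            continue
        tokens.extend(su)
        labels.extend([0] * (len(su) - 1) + [1])
    return tokens, labels


# Toy turn with three SUs; the repeated "i" imitates a disfluency.
sus = [["yeah"], ["i", "i", "think", "so"], ["that", "is", "right"]]
tokens, labels = turn_from_sus(sus)
print(tokens)
print(labels)
```

An SU-based model would receive each inner list separately, while a turn-based model would receive only `tokens`, without `labels`, and must separate true SU boundaries from disfluency interruption points such as the one after the first "i".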
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- What's Hard in English RST Parsing? Predictive Models for Error Analysis [16.927386793787463]
In this paper, we examine and model some of the factors associated with parsing difficulties in Rhetorical Structure Theory.
Our results show that as in shallow discourse parsing, the explicit/implicit distinction plays a role, but that long-distance dependencies are the main challenge.
Our final model is able to predict where errors will occur with an accuracy of 76.3% for the bottom-up and 76.6% for the top-down.
arXiv Detail & Related papers (2023-09-10T06:10:03Z)
- Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences [67.37544997614646]
We present the first study on unsupervised spoken constituency parsing.
The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees.
We show that accurate segmentation alone may be sufficient to parse spoken sentences accurately.
arXiv Detail & Related papers (2023-03-15T17:57:22Z)
- Prosodic features improve sentence segmentation and parsing [28.41406899452548]
We show the effect of prosody on parsing speech that isn't segmented into sentences.
We find prosody helps our model both with parsing and accurately identifying sentence boundaries.
arXiv Detail & Related papers (2023-02-23T17:03:36Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)
- Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.