Identification of primary and collateral tracks in stuttered speech
- URL: http://arxiv.org/abs/2003.01018v1
- Date: Mon, 2 Mar 2020 16:50:33 GMT
- Title: Identification of primary and collateral tracks in stuttered speech
- Authors: Rachid Riad, Anne-Catherine Bachoud-Lévi, Frank Rudzicz, Emmanuel Dupoux
- Abstract summary: We introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspective.
We present a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews.
We show experimentally that new audio features inspired by the word-based span features outperformed the baselines for speech-based predictions.
- Score: 22.921077940732
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Disfluent speech has previously been addressed from two main perspectives:
the clinical perspective, focusing on diagnosis, and the Natural Language
Processing (NLP) perspective, aiming to model these events and detect them
for downstream tasks. In addition, previous works often used different metrics
depending on whether the input features are text or speech, making it difficult
to compare the different contributions. Here, we introduce a new evaluation
framework for disfluency detection inspired by the clinical and NLP perspectives,
together with the theory of performance from \cite{clark1996using}, which
distinguishes between primary and collateral tracks. We introduce a novel
forced-aligned disfluency dataset from a corpus of semi-directed interviews,
and present baseline results directly comparing the performance of text-based
features (word and span information) and speech-based features (acoustic-prosodic
information). Finally, we introduce new audio features inspired by the
word-based span features. We show experimentally that using these features
outperformed the baselines for speech-based predictions on the present dataset.
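The abstract contrasts text-based word and span features with acoustic-prosodic ones. As a rough illustration of what word-based span features can look like (this is a generic sketch, not the paper's actual feature set), a common signal for disfluency is whether a word repeats within a short window:

```python
# Illustrative sketch (NOT the paper's exact features): simple word-based
# span features for disfluency detection, marking whether each token
# repeats within a short lookahead/lookbehind window.

def span_features(tokens, window=3):
    """For each token, return (repeats_ahead, repeats_behind): whether the
    same word occurs within `window` tokens after / before it."""
    feats = []
    for i, tok in enumerate(tokens):
        ahead = tok in tokens[i + 1 : i + 1 + window]
        behind = tok in tokens[max(0, i - window) : i]
        feats.append((ahead, behind))
    return feats

# A stuttered utterance: repeated "I" and "to" are flagged.
print(span_features("I I want to to go".split()))
```

The paper's contribution is to carry this kind of span information over to the audio side; the sketch above only shows the text-based starting point.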
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z) - Audio-Visual Neural Syntax Acquisition [91.14892278795892]
We study phrase structure induction from visually-grounded speech.
We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text.
arXiv Detail & Related papers (2023-10-11T16:54:57Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis [3.691712391306624]
We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases to capture diverse prosodic representations.
We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations to predict the phoneme-level posterior latents extracted in the previous step.
arXiv Detail & Related papers (2022-11-01T15:17:25Z) - Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel fully unsupervised learning method for discovering speech units.
The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
arXiv Detail & Related papers (2022-01-18T07:31:59Z) - Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z) - Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
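The core operation behind such quantized latent spaces is a nearest-neighbour codebook lookup. A minimal generic sketch (not this paper's model; the codebook values and dimensions below are made up for illustration):

```python
import numpy as np

# Generic vector-quantization sketch: each continuous latent vector is
# replaced by its nearest codebook entry under L2 distance.

def quantize(latents, codebook):
    """Return (quantized_vectors, codebook_indices) for a batch of latents."""
    # Pairwise distances via broadcasting: (num_latents, codebook_size)
    d = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)          # nearest codebook entry per latent
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
latents = np.array([[0.1, -0.2], [0.9, 1.2]])   # toy continuous latents
q, idx = quantize(latents, codebook)
print(idx)  # → [0 1]
```

Enforcing such a discrete bottleneck over the linguistic embedding is what lets the same latent space serve both TTS and voice conversion in the paper's setup.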
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - Deep Learning for Prominence Detection in Children's Read Speech [13.041607703862724]
We consider a labeled dataset of children's reading recordings for the speaker-independent detection of prominent words.
A previous well-tuned random forest ensemble predictor is replaced by an RNN sequence to exploit potential context dependency.
Deep learning is applied to obtain word-level features from low-level acoustic contours of fundamental frequency, intensity and spectral shape.
arXiv Detail & Related papers (2021-04-12T14:15:08Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews [9.728371067160941]
We train end-to-end neural network architectures to adapt to each task and evaluate each approach under the same metric.
Results do not depend on the demographics of the interviewee, highlighting the clinical relevance of our methods.
arXiv Detail & Related papers (2020-10-30T09:07:37Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.