Classification of Spontaneous and Scripted Speech for Multilingual Audio
- URL: http://arxiv.org/abs/2412.11896v1
- Date: Mon, 16 Dec 2024 15:45:10 GMT
- Title: Classification of Spontaneous and Scripted Speech for Multilingual Audio
- Authors: Shahar Elisha, Andrew McDowell, Mariano Beguerisse-Díaz, Emmanouil Benetos,
- Abstract summary: Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research.
This paper addresses the challenge of building a classifier that generalises well across different formats and languages.
We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers.
- Score: 9.925703861731506
- License:
- Abstract: Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research. It can also improve recommendation systems and discovery experiences for media users through better segmentation of large recorded speech catalogues. This paper addresses the challenge of building a classifier that generalises well across different formats and languages. We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers, utilising a large, multilingual proprietary podcast dataset for training and validation. We break down the performance of each model across 11 language groups to evaluate cross-lingual biases. Our experimental analysis extends to publicly available datasets to assess the models' generalisability to non-podcast domains. Our results indicate that transformer-based models consistently outperform traditional feature-based techniques, achieving state-of-the-art performance in distinguishing between scripted and spontaneous speech across various languages.
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z) - Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z) - Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings [2.615008111842321]
We introduce an end-to-end scheme for topic segmentation using semantic speech encoders.
We propose a new benchmark for spoken news topic segmentation by utilizing a dataset featuring 1000 hours of publicly available recordings.
Our results demonstrate that while the traditional pipeline approach achieves a state-of-the-art $P_k$ score of 0.2431 for English, our end-to-end model delivers a competitive $P_k$ score of 0.2564.
arXiv Detail & Related papers (2024-09-10T05:24:36Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z) - Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z) - SPECTRUM: Speaker-Enhanced Pre-Training for Long Dialogue Summarization [48.284512017469524]
Multi-turn dialogues are characterized by their extended length and the presence of turn-taking conversations.
Traditional language models often overlook the distinct features of these dialogues by treating them as regular text.
We propose a speaker-enhanced pre-training method for long dialogue summarization.
arXiv Detail & Related papers (2024-01-31T04:50:00Z) - Leveraging Language Model Capabilities for Sound Event Detection [10.792576135806623]
We propose an end-to-end framework for understanding audio features while simultaneously generating sound event and their temporal location.
Specifically, we employ pretrained acoustic models to capture discriminative features across different categories and language models for autoregressive text generation.
arXiv Detail & Related papers (2023-08-22T15:59:06Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.