Text to speech synthesis
- URL: http://arxiv.org/abs/2401.13891v1
- Date: Thu, 25 Jan 2024 02:13:45 GMT
- Title: Text to speech synthesis
- Authors: Harini s, Manoj G M
- Abstract summary: Text-to-speech synthesis (TTS) is a technology that converts written text into spoken words.
This abstract explores the key aspects of TTS synthesis, encompassing its underlying technologies, applications, and implications for various sectors.
- Score: 0.27195102129095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-speech (TTS) synthesis is a technology that converts written text
into spoken words, enabling a natural and accessible means of communication.
This abstract explores the key aspects of TTS synthesis, encompassing its
underlying technologies, applications, and implications for various sectors.
The technology utilizes advanced algorithms and linguistic models to convert
textual information into lifelike speech, allowing for enhanced user
experiences in diverse contexts such as accessibility tools, navigation
systems, and virtual assistants. The abstract delves into the challenges and
advancements in TTS synthesis, including considerations for naturalness,
multilingual support, and emotional expression in synthesized speech.
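To make the flow described in the abstract concrete, the sketch below walks through the typical pipeline that the surveys listed later also describe: a text front-end that normalizes the input, an acoustic model that predicts intermediate speech features, and a vocoder that renders the waveform. It is a minimal illustration only; every function name, shape, and constant is a hypothetical placeholder, not code from this paper or any particular TTS library.

```python
# Minimal, illustrative sketch of a classic TTS pipeline:
# text front-end -> acoustic model -> vocoder.
# All names and shapes are hypothetical placeholders.

import re

def normalize_text(text: str) -> str:
    """Toy text front-end: lowercase and tidy whitespace/symbols."""
    text = text.lower().replace("&", " and ")
    return re.sub(r"\s+", " ", text).strip()

def text_to_phonemes(text: str) -> list[str]:
    """Placeholder grapheme-to-phoneme step; real systems use a G2P model or lexicon."""
    return list(text)  # characters stand in for phonemes in this sketch

def acoustic_model(phonemes: list[str]) -> list[list[float]]:
    """Placeholder acoustic model mapping symbols to mel-spectrogram frames."""
    return [[0.0] * 80 for _ in phonemes]  # one 80-bin frame per symbol

def vocoder(mel_frames: list[list[float]]) -> list[float]:
    """Placeholder vocoder turning mel frames into a waveform."""
    return [0.0] * (len(mel_frames) * 256)  # e.g. 256 samples per frame

def synthesize(text: str) -> list[float]:
    """End-to-end sketch of the three-stage pipeline."""
    phonemes = text_to_phonemes(normalize_text(text))
    return vocoder(acoustic_model(phonemes))

if __name__ == "__main__":
    audio = synthesize("Text-to-speech converts written text into spoken words.")
    print(f"Synthesized {len(audio)} audio samples")
```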
Related papers
- Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey [8.476093391815766]
Text-to-speech (TTS) is a prominent research area that aims to generate natural-sounding human speech from text.
With the increasing industrial demand, TTS technologies have evolved beyond human-like speech to enabling controllable speech generation.
In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts.
arXiv Detail & Related papers (2024-12-09T15:50:25Z) - Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis [3.8251125989631674]
We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system.
It derives the conveyed emotion from the text input and synthesizes audio that reflects that emotion and the speaker's characteristics, producing natural and expressive speech.
Our system showcases competitive inference time performance when benchmarked against state-of-the-art TTS models.
arXiv Detail & Related papers (2024-10-24T23:18:02Z) - UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - A review-based study on different Text-to-Speech technologies [0.0]
The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS.
The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications.
arXiv Detail & Related papers (2023-12-17T20:07:23Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment (see the vector-quantization sketch after this list).
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders, as well as several advanced topics such as fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
arXiv Detail & Related papers (2021-06-29T16:50:51Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles present in the historical speech.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z) - Review of end-to-end speech synthesis technology based on deep learning [10.748200013505882]
This review focuses on deep learning-based end-to-end speech synthesis technology.
It mainly consists of three modules: text front-end, acoustic model, and vocoder.
This paper summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks.
arXiv Detail & Related papers (2021-04-20T14:24:05Z)
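As a companion to the vector-quantized TTS entry above, the following sketch illustrates the codebook-lookup step such systems rely on: continuous speech feature frames are replaced by the indices of their nearest codebook vectors, so the model can predict discrete codes rather than raw features. The array shapes, codebook size, and function name are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of nearest-neighbour vector quantization of speech features.
# Shapes and names are illustrative assumptions, not code from the cited paper.

import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map each frame (T, D) to its nearest codebook entry (K, D).

    Returns the discrete code indices (T,) and the dequantized frames (T, D).
    """
    # Pairwise squared distances between frames and codebook vectors: (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)      # index of the closest codebook vector per frame
    reconstructed = codebook[codes]   # approximation of the frames from the codebook
    return codes, reconstructed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 64))     # 100 feature frames, 64 dims each
    codebook = rng.normal(size=(256, 64))   # 256 learned code vectors
    codes, recon = quantize(frames, codebook)
    print(codes[:10], float(((frames - recon) ** 2).mean()))
```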