Affective social anthropomorphic intelligent system
- URL: http://arxiv.org/abs/2304.11046v1
- Date: Wed, 19 Apr 2023 18:24:57 GMT
- Title: Affective social anthropomorphic intelligent system
- Authors: Md. Adyelullahil Mamun, Hasnat Md. Abdullah, Md. Golam Rabiul Alam, Muhammad Mehedi Hassan and Md. Zia Uddin
- Abstract summary: This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality.
A voice style transfer method is also proposed to map the attributes of a specific emotion.
- Score: 1.7849339006560665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human conversational styles are characterized by sense of humor,
personality, and tone of voice. These characteristics have become essential
for conversational intelligent virtual assistants. However, most
state-of-the-art intelligent virtual assistants (IVAs) fail to interpret the
affective semantics of human voices. This research proposes an
anthropomorphic intelligent system that can hold a proper human-like
conversation with emotion and personality. A voice style transfer method is
also proposed to map the attributes of a specific emotion. Initially,
frequency-domain data (a Mel-spectrogram) are created by converting the
temporal audio waveform, which comprises discrete patterns for audio features
such as notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder
is used to predict seven different affective states from the voice. The voice
is also fed in parallel to DeepSpeech, an RNN model that generates the text
transcription from the spectrogram. The transcribed text is then passed to a
multi-domain conversational agent that uses Blended Skill Talk, a
transformer-based retrieve-and-generate strategy, and beam-search decoding to
produce an appropriate textual response. For voice synthesis and style
transfer, the system learns an invertible mapping of data to a latent space
that can be manipulated and generates each Mel-spectrogram frame conditioned
on previous Mel-spectrogram frames. Finally, the waveform is generated from
the spectrogram using WaveGlow. The outcomes of the studies conducted on the
individual models were promising. Furthermore, users who interacted with the
system provided positive feedback, demonstrating the system's effectiveness.
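
As a rough illustration of the audio front end described in the abstract, the sketch below computes a log-Mel-spectrogram with librosa and passes it to a minimal parallel CNN / Transformer-encoder classifier over seven affective states. The layer sizes, sample rate, fusion scheme, and the file name sample.wav are illustrative assumptions; the paper's actual collateral CNN-Transformer-Encoder architecture is not specified here.

```python
# Hypothetical sketch, not the authors' implementation: Mel-spectrogram
# extraction plus a parallel ("collateral") CNN / Transformer-encoder
# classifier for seven affective states. All hyperparameters are assumed.
import librosa
import torch
import torch.nn as nn


def waveform_to_mel(path, sr=22050, n_mels=80):
    """Load an audio file and convert it to a log-Mel-spectrogram (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)


class CollateralCNNTransformer(nn.Module):
    """Two parallel branches over Mel frames, fused for 7-way emotion prediction."""

    def __init__(self, n_mels=80, d_model=128, n_classes=7):
        super().__init__()
        # CNN branch: local spectro-temporal patterns, pooled to one vector.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Transformer branch: frame-wise projection + self-attention over time.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, mel):                              # mel: (batch, n_mels, frames)
        cnn_feat = self.cnn(mel).squeeze(-1)             # (batch, d_model)
        tokens = self.proj(mel.transpose(1, 2))          # (batch, frames, d_model)
        trans_feat = self.transformer(tokens).mean(dim=1)
        return self.head(torch.cat([cnn_feat, trans_feat], dim=-1))


# Example usage on a hypothetical recording "sample.wav".
mel = torch.from_numpy(waveform_to_mel("sample.wav")).float().unsqueeze(0)
logits = CollateralCNNTransformer()(mel)
predicted_state = logits.argmax(dim=-1)  # index into the seven affective states
```

In the full pipeline outlined in the abstract, the same audio would also be transcribed by DeepSpeech, answered by the Blended Skill Talk dialogue agent with beam-search decoding, and re-synthesized through the invertible spectrogram model and WaveGlow.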
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model has the ability to save, clone and change the timbre, gender and accent of the speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline with an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63% in identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditioned, diffusion-based generative model built on the pre-trained WavLM model.
It can produce individual and stylized full-body co-speech gestures only using raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.