Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child
Multimodal Interaction Dataset
- URL: http://arxiv.org/abs/2008.09207v1
- Date: Thu, 20 Aug 2020 20:53:23 GMT
- Title: Dyadic Speech-based Affect Recognition using DAMI-P2C Parent-child
Multimodal Interaction Dataset
- Authors: Huili Chen and Yue Zhang and Felix Weninger and Rosalind Picard and
Cynthia Breazeal and Hae Won Park
- Abstract summary: We design end-to-end deep learning methods to recognize each person's affective expression in an audio stream with two speakers.
- Our results show that the proposed weighted-pooling attention solutions are able to learn to focus on the regions containing the target speaker's affective information.
- Score: 29.858195646762297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech-based affect recognition of individuals in dyadic
conversation is a challenging task, in part because of its heavy reliance on
manual pre-processing. Traditional approaches frequently require hand-crafted
speech features and segmentation of speaker turns. In this work, we design
end-to-end deep learning methods to recognize each person's affective
expression in an audio stream with two speakers, automatically discovering
features and time regions relevant to the target speaker's affect. We integrate
a local attention mechanism into the end-to-end architecture and compare the
performance of three attention implementations -- one mean pooling and two
weighted pooling methods. Our results show that the proposed weighted-pooling
attention solutions are able to learn to focus on the regions containing the
target speaker's affective information and successfully extract the individual's
valence and arousal intensity. Here we introduce and use a "dyadic affect in
multimodal interaction - parent to child" (DAMI-P2C) dataset collected in a
study of 34 families, where a parent and a child (3-7 years old) engage in
reading storybooks together. In contrast to existing public datasets for affect
recognition, each instance for both speakers in the DAMI-P2C dataset is
annotated for the perceived affect by three labelers. To encourage more
research on the challenging task of multi-speaker affect sensing, we make the
annotated DAMI-P2C dataset publicly available, including acoustic features of
the dyads' raw audio, affect annotations, and a diverse set of developmental,
social, and demographic profiles of each dyad.
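
To make the pooling comparison in the abstract concrete, below is a minimal sketch (not the authors' released implementation) contrasting plain mean pooling with a learned weighted-pooling attention head over frame-level audio embeddings for valence/arousal regression. The feature dimension, module names (MeanPoolHead, WeightedPoolHead), and layer sizes are illustrative assumptions, and PyTorch is used only as an assumed framework.

```python
# Sketch of the two pooling strategies contrasted in the abstract:
# plain mean pooling vs. learned weighted (attention) pooling over
# frame-level embeddings of the dyadic audio stream.
# All dimensions and names here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 128  # assumed dimensionality of frame-level embeddings


class MeanPoolHead(nn.Module):
    """Baseline: average all frames, then regress valence and arousal."""

    def __init__(self, feat_dim: int = FEAT_DIM):
        super().__init__()
        self.regressor = nn.Linear(feat_dim, 2)  # -> (valence, arousal)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        pooled = frames.mean(dim=1)              # uniform weight per frame
        return self.regressor(pooled)


class WeightedPoolHead(nn.Module):
    """Weighted pooling: learn per-frame relevance scores so that frames
    carrying the target speaker's affective cues dominate the summary."""

    def __init__(self, feat_dim: int = FEAT_DIM):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)     # one relevance score per frame
        self.regressor = nn.Linear(feat_dim, 2)  # -> (valence, arousal)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        scores = self.scorer(frames)             # (batch, time, 1)
        weights = F.softmax(scores, dim=1)       # normalize over time
        pooled = (weights * frames).sum(dim=1)   # attention-weighted summary
        return self.regressor(pooled)


if __name__ == "__main__":
    dummy = torch.randn(4, 300, FEAT_DIM)        # 4 clips, 300 frames each
    print(MeanPoolHead()(dummy).shape)           # torch.Size([4, 2])
    print(WeightedPoolHead()(dummy).shape)       # torch.Size([4, 2])
```

The only difference between the two heads is how frames are summarized: the weighted variant learns per-frame relevance scores, which is what allows it to emphasize regions carrying the target speaker's affective cues while de-emphasizing the other speaker.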
Related papers
- Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System [73.34663391495616]
We propose a pioneering approach to tackle joint multi-talker and target-talker speech recognition tasks.
Specifically, we freeze Whisper and plug a Sidecar separator into its encoder to separate mixed embedding for multiple talkers.
We deliver acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
arXiv Detail & Related papers (2024-07-13T09:28:24Z)
- Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication [17.294279444027563]
We introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners.
We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model.
Our approach demonstrates state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-03-28T14:47:32Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection [37.28070242751129]
Active speaker detection in videos is the task of associating a source face, visible in the video frames, with the underlying speech in the audio modality.
We propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection.
arXiv Detail & Related papers (2022-12-01T14:46:00Z)
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE [2.771610203951056]
This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data.
Experiments were conducted on three different corpora in English and French.
arXiv Detail & Related papers (2022-06-17T14:04:24Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggested that systems incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study [11.825240267691209]
This paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an acoustic event detection pipeline.
We develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process.
arXiv Detail & Related papers (2021-10-07T04:03:21Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The results show that the proposed approach, which combines speech enhancement with multi-stage attention, outperforms two strong baselines without these components under most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.