Estimating Presentation Competence using Multimodal Nonverbal Behavioral
Cues
- URL: http://arxiv.org/abs/2105.02636v1
- Date: Thu, 6 May 2021 13:09:41 GMT
- Title: Estimating Presentation Competence using Multimodal Nonverbal Behavioral
Cues
- Authors: \"Omer S\"umer and Cigdem Beyan and Fabian Ruth and Olaf Kramer and
Ulrich Trautwein and Enkelejda Kasneci
- Abstract summary: Public speaking and presentation competence play an essential role in many areas of social interaction.
One approach that can promote efficient development of presentation competence is the automated analysis of human behavior during a speech.
In this work, we investigate the contribution of different nonverbal behavioral cues, namely, facial, body pose-based, and audio-related features, to estimate presentation competence.
- Score: 7.340483819263093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Public speaking and presentation competence play an essential role in many
areas of social interaction in our educational, professional, and everyday
life. Since our intention during a speech can differ from what is actually
understood by the audience, the ability to appropriately convey our message
requires a complex set of skills. Presentation competence is cultivated in the
early school years and continuously developed over time. One approach that can
promote efficient development of presentation competence is the automated
analysis of human behavior during a speech based on visual and audio features
and machine learning. Furthermore, this analysis can be used to suggest
improvements and the development of skills related to presentation competence.
In this work, we investigate the contribution of different nonverbal behavioral
cues, namely, facial, body pose-based, and audio-related features, to estimate
presentation competence. The analyses were performed on videos of 251 students
while the automated assessment is based on manual ratings according to the
T\"ubingen Instrument for Presentation Competence (TIP). Our classification
results reached the best performance with early fusion in the same dataset
evaluation (accuracy of 71.25%) and late fusion of speech, face, and body pose
features in the cross dataset evaluation (accuracy of 78.11%). Similarly,
regression results performed the best with fusion strategies.
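The abstract contrasts early fusion (concatenating the per-modality feature vectors before a single classifier) with late fusion (combining the predictions of per-modality classifiers). As a rough illustration of the difference, here is a minimal sketch using scikit-learn with random stand-in features; the feature dimensions, classifier choice, and labels are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of early vs. late fusion for multimodal competence
# classification. Feature dimensions, model choices, and the random data
# are illustrative assumptions, not the paper's actual features or models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 251  # number of presentation videos, matching the dataset size in the paper

# Stand-in per-video feature vectors for each modality.
face = rng.normal(size=(n, 64))   # e.g. facial expression statistics
pose = rng.normal(size=(n, 32))   # e.g. body pose keypoint statistics
audio = rng.normal(size=(n, 40))  # e.g. prosodic / spectral statistics
y = rng.integers(0, 2, size=n)    # hypothetical binary competence label

idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.25, random_state=0)

# Early fusion: concatenate all modality features, train one classifier.
X_early = np.concatenate([face, pose, audio], axis=1)
clf_early = LogisticRegression(max_iter=1000).fit(X_early[idx_tr], y[idx_tr])
acc_early = accuracy_score(y[idx_te], clf_early.predict(X_early[idx_te]))

# Late fusion: train one classifier per modality, average their predicted
# probabilities, then threshold the averaged score.
probs = []
for X in (face, pose, audio):
    clf = LogisticRegression(max_iter=1000).fit(X[idx_tr], y[idx_tr])
    probs.append(clf.predict_proba(X[idx_te])[:, 1])
acc_late = accuracy_score(y[idx_te], (np.mean(probs, axis=0) >= 0.5).astype(int))

print(f"early fusion accuracy: {acc_early:.3f}")
print(f"late fusion accuracy:  {acc_late:.3f}")
```

In practice the three feature blocks would come from facial, body-pose, and audio extractors rather than random draws, and the same two fusion strategies carry over to the regression setting by swapping the classifier for a regressor.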
Related papers
- Real-time Addressee Estimation: Deployment of a Deep-Learning Model on
the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z)
- Acoustic and linguistic representations for speech continuous emotion
recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning toward the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor for the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A
Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings of 3-4 persons, and setups equipped with microphones and cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- Measuring the Impact of Individual Domain Factors in Self-Supervised
Pre-Training [60.825471653739555]
We show that phonetic domain factors play an important role during pre-training while grammatical and syntactic factors are far less important.
This is the first study to better understand the domain characteristics of pre-trained sets in self-supervised pre-training for speech.
arXiv Detail & Related papers (2022-03-01T17:40:51Z)
- An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Towards the evaluation of simultaneous speech translation from a
communicative perspective [0.0]
We present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine.
We found better performance for the human interpreters in terms of intelligibility, while the machine performs slightly better in terms of informativeness.
arXiv Detail & Related papers (2021-03-15T13:09:00Z)
- Embedded Emotions -- A Data Driven Approach to Learn Transferable
Feature Representations from Raw Speech Input for Emotion Recognition [1.4556324908347602]
We investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition.
Our results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
arXiv Detail & Related papers (2020-09-30T09:18:31Z)
- Leveraging Multimodal Behavioral Analytics for Automated Job Interview
Performance Assessment and Feedback [0.5872014229110213]
Behavioral cues play a significant part in human communication and cognitive perception.
We propose a multimodal analytical framework that analyzes the candidate in an interview scenario.
We use these multimodal data sources to construct a composite representation, which is used for training machine learning classifiers to predict the class labels.
arXiv Detail & Related papers (2020-06-14T14:20:42Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations
for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- You Impress Me: Dialogue Generation via Mutual Persona Perception [62.89449096369027]
The research in cognitive science suggests that understanding is an essential signal for a high-quality chit-chat conversation.
Motivated by this, we propose P2 Bot, a transmitter-receiver based framework with the aim of explicitly modeling understanding.
arXiv Detail & Related papers (2020-04-11T12:51:07Z)