Exploiting temporal information to detect conversational groups in videos and predict the next speaker
- URL: http://arxiv.org/abs/2408.16380v1
- Date: Thu, 29 Aug 2024 09:41:36 GMT
- Title: Exploiting temporal information to detect conversational groups in videos and predict the next speaker
- Authors: Lucrezia Tosato, Victor Fortier, Isabelle Bloch, Catherine Pelachaud
- Abstract summary: This paper aims at detecting F-formations in video sequences and predicting the next speaker in a group conversation.
We rely on measuring the engagement level of people as a feature of group belonging.
Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
- Score: 2.7981106665946944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Studies in human-human interaction have introduced the concept of F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach makes use of a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the speaker's turn in a conversation group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
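The abstract gives no implementation details, so the following is only a minimal sketch of the stated idea, assuming per-frame engagement features for each group member are concatenated and fed to an LSTM whose final state is mapped to a distribution over participants; all names, dimensions, and the feature encoding are hypothetical.

```python
# Hypothetical sketch, not the authors' code: next-speaker prediction with
# an LSTM over per-participant engagement features.
import torch
import torch.nn as nn

class NextSpeakerLSTM(nn.Module):
    def __init__(self, feat_dim=8, hidden_dim=64, max_participants=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim * max_participants, hidden_dim,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, max_participants)  # one logit each

    def forward(self, x):
        # x: (batch, time, max_participants * feat_dim) engagement features
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # classify from the last time step

model = NextSpeakerLSTM()
window = torch.randn(2, 30, 8 * 6)           # 2 groups, 30 frames each
next_speaker = model(window).argmax(dim=-1)  # predicted next-speaker index
```

Training such a model would typically minimize cross-entropy against the identity of the actual next speaker in annotated recordings such as MatchNMingle.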
Related papers
- Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.
This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
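As a rough illustration of the cross-attention fusion this summary mentions (not the paper's actual architecture; all module names and shapes are assumptions), linguistic token states can attend over acoustic frame states before an end-of-utterance (EOU) head:

```python
# Illustrative sketch only: fusing acoustic and linguistic streams with
# cross-attention for EOU prediction; names and dimensions are assumed.
import torch
import torch.nn as nn

class CrossModalEOU(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.eou_head = nn.Linear(d_model, 1)

    def forward(self, text_states, audio_states):
        # Linguistic states query the acoustic frames.
        fused, _ = self.cross_attn(text_states, audio_states, audio_states)
        return self.eou_head(fused[:, -1])  # logit: is EOU imminent?

model = CrossModalEOU()
text = torch.randn(1, 12, 256)    # 12 token states
audio = torch.randn(1, 200, 256)  # 200 acoustic frames
print(torch.sigmoid(model(text, audio)))  # probability of imminent EOU
```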
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Introducing MeMo: A Multimodal Dataset for Memory Modelling in Multiparty Conversations [1.8896253910986929]
The MeMo corpus is the first dataset annotated with participants' memory retention reports.
It integrates validated behavioural and perceptual measures, audio, video, and multimodal annotations.
This paper aims to pave the way for future research in conversational memory modelling for intelligent system development.
arXiv Detail & Related papers (2024-09-07T16:09:36Z)
- Target conversation extraction: Source separation using turn-taking dynamics [23.189364779538757]
We introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants.
Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets.
In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations.
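For context, an improvement figure such as 8.19 dB is conventionally the SNR of the separated output minus the SNR of the unprocessed mixture, both measured against the ground-truth target; a generic sketch of that computation follows (the paper may use a scale-invariant variant):

```python
# Generic SNR-in-dB computation; not taken from the paper's evaluation code.
import numpy as np

def snr_db(estimate, target):
    noise = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

# improvement_db = snr_db(separated, target) - snr_db(mixture, target)
```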
arXiv Detail & Related papers (2024-07-15T22:55:27Z)
- Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue is speaking activity; the most common computational method is support vector machines; the typical interaction environment is a meeting of 3-4 persons; and the dominant sensing approach is microphones and cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z)
- Conversation Group Detection With Spatio-Temporal Context [11.288403109735544]
We propose an approach for detecting conversation groups in social scenarios like cocktail parties and networking events.
We posit the detection of conversation groups as a learning problem that could benefit from leveraging the spatial context of the surroundings.
This motivates our approach, which consists of a dynamic LSTM-based deep learning model that predicts continuous pairwise affinity values.
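One plausible way to turn such pairwise affinities into discrete groups, shown here purely for illustration since the paper's grouping step may differ, is to threshold the affinity matrix and take connected components:

```python
# Assumed post-processing sketch: threshold predicted pairwise affinities
# and read conversation groups off the connected components.
import numpy as np
from scipy.sparse.csgraph import connected_components

def groups_from_affinity(affinity, threshold=0.5):
    # affinity: (n_people, n_people) matrix of predicted pairwise values
    adjacency = (affinity >= threshold).astype(int)
    np.fill_diagonal(adjacency, 0)
    _, labels = connected_components(adjacency, directed=False)
    return labels  # labels[i] is the group index of person i

affinity = np.array([[1.0, 0.9, 0.1],
                     [0.9, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
print(groups_from_affinity(affinity))  # [0 0 1]: persons 0 and 1 converse
```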
arXiv Detail & Related papers (2022-06-02T08:05:02Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- Unsupervised Conversation Disentanglement through Co-Training [30.304609312675186]
We explore training a conversation disentanglement model without referencing any human annotations.
Our method is built upon a deep co-training algorithm, which consists of two neural networks.
For the message-pair classifier, we enrich its training data by retrieving message pairs with high confidence.
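Schematically, co-training alternates between the two models, each adding its most confident pseudo-labels to a shared training pool; the sketch below assumes a hypothetical fit/confidence/predict interface and is not the paper's implementation:

```python
# Schematic co-training loop; model_a and model_b are hypothetical objects
# exposing fit(data), confidence(x), and predict(x).
def co_train(model_a, model_b, labeled, unlabeled, rounds=5, top_k=100):
    for _ in range(rounds):
        model_a.fit(labeled)
        model_b.fit(labeled)
        for model in (model_a, model_b):
            # Pseudo-label the examples this model is most confident about
            # and move them into the shared labeled pool.
            scored = sorted(unlabeled, key=model.confidence, reverse=True)
            confident = scored[:top_k]
            labeled += [(x, model.predict(x)) for x in confident]
            unlabeled = [x for x in unlabeled if x not in confident]
    return model_a, model_b
```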
arXiv Detail & Related papers (2021-09-07T17:05:18Z)
- Detecting Speaker Personas from Conversational Texts [52.4557098875992]
We study a new task, named Speaker Persona Detection (SPD), which aims to detect speaker personas based on the plain conversational text.
We build a dataset for SPD, dubbed Persona Match on Persona-Chat (PMPC).
We evaluate several baseline models and propose utterance-to-profile (U2P) matching networks for this task.
arXiv Detail & Related papers (2021-09-03T06:14:38Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
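A hedged sketch of the momentum-contrastive ingredient (MoCo-style, simplified; the paper's prototypical component is not shown, and all shapes are assumptions):

```python
# Simplified MoCo-style pieces for speaker embeddings: an EMA key encoder
# and an InfoNCE loss over a queue of negative keys.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # The key encoder trails the query encoder as an exponential moving average.
    for k_p, q_p in zip(key_encoder.parameters(), query_encoder.parameters()):
        k_p.data.mul_(m).add_(q_p.data, alpha=1 - m)

def info_nce(q, k_pos, queue, temperature=0.07):
    # q, k_pos: (batch, dim) embeddings of two segments of the same speaker;
    # queue: (queue_size, dim) embeddings of past (negative) keys.
    q, k_pos = F.normalize(q, dim=1), F.normalize(k_pos, dim=1)
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (batch, 1)
    l_neg = q @ F.normalize(queue, dim=1).t()           # (batch, queue_size)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # positives at index 0
    return F.cross_entropy(logits, labels)
```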
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.