No-audio speaking status detection in crowded settings via visual
pose-based filtering and wearable acceleration
- URL: http://arxiv.org/abs/2211.00549v1
- Date: Tue, 1 Nov 2022 15:55:48 GMT
- Title: No-audio speaking status detection in crowded settings via visual
pose-based filtering and wearable acceleration
- Authors: Jose Vargas-Quiros, Laura Cabrera-Quiros, Hayley Hung
- Abstract summary: Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
We show that the selection of local features around pose keypoints has a positive effect on generalization performance.
We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
- Score: 8.710774926703321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing who is speaking in a crowded scene is a key challenge towards the
understanding of the social interactions going on within. Detecting speaking
status from body movement alone opens the door for the analysis of social
scenes in which personal audio is not obtainable. Video and wearable sensors
make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
When considering the video modality, in action recognition problems, a bounding
box is traditionally used to localize and segment out the target subject, to
then recognize the action taking place within it. However, cross-contamination,
occlusion, and the articulated nature of the human body, make this approach
challenging in a crowded scene. Here, we leverage articulated body poses both
for subject localization and in the subsequent speech detection stage. We show that
the selection of local features around pose keypoints has a positive effect on
generalization performance while also significantly reducing the number of
local features considered, making for a more efficient method. Using two
in-the-wild datasets with different viewpoints of subjects, we investigate the
role of cross-contamination in this effect. We additionally make use of
acceleration measured through wearable sensors for the same task, and present a
multimodal approach combining both methods.
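The snippet below is a minimal, hypothetical sketch (not the authors' implementation) of the two ideas described in the abstract: keeping only local video features in patches around detected pose keypoints, and combining a video-based speaking-status score with one from wearable acceleration via simple late fusion. The function names, patch size, 18-keypoint layout, and fusion weights are illustrative assumptions.
```python
import numpy as np

# Hypothetical sketch: local feature selection around pose keypoints and
# late fusion with a wearable-acceleration classifier. Not the authors' code;
# array shapes and names are illustrative assumptions.

def select_keypoint_patches(frame, keypoints, patch_size=16):
    """Crop square patches centred on each detected pose keypoint.

    frame:     (H, W, 3) uint8 image containing the target subject.
    keypoints: (K, 2) array of (x, y) pixel coordinates (e.g. from an
               OpenPose-style pose estimator).
    Returns a list of (patch_size, patch_size, 3) patches, skipping
    keypoints that fall too close to the image border.
    """
    half = patch_size // 2
    h, w = frame.shape[:2]
    patches = []
    for x, y in keypoints.astype(int):
        if half <= x < w - half and half <= y < h - half:
            patches.append(frame[y - half:y + half, x - half:x + half])
    return patches


def late_fusion(p_video, p_accel, w_video=0.5):
    """Weighted average of per-window speaking probabilities from the
    video-based and acceleration-based classifiers (a simple fusion
    baseline, assumed here for illustration)."""
    return w_video * p_video + (1.0 - w_video) * p_accel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 255, size=(480, 640, 3), dtype=np.uint8)
    keypoints = rng.uniform([0, 0], [640, 480], size=(18, 2))  # 18 body joints

    patches = select_keypoint_patches(frame, keypoints)
    print(f"kept {len(patches)} local patches out of {len(keypoints)} keypoints")

    # Stand-ins for classifier outputs on one analysis window.
    p_video, p_accel = 0.72, 0.55
    print("fused speaking probability:", late_fusion(p_video, p_accel))
```
Restricting features to keypoint-centred patches, as in this sketch, is what reduces the number of local features considered and limits cross-contamination from neighbouring subjects.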
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - Detecting Events in Crowds Through Changes in Geometrical Dimensions of
Pedestrians [0.6390468088226495]
We examine three different scenarios of crowd behavior, containing both the cases where an event triggers a change in the behavior of the crowd and two video sequences where the crowd and its motion remain mostly unchanged.
With both the videos and the tracking of the individual pedestrians (performed in a pre-processing phase), we use Geomind to extract significant data about the scene, in particular, the geometrical features, personalities, and emotions of each person.
We then examine the output, seeking a significant change in the way each person acts as a function of time, which could be used as a basis to identify events or to model realistic crowd behavior.
arXiv Detail & Related papers (2023-12-11T16:18:56Z) - Egocentric Auditory Attention Localization in Conversations [25.736198724595486]
We propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention.
Our approach leverages features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset.
arXiv Detail & Related papers (2023-03-28T14:52:03Z) - Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling visually-indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z) - Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z) - Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization [13.144367063836597]
We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results.
Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
arXiv Detail & Related papers (2022-01-06T05:40:16Z) - Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture
Recognition [9.131161856493486]
We propose a novel end-to-end Regional Attention Network (RAN), a fully Convolutional Neural Network (CNN).
Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing HOG (Histogram of Oriented Gradient) descriptor.
The proposed approach outperforms the state-of-the-art by a considerable margin in different metrics.
arXiv Detail & Related papers (2021-01-17T10:14:28Z) - Toward Accurate Person-level Action Recognition in Videos of Crowded
Scenes [131.9067467127761]
We focus on improving the action recognition by fully-utilizing the information of scenes and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in every frame.
We then apply action recognition models to learn the temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)