SocialGesture: Delving into Multi-person Gesture Understanding
- URL: http://arxiv.org/abs/2504.02244v1
- Date: Thu, 03 Apr 2025 03:21:06 GMT
- Title: SocialGesture: Delving into Multi-person Gesture Understanding
- Authors: Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, James M. Rehg,
- Abstract summary: SocialGesture is the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks.
- Score: 24.407821757676615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gestures during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.
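As a minimal sketch of getting started with the released data (assuming the repository loads through the standard `datasets` path; the split and column layout are not specified in the abstract and should be inspected after loading):

```python
# Minimal sketch: fetch SocialGesture from the Hugging Face Hub.
# Assumes the standard `datasets` loading path works for this repository;
# the available splits and feature names are not confirmed by the paper.
from datasets import load_dataset

ds = load_dataset("IrohXu/SocialGesture")  # repository named in the abstract
print(ds)  # inspect splits and features before building a recognition/VQA pipeline
```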
Related papers
- SocialGen: Modeling Multi-Human Social Interaction with Language Models [30.068879125411726]
We introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors.
arXiv Detail & Related papers (2025-03-28T22:57:25Z) - Survey on Hand Gesture Recognition from Visual Input [2.1591725778863555]
Hand gesture recognition has become an important research area, driven by the growing demand for human-computer interaction.
However, few surveys comprehensively cover recent research developments, available solutions, and benchmark datasets.
This survey addresses this gap by examining the latest advancements in hand gesture and 3D hand pose recognition from various types of camera input data.
arXiv Detail & Related papers (2025-01-21T09:23:22Z) - From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition [59.57095498284501]
We propose a novel approach that recognizes Contextual Social Relationships (ConSoR) from a social cognitive perspective.
We construct social-aware descriptive language prompts with social relationships for each image.
Impressively, ConSoR outperforms previous methods with a 12.2% gain on the People-in-Social-Context (PISC) dataset and a 9.8% increase on the People-in-Photo-Album (PIPA) benchmark.
arXiv Detail & Related papers (2024-06-12T16:02:28Z) - Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations [20.848802791989307]
We introduce three new challenges to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction.
We propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances.
Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.
arXiv Detail & Related papers (2024-03-04T14:46:58Z) - SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z) - A Comprehensive Review of Data-Driven Co-Speech Gesture Generation [11.948557523215316]
The automatic generation of such co-speech gestures is a long-standing problem in computer animation.
Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion.
This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
arXiv Detail & Related papers (2023-01-13T00:20:05Z) - Face-to-Face Contrastive Learning for Social Intelligence Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks, but have difficulty modeling complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are that the most frequently used nonverbal cue, computational method, interaction environment, and sensing approach are, respectively, speaking activity, support vector machines, meetings of 3-4 persons, and microphones combined with cameras.
arXiv Detail & Related papers (2022-07-20T13:37:57Z) - Didn't see that coming: a survey on non-verbal social human behavior forecasting [47.99589136455976]
Non-verbal social human behavior forecasting has increasingly attracted the interest of the research community in recent years.
Its direct applications to human-robot interaction and socially-aware human motion generation make it a very attractive field.
We define the behavior forecasting problem for multiple interactive agents in a generic way that aims at unifying the fields of social signals prediction and human motion forecasting.
arXiv Detail & Related papers (2022-03-04T18:25:30Z) - PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception [50.551003004553806]
We create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions.
PHASE is validated with human experiments demonstrating that humans perceive rich interactions in the social events.
As a baseline model, we introduce a Bayesian inverse planning approach, SIMPLE, which outperforms state-of-the-art feed-forward neural networks.
arXiv Detail & Related papers (2021-03-02T18:44:57Z)