Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks
- URL: http://arxiv.org/abs/2507.19684v1
- Date: Fri, 25 Jul 2025 21:33:48 GMT
- Title: Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks
- Authors: Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang, Trinity Evans, Paige Tuttösí, Angelica Lim
- Abstract summary: We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. We evaluate CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text: it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors, and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, together with a multitask SalsaAgent model capable of performing all benchmark tasks, as well as additional baselines, to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
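Read through the language analogy, both benchmarks reduce to conditional motion generation. The sketch below is a hypothetical illustration only: the names (`MoveSegment`, `generate_partner`, `generate_duet`) and the field layout are assumptions, not the released CoMPAS3D API, but they show how the annotation schema and the two benchmark tasks described in the abstract might be expressed in code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np

# Hypothetical schema for one of the 2,800+ annotated move segments;
# the released CoMPAS3D data defines its own (possibly different) format.
@dataclass
class MoveSegment:
    move_type: str                      # e.g. "cross-body lead"
    start_frame: int
    end_frame: int
    proficiency: str                    # "beginner" | "intermediate" | "professional"
    execution_errors: List[str] = field(default_factory=list)
    stylistic_elements: List[str] = field(default_factory=list)


def generate_partner(given_motion: np.ndarray, role: str, proficiency: str) -> np.ndarray:
    """Benchmark task 1 (speaker/listener synthesis): given one dancer's
    motion (frames x joints x 3), a target role ("leader" or "follower"),
    and a skill level, synthesize the partner's motion.
    Placeholder signature only."""
    raise NotImplementedError


def generate_duet(num_frames: int) -> Tuple[np.ndarray, np.ndarray]:
    """Benchmark task 2 (conversation generation): jointly synthesize
    coupled leader and follower motion. Placeholder signature only."""
    raise NotImplementedError


if __name__ == "__main__":
    segment = MoveSegment("cross-body lead", start_frame=120, end_frame=210,
                          proficiency="intermediate",
                          execution_errors=["late timing"])
    print(segment)
```

Framing both tasks as conditional sequence generation mirrors the speaker/listener and dialogue analogies the paper draws with spoken language processing.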
Related papers
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z) - Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues [19.675409379345172]
We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language.
arXiv Detail & Related papers (2025-06-01T11:07:25Z) - The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion [46.01825432018138]
We propose a novel framework that unifies verbal and non-verbal language using multimodal language models. Our model achieves state-of-the-art performance on co-speech gesture generation. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications.
arXiv Detail & Related papers (2024-12-13T19:33:48Z) - Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation.
In particular, we identify one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest.
Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z) - Nonverbal Interaction Detection [83.40522919429337]
This work addresses a new challenge of understanding human nonverbal interaction in social contexts.
We contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups.
Second, we establish a new task NVI-DET for nonverbal interaction detection, which is formalized as identifying triplets in the form <individual, group, interaction> from images.
Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs.
arXiv Detail & Related papers (2024-07-11T02:14:06Z) - in2IN: Leveraging individual Information to Generate Human INteractions [29.495166514135295]
We introduce in2IN, a novel diffusion model for human-human motion generation conditioned on individual descriptions.
We also propose DualMDM, a model composition technique that combines the motions generated with in2IN and the motions generated by a single-person motion prior pre-trained on HumanML3D.
arXiv Detail & Related papers (2024-04-15T17:59:04Z) - Contact-aware Human Motion Generation from Textual Descriptions [57.871692507044344]
This paper addresses the problem of generating 3D interactive human motion from text.
We create a novel dataset named RICH-CAT, representing "Contact-Aware Texts".
We propose a novel approach named CATMO for text-driven interactive human motion synthesis.
arXiv Detail & Related papers (2024-03-23T04:08:39Z) - DisCo: Disentangled Control for Realistic Human Dance Generation [125.85046815185866]
We introduce DisCo, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis.
DisCo can generate high-quality human dance images and videos with diverse appearances and flexible motions.
arXiv Detail & Related papers (2023-06-30T17:37:48Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, and more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.