Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
- URL: http://arxiv.org/abs/2410.00309v1
- Date: Tue, 1 Oct 2024 01:14:24 GMT
- Title: Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models
- Authors: Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
- Abstract summary: Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME).
We introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired images and pseudo-ground truth meshes.
This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME.
- Score: 5.541130887628606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data. Addressing these challenges, we introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps, which guide test-time optimization to produce paired image and pseudo-ground truth meshes. This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME. Our Ask Pose Unite (APU) dataset, comprising over 6.2k human mesh pairs in contact and covering diverse interaction types, is curated from images depicting naturalistic person-to-person scenes. We empirically show that using our dataset to train a diffusion-based contact prior, used as guidance during optimization, improves mesh estimation on unseen interactions. Our work addresses the longstanding challenge of data scarcity for close interactions in HME, enhancing the field's capability to handle complex interaction scenarios.
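The abstract outlines a two-stage recipe: prompt an LVLM for a contact map between the two people in an image, then use that contact map as guidance during test-time optimization to fit a pair of meshes that serve as pseudo-ground truth. The sketch below is a minimal, self-contained illustration of that coupling only, under stated assumptions: the LVLM call is stubbed with a hard-coded answer, the bodies are reduced to a few named 3D keypoints rather than full parametric meshes, and every function name is a hypothetical placeholder rather than the authors' implementation.

```python
import numpy as np

def ask_lvlm_for_contact_map(image_id: str) -> list[tuple[str, str]]:
    """Stub for the LVLM query.

    In the actual method, an image and a prompt are sent to a large
    vision-language model, which answers with the body regions of the two
    people that are in contact. Here the answer is hard-coded (a hug).
    """
    return [("right_hand", "upper_back"), ("left_hand", "upper_back")]

def init_person(x_offset: float) -> dict[str, np.ndarray]:
    """Toy 'mesh': a few named 3D keypoints instead of a full body model."""
    return {
        "right_hand": np.array([0.3 + x_offset, 0.0, 1.2]),
        "left_hand":  np.array([-0.3 + x_offset, 0.0, 1.2]),
        "upper_back": np.array([x_offset, -0.15, 1.4]),
    }

def contact_loss(person_a, person_b, contact_pairs) -> float:
    """Squared distance between region pairs the LVLM marked as touching."""
    return float(sum(np.sum((person_a[ra] - person_b[rb]) ** 2)
                     for ra, rb in contact_pairs))

def generate_pseudo_ground_truth(image_id: str, steps: int = 100, lr: float = 0.05):
    """Test-time optimization guided by the LVLM-annotated contact map.

    Only the contact term is kept here, to show how the annotation steers
    the fit; a real pipeline would also balance reprojection and prior terms.
    """
    contact_pairs = ask_lvlm_for_contact_map(image_id)
    person_a, person_b = init_person(0.0), init_person(1.0)
    for _ in range(steps):
        # Analytic gradient of the quadratic contact loss w.r.t. person A's regions.
        for ra, rb in contact_pairs:
            grad = 2.0 * (person_a[ra] - person_b[rb])
            person_a[ra] = person_a[ra] - lr * grad
    return person_a, person_b, contact_loss(person_a, person_b, contact_pairs)

if __name__ == "__main__":
    _, _, residual = generate_pseudo_ground_truth("example_image")
    print(f"contact residual after optimization: {residual:.4f}")
```

In the full method described above, the resulting image–mesh pairs are what make up the APU dataset, which in turn trains the diffusion-based contact prior used as guidance on unseen interactions.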
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction [47.336599393600046]
FedNE is a novel approach that integrates the FedAvg framework with the contrastive NE technique.
We conduct comprehensive experiments on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-09-17T19:23:24Z)
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- Scaling Up Dynamic Human-Scene Interaction Modeling [58.032368564071895]
TRUMANS is the most comprehensive motion-captured HSI dataset currently available.
It intricately captures whole-body human motions and part-level object dynamics.
We devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length.
arXiv Detail & Related papers (2024-03-13T15:45:04Z)
- Ins-HOI: Instance Aware Human-Object Interactions Recovery [44.02128629239429]
We propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework.
Ins-HOI supports instance-level reconstruction and provides reasonable and realistic invisible contact surfaces.
We collect a large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans with real-world human-chair and hand-object interactions.
arXiv Detail & Related papers (2023-12-15T09:30:47Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions [23.296139146133573]
We present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity.
Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues.
To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding.
arXiv Detail & Related papers (2023-10-18T17:57:05Z)
- Joint-Relation Transformer for Multi-Person Motion Prediction [79.08243886832601]
We propose the Joint-Relation Transformer to enhance interaction modeling.
Our method achieves a 13.4% improvement in 900ms VIM on 3DPW-SoMoF/RC and 17.8%/12.0% improvements in 3s MPJPE.
arXiv Detail & Related papers (2023-08-09T09:02:47Z)
- RobustFusion: Robust Volumetric Performance Reconstruction under Human-object Interactions from Monocular RGBD Stream [27.600873320989276]
High-quality 4D reconstruction of human performance with complex interactions to various objects is essential in real-world scenarios.
Recent advances still fail to provide reliable performance reconstruction.
We propose RobustFusion, a robust volumetric performance reconstruction system for human-object interaction scenarios.
arXiv Detail & Related papers (2021-04-30T08:41:45Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Mutual Graph Learning for Camouflaged Object Detection [31.422775969808434]
A major challenge is that intrinsic similarities between foreground objects and background surroundings make the features extracted by deep models indistinguishable.
We design a novel Mutual Graph Learning model, which generalizes the idea of conventional mutual learning from regular grids to the graph domain.
In contrast to most mutual learning approaches that use a shared function to model all between-task interactions, MGL is equipped with typed functions for handling different complementary relations.
arXiv Detail & Related papers (2021-04-03T10:14:39Z)