PCIE_Interaction Solution for Ego4D Social Interaction Challenge
- URL: http://arxiv.org/abs/2505.24404v1
- Date: Fri, 30 May 2025 09:35:25 GMT
- Title: PCIE_Interaction Solution for Ego4D Social Interaction Challenge
- Authors: Kanokphan Lertniphonphan, Feng Chen, Junda Xu, Fengbu Lan, Jun Xie, Tao Zhang, Zhepeng Wang
- Abstract summary: This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025. The challenge requires accurate detection of social interactions between subjects and the camera wearer. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards.
- Score: 25.283193734091462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025, addressing both Looking At Me (LAM) and Talking To Me (TTM) tasks. The challenge requires accurate detection of social interactions between subjects and the camera wearer, with LAM relying exclusively on face crop sequences and TTM combining speaker face crops with synchronized audio segments. In the LAM track, we employ face quality enhancement and ensemble methods. For the TTM task, we extend visual interaction analysis by fusing audio and visual cues, weighted by a visual quality score. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards, respectively. Code is available at https://github.com/KanokphanL/PCIE_Ego4D_Social_Interaction
Related papers
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z) - Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction [7.412918099791407]
Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy).
arXiv Detail & Related papers (2025-05-27T11:24:38Z) - Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluate LAMs and collect 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z) - PCIE_LAM Solution for Ego4D Looking At Me Challenge [25.029465595146533]
This report presents our solution for the Ego4D Looking At Me Challenge at CVPR2024.
The main goal of the challenge is to accurately determine if a person in the scene is looking at the camera wearer.
Our approach achieved 1st place in the Looking At Me challenge with 0.81 mAP and a 0.93 accuracy rate.
arXiv Detail & Related papers (2024-06-18T02:16:32Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z) - Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversational head generation benchmark for synthesizing the behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z) - On the Linguistic and Computational Requirements for Creating
Face-to-Face Multimodal Human-Machine Interaction [0.0]
We videorecorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all the occurrences of multimodal actions and events.
The data show evidence that double-loop feedback is established during a face-to-face conversation.
We propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used to describe human-machine multimodal interactions.
arXiv Detail & Related papers (2022-11-24T21:17:36Z) - Face-to-Face Contrastive Learning for Social Intelligence
Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks, but have difficulty modeling complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z) - Modeling Cross-view Interaction Consistency for Paired Egocentric
Interaction Recognition [16.094976277810556]
Paired egocentric interaction recognition (PEIR) is the task of collaboratively recognizing the interactions between two persons from the videos captured in their corresponding views.
We propose to build the relevance between the two views using bilinear pooling, which captures the consistency of the two views at the feature level (a brief sketch follows this entry).
Experimental results on the PEV dataset show the superiority of the proposed methods on the PEIR task.
arXiv Detail & Related papers (2020-03-24T05:05:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all listed content) and is not responsible for any consequences of its use.