PCIE_LAM Solution for Ego4D Looking At Me Challenge
- URL: http://arxiv.org/abs/2406.12211v1
- Date: Tue, 18 Jun 2024 02:16:32 GMT
- Title: PCIE_LAM Solution for Ego4D Looking At Me Challenge
- Authors: Kanokphan Lertniphonphan, Jun Xie, Yaqing Meng, Shijing Wang, Feng Chen, Zhepeng Wang
- Abstract summary: This report presents our solution for the Ego4D Looking At Me Challenge at CVPR2024.
The main goal of the challenge is to accurately determine if a person in the scene is looking at the camera wearer.
Our approach achieved the 1st position in the looking at me challenge with 0.81 mAP and 0.93 accuracy rate.
- Score: 25.029465595146533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents our team's 'PCIE_LAM' solution for the Ego4D Looking At Me Challenge at CVPR2024. The main goal of the challenge is to accurately determine if a person in the scene is looking at the camera wearer, based on a video where the faces of social partners have been localized. Our proposed solution, InternLSTM, consists of an InternVL image encoder and a Bi-LSTM network. The InternVL extracts spatial features, while the Bi-LSTM extracts temporal features. However, the task is highly challenging due to the distance between the person in the scene and the camera wearer, as well as camera movement, both of which cause significant blurring of the face image. To address the complexity of the task, we implemented a Gaze Smoothing filter to eliminate noise or spikes from the output. Our approach achieved the 1st position in the looking at me challenge with 0.81 mAP and a 0.93 accuracy rate. Code is available at https://github.com/KanokphanL/Ego4D_LAM_InternLSTM
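To make the pipeline described in the abstract concrete, the sketch below shows the general idea under stated assumptions: per-frame face features from an image encoder (an InternVL-style backbone in the paper) are fed to a Bi-LSTM that emits a per-frame "looking at me" logit, and a simple smoothing filter suppresses spikes in the per-frame scores. The module names, feature dimension, and the moving-average choice for the Gaze Smoothing step are illustrative assumptions, not the authors' exact implementation; see the linked repository for the official code.

```python
# Minimal sketch of the InternLSTM idea: image-encoder features per frame,
# a Bi-LSTM for temporal context, and a smoothing filter over per-frame scores.
# Dimensions and the moving-average filter are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMHead(nn.Module):
    """Binary per-frame classifier over a sequence of face-crop features."""

    def __init__(self, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)  # one logit per frame

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) spatial features from the image encoder
        out, _ = self.lstm(feats)
        return self.fc(out).squeeze(-1)  # (batch, time) logits


def smooth_scores(scores: torch.Tensor, kernel: int = 5) -> torch.Tensor:
    """Moving-average smoothing over 1-D per-frame probabilities."""
    pad = kernel // 2
    padded = F.pad(scores[None, None], (pad, pad), mode="replicate")
    weight = torch.full((1, 1, kernel), 1.0 / kernel)
    return F.conv1d(padded, weight).squeeze()


if __name__ == "__main__":
    # In practice the features would come from an InternVL-style encoder;
    # random features stand in here so the sketch runs end to end.
    feats = torch.randn(1, 32, 768)        # 1 clip, 32 frames
    probs = torch.sigmoid(BiLSTMHead()(feats))[0]
    print(smooth_scores(probs).shape)      # torch.Size([32])
```

A bidirectional LSTM fits this task because whether a person is looking at the wearer in a given frame is easier to judge with both past and future context, while the smoothing window trades per-frame responsiveness for stability of the final labels.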
Related papers
- PCIE_Interaction Solution for Ego4D Social Interaction Challenge [25.283193734091462]
This report presents our team's PCIE_Interaction solution for the Ego4D Social Interaction Challenge at CVPR 2025. The challenge requires accurate detection of social interactions between subjects and the camera wearer. Our approach achieved 0.81 and 0.71 mean average precision (mAP) on the LAM and TTM challenge leaderboards, respectively.
arXiv Detail & Related papers (2025-05-30T09:35:25Z) - Social EgoMesh Estimation [7.021561988248192]
We propose a novel framework for Social Egocentric Estimation of body MEshes (SEE-ME).
Our approach is the first to estimate the wearer's mesh using only a latent probabilistic diffusion model.
Overall, SEE-ME surpasses the current best technique, reducing the pose estimation error (MPJPE) by 53%.
arXiv Detail & Related papers (2024-11-07T10:28:49Z) - AIM 2024 Sparse Neural Rendering Challenge: Methods and Results [64.19942455360068]
This paper reviews the challenge on Sparse Neural Rendering that was part of the Advances in Image Manipulation (AIM) workshop, held in conjunction with ECCV 2024.
The challenge aims at producing novel camera view synthesis of diverse scenes from sparse image observations.
Participants are asked to optimise objective fidelity to the ground-truth images as measured via the Peak Signal-to-Noise Ratio (PSNR) metric.
arXiv Detail & Related papers (2024-09-23T14:17:40Z) - PCIE_EgoHandPose Solution for EgoExo4D Hand Pose Challenge [12.31892993103657]
The main goal of the challenge is to accurately estimate hand poses, comprising 21 3D joints, from egocentric RGB video.
To handle the complexity of the task, we propose the Hand Pose Vision Transformer (HP-ViT).
The HP-ViT comprises a ViT backbone and a transformer head to estimate joint positions in 3D, utilizing MPJPE and RLE loss functions.
Our approach achieved the 1st position in the Hand Pose challenge with 25.51 MPJPE and 8.49 PA-MPJPE.
arXiv Detail & Related papers (2024-06-18T02:41:32Z) - 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation [81.50620771207329]
We investigate the effectiveness of static-dominant data and frame sampling on referring video object segmentation (RVOS).
Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge.
arXiv Detail & Related papers (2024-06-11T08:05:26Z) - Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization [38.64540967776744]
Diff2Lip is an audio-conditioned diffusion-based model that performs lip synchronization in the wild while preserving identity and visual quality.
We show results on both reconstruction (same audio-video inputs) as well as cross (different audio-video inputs) settings on Voxceleb2 and LRW datasets.
arXiv Detail & Related papers (2023-08-18T17:59:40Z) - MIPI 2023 Challenge on Nighttime Flare Removal: Methods and Results [88.0792325532059]
We summarize and review the Nighttime Flare Removal track on MIPI 2023.
120 participants were successfully registered, and 11 teams submitted results in the final testing phase.
The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal.
arXiv Detail & Related papers (2023-05-23T07:34:49Z) - EgoLocate: Real-time Motion Capture, Localization, and Mapping with Sparse Body-mounted Sensors [74.1275051763006]
We develop a system that simultaneously performs human motion capture (mocap), localization, and mapping in real time from sparse body-mounted sensors.
Our technique largely improves on the state of the art of the two fields.
arXiv Detail & Related papers (2023-05-02T16:56:53Z) - NTIRE 2022 Challenge on High Dynamic Range Imaging: Methods and Results [173.32437855731752]
The challenge was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2022.
The challenge aims at estimating an HDR image from multiple low dynamic range (LDR) observations.
arXiv Detail & Related papers (2022-05-25T10:20:06Z) - A Simple Baseline for Pose Tracking in Videos of Crowded Scenes [130.84731947842664]
Tracking human poses in crowded and complex environments has not been well addressed.
We use a multi-object tracking method to assign a human ID to each bounding box generated by the detection model.
Finally, optical flow is used to exploit the temporal information in the videos and generate the final pose tracking result.
arXiv Detail & Related papers (2020-10-16T13:06:21Z)