First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Atomic Activity Recognition 2024
- URL: http://arxiv.org/abs/2410.23092v1
- Date: Wed, 30 Oct 2024 15:06:58 GMT
- Title: First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Atomic Activity Recognition 2024
- Authors: Ruyang Li, Tengfei Zhang, Heng Zhang, Tiejun Liu, Yanwei Wang, Xuelei Li,
- Abstract summary: This report presents our team's technical solution for participating in Track 3 of the 2024 ECCV ROAD++ Challenge.
The task of Track 3 is atomic activity recognition, which aims to identify 64 types of atomic activities in road scenes based on video content.
Our approach primarily addresses the challenges of small objects, discriminating between single object and a group of objects, as well as model overfitting.
- Score: 5.674251666234644
- License:
- Abstract: This report presents our team's technical solution for participating in Track 3 of the 2024 ECCV ROAD++ Challenge. The task of Track 3 is atomic activity recognition, which aims to identify 64 types of atomic activities in road scenes based on video content. Our approach primarily addresses the challenges of small objects, discriminating between single object and a group of objects, as well as model overfitting in this task. Firstly, we construct a multi-branch activity recognition framework that not only separates different object categories but also the tasks of single object and object group recognition, thereby enhancing recognition accuracy. Subsequently, we develop various model ensembling strategies, including integrations of multiple frame sampling sequences, different frame sampling sequence lengths, multiple training epochs, and different backbone networks. Furthermore, we propose an atomic activity recognition data augmentation method, which greatly expands the sample space by flipping video frames and road topology, effectively mitigating model overfitting. Our methods rank first in the test set of Track 3 for the ROAD++ Challenge 2024, and achieve 69% mAP.
Related papers
- Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024 [34.921509504848025]
Road++ Track3 proposes a multi-label atomic activity recognition task in traffic scenarios.
The robustness of visual feature extraction remains a key challenge.
The final mAP on the test set was 58%, which is 4% higher than the challenge baseline.
arXiv Detail & Related papers (2024-10-21T14:10:14Z) - 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z) - DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition [51.96660522869841]
DailyDVS-200 is a benchmark dataset tailored for the event-based action recognition community.
It covers 200 action categories across real-world scenarios, recorded by 47 participants, and comprises more than 22,000 event sequences.
DailyDVS-200 is annotated with 14 attributes, ensuring a detailed characterization of the recorded actions.
arXiv Detail & Related papers (2024-07-06T15:25:10Z) - 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z) - Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to
1st VCL Challenge -- Multi-Task Robustness Track [31.754017006309564]
We propose a framework named UniNet that seamlessly combines various visual perception algorithms into a multi-task model.
Specifically, we choose DETR3D, Mask2Former, and BinsFormer for 3D object detection, instance segmentation, and depth estimation tasks.
The final submission is a single model with InternImage-L backbone, and achieves a 49.6 overall score.
arXiv Detail & Related papers (2024-02-27T08:51:20Z) - 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D
Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z) - Self-Supervised Interactive Object Segmentation Through a
Singulation-and-Grasping Approach [9.029861710944704]
We propose a robot learning approach to interact with novel objects and collect each object's training label.
The Singulation-and-Grasping (SaG) policy is trained through end-to-end reinforcement learning.
Our system achieves 70% singulation success rate in simulated cluttered scenes.
arXiv Detail & Related papers (2022-07-19T15:01:36Z) - Tackling Background Distraction in Video Object Segmentation [7.187425003801958]
A video object segmentation (VOS) aims to densely track certain objects in videos.
One of the main challenges in this task is the existence of background distractors that appear similar to the target objects.
We propose three novel strategies to suppress such distractors.
Our model achieves a comparable performance to contemporary state-of-the-art approaches, even with real-time performance.
arXiv Detail & Related papers (2022-07-14T14:25:19Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end train bottomable-up approach to achieve instance mask predictions at the pixel-level granularity, instead of the typical region-proposals-based approach.
Our method provides competitive results on YouTube-VIS and DAVIS-19 datasets, and has minimum run-time compared to other contemporary state-of-the-art performance methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan
Synchronization [61.015704878681795]
We present a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for 3D point clouds.
The two non-trivial challenges posed by this multi-scan multibody setting are.
guaranteeing correspondence and segmentation consistency across multiple input point clouds and.
obtaining robust motion-based rigid body segmentation applicable to novel object categories.
arXiv Detail & Related papers (2021-01-17T06:36:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.