LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition
- URL: http://arxiv.org/abs/2410.21108v2
- Date: Tue, 10 Dec 2024 05:09:16 GMT
- Title: LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition
- Authors: Naga Venkata Sai Raviteja Chappa, Khoa Luu,
- Abstract summary: LiGAR is a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition.
Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module.
LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics.
- Score: 9.103869144049014
- License:
- Abstract: Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.
Related papers
- Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning [51.54046200512198]
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models.
A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation.
To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent.
arXiv Detail & Related papers (2025-01-25T14:24:50Z) - LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes [55.33167217384738]
LiMoE is a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning.
Our approach consists of three stages: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS)
arXiv Detail & Related papers (2025-01-07T18:59:58Z) - Band Prompting Aided SAR and Multi-Spectral Data Fusion Framework for Local Climate Zone Classification [20.71392764471532]
Local climate zone (LCZ) classification is of great value for understanding the complex interactions between urban development and local climate.
Recent studies have increasingly focused on the fusion of synthetic aperture radar (SAR) and multi-spectral data to improve LCZ classification performance.
In this paper, a novel band prompting aided data fusion framework is proposed for LCZ classification, namely BP-LCZ.
The experimental results demonstrate the effectiveness and superiority of the proposed data fusion framework.
arXiv Detail & Related papers (2024-12-24T07:40:07Z) - Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs [23.357843519762483]
Recent studies have demonstrated that leveraging the Retrieval-Augmented Generation framework, combined with Knowledge Graphs, robustly enhances the reasoning capabilities of Large language models.
We introduce a Multi-objective Multi-Armed Bandit enhanced RAG framework, supported by multiple retrieval methods with diverse capabilities.
Our method significantly outperforms baseline methods in non-stationary settings while achieving state-of-the-art performance in stationary environments.
arXiv Detail & Related papers (2024-12-10T15:56:03Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition [12.382193259575805]
We propose a novel multi-modality co-learning (MMCL) framework for efficient skeleton-based action recognition.
Our MMCL framework engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference.
arXiv Detail & Related papers (2024-07-22T15:16:47Z) - Multi-Space Alignments Towards Universal LiDAR Segmentation [50.992103482269016]
M3Net is a one-of-a-kind framework for fulfilling multi-task, multi-dataset, multi-modality LiDAR segmentation.
We first combine large-scale driving datasets acquired by different types of sensors from diverse scenes.
We then conduct alignments in three spaces, namely data, feature, and label spaces, during the training.
arXiv Detail & Related papers (2024-05-02T17:59:57Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR
Perception [15.919789515451615]
We introduce a new LiDAR multi-task learning paradigm based on the transformer.
LiDARFormer exploits cross-task synergy to boost the performance of LiDAR perception tasks.
LiDARFormer is evaluated on the large-scale nuScenes and the Open datasets for both 3D detection and semantic segmentation tasks.
arXiv Detail & Related papers (2023-03-21T20:52:02Z) - Benchmarking the Robustness of LiDAR Semantic Segmentation Models [78.6597530416523]
In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions.
We propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy.
We design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications.
arXiv Detail & Related papers (2023-01-03T06:47:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.