ActFormer: Scalable Collaborative Perception via Active Queries
- URL: http://arxiv.org/abs/2403.04968v1
- Date: Fri, 8 Mar 2024 00:45:18 GMT
- Title: ActFormer: Scalable Collaborative Perception via Active Queries
- Authors: Suozhi Huang, Juexiao Zhang, Yiming Li, Chen Feng
- Abstract summary: Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view.
We present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs.
Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries.
- Score: 12.020585564801781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Collaborative perception leverages rich visual observations from multiple
robots to extend a single robot's perception ability beyond its field of view.
Many prior works receive messages broadcast from all collaborators, leading to
a scalability challenge when dealing with a large number of robots and sensors.
In this work, we aim to address scalable camera-based collaborative
perception with a Transformer-based architecture. Our key idea is to enable a
single robot to intelligently discern the relevance of the collaborators and
their associated cameras according to a learned spatial prior. This proactive
understanding of the visual features' relevance does not require the
transmission of the features themselves, enhancing both communication and
computation efficiency. Specifically, we present ActFormer, a Transformer that
learns bird's eye view (BEV) representations by using predefined BEV queries to
interact with multi-robot multi-camera inputs. Each BEV query can actively
select relevant cameras for information aggregation based on pose information,
instead of interacting with all cameras indiscriminately. Experiments on the
V2X-Sim dataset demonstrate that ActFormer improves the detection performance
from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries,
showcasing the effectiveness of ActFormer in multi-agent collaborative 3D
object detection.
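To make the active-query idea concrete, below is a minimal, hypothetical PyTorch sketch written for this summary; it is not the authors' implementation. Each BEV query is anchored at a ground-plane location, a simple pose-based geometric test (standing in for the learned spatial prior described in the abstract) decides which collaborator cameras are relevant, and cross-attention is then masked so each query aggregates features only from its selected cameras. All module names, shapes, and the frustum test are illustrative assumptions.
```python
# Hypothetical sketch of pose-based active camera selection for BEV queries.
# Not the ActFormer code: names, shapes, and the geometric test are assumptions.
import torch
import torch.nn as nn


def select_relevant_cameras(bev_xy, cam_positions, cam_forward, max_range=50.0):
    """Boolean relevance mask of shape [num_queries, num_cameras].

    A camera counts as relevant to a BEV query if the query's ground-plane
    location lies in front of the camera and within range; the nearest camera
    is always kept. A learned spatial prior would replace this hand-written test.
    """
    rel = bev_xy[:, None, :] - cam_positions[None, :, :]       # [Q, C, 2]
    dist = rel.norm(dim=-1)                                    # [Q, C]
    in_front = (rel * cam_forward[None, :, :]).sum(-1) > 0.0   # [Q, C]
    mask = in_front & (dist < max_range)
    mask[torch.arange(bev_xy.shape[0]), dist.argmin(dim=1)] = True
    return mask


class ActiveBEVCrossAttention(nn.Module):
    """Cross-attention from BEV queries to multi-robot, multi-camera features,
    with irrelevant cameras masked out instead of attended to."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, bev_queries, cam_feats, relevance):
        # bev_queries: [Q, D]; cam_feats: [C, N, D]; relevance: [Q, C] bool
        C, N, D = cam_feats.shape
        keys = cam_feats.reshape(1, C * N, D)            # flatten all camera tokens
        ignore = ~relevance.repeat_interleave(N, dim=1)  # [Q, C*N]; True = do not attend
        out, _ = self.attn(bev_queries.unsqueeze(0), keys, keys, attn_mask=ignore)
        return out.squeeze(0)                            # refined BEV queries, [Q, D]


if __name__ == "__main__":
    Q, C, N, D = 16, 6, 32, 256                   # queries, cameras, tokens/camera, dim
    bev_xy = torch.rand(Q, 2) * 100 - 50          # BEV query anchors (meters)
    cam_pos = torch.rand(C, 2) * 100 - 50         # collaborator camera positions
    cam_fwd = nn.functional.normalize(torch.randn(C, 2), dim=-1)
    relevance = select_relevant_cameras(bev_xy, cam_pos, cam_fwd)
    refined = ActiveBEVCrossAttention(D)(torch.randn(Q, D), torch.randn(C, N, D), relevance)
    print(refined.shape)  # torch.Size([16, 256])
```
Masking the attention rather than attending to every camera reflects the efficiency argument in the abstract: relevance can be decided from pose metadata alone, before any visual features are transmitted.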
Related papers
- Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
- IFTR: An Instance-Level Fusion Transformer for Visual Collaborative Perception [9.117534139771738]
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving.
Current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images.
This work proposes an instance-level fusion transformer for visual collaborative perception.
arXiv Detail & Related papers (2024-07-13T11:38:15Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Polybot: Training One Policy Across Robots While Embracing Variability [70.74462430582163]
We propose a set of key design decisions to train a single policy for deployment on multiple robotic platforms.
Our framework first aligns the policy's observation and action spaces across embodiments by using wrist cameras.
We evaluate our method on a dataset collected over 60 hours spanning 6 tasks and 3 robots with varying joint configurations and sizes.
arXiv Detail & Related papers (2023-07-07T17:21:16Z)
- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training [25.50131893785007]
This work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot.
We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion.
We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously.
arXiv Detail & Related papers (2022-09-22T16:20:17Z)
- CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers [36.838065731893735]
CoBEVT is the first generic multi-agent perception framework that can cooperatively generate BEV map predictions.
CoBEVT achieves state-of-the-art performance for cooperative BEV semantic segmentation.
arXiv Detail & Related papers (2022-07-05T17:59:28Z)
- Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation [15.632809977544907]
Learning to solve precision-based manipulation tasks from visual feedback could drastically reduce the engineering efforts required by traditional robot systems.
We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.
To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism.
arXiv Detail & Related papers (2022-01-19T18:39:03Z)
- CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification [53.56290185900837]
We propose an algorithm for optimizing a neural network for object detection using preliminary binary frame classification.
An autonomous mobile robot with 6 rolling-shutter cameras on the perimeter providing a 360-degree field of view was used as the experimental setup.
arXiv Detail & Related papers (2021-10-22T15:05:37Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.