Related papers: Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

URL: http://arxiv.org/abs/2508.11584v2
Date: Mon, 18 Aug 2025 05:11:18 GMT
Title: Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks
Authors: Jakub Łucki, Jonathan Becktor, Georgios Georgakis, Rob Royce, Shehryar Khattak
Abstract summary: Visual Perception Engine (VPEngine) is a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared across multiple specialized task-specific model heads running in parallel. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT-optimized models.
Score: 6.943057640797408
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU-CPU memory transfers, across multiple specialized task-specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models, while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth, object detection, and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi-Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per-task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source, with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end-to-end real-time performance at $\geq$50 Hz on NVIDIA Jetson Orin AGX for TensorRT-optimized models.
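Illustrative sketch (not from the paper): the shared-backbone idea described in the abstract can be pictured with a toy, single-process Python example. Features are extracted once per frame and reused by every head, and each head can be set to run only on every n-th frame. All names here (SharedBackbonePipeline, every_n, set_rate) are hypothetical and are not VPEngine's API; the sketch uses plain Python callables and does not model CUDA MPS, multiple processes, or GPU memory sharing.

```python
from typing import Any, Callable, Dict


class SharedBackbonePipeline:
    """Toy model: one feature extraction per frame, reused by all heads."""

    def __init__(
        self,
        backbone: Callable[[Any], Any],
        heads: Dict[str, Callable[[Any], Any]],
        every_n: Dict[str, int],
    ) -> None:
        self.backbone = backbone
        self.heads = heads
        # Hypothetical rate control: head name -> run on every n-th frame.
        self.every_n = dict(every_n)
        self.backbone_calls = 0
        self.frame_idx = 0

    def set_rate(self, name: str, n: int) -> None:
        """Change how often a head runs; takes effect on the next frame."""
        self.every_n[name] = n

    def step(self, frame: Any) -> Dict[str, Any]:
        # The backbone runs exactly once, regardless of the number of heads.
        features = self.backbone(frame)
        self.backbone_calls += 1
        outputs = {}
        for name, head in self.heads.items():
            if self.frame_idx % self.every_n.get(name, 1) == 0:
                outputs[name] = head(features)
        self.frame_idx += 1
        return outputs


# Stand-in "models": simple list operations instead of neural networks.
pipeline = SharedBackbonePipeline(
    backbone=lambda frame: [2 * x for x in frame],
    heads={
        "depth": sum,
        "detection": max,
        "segmentation": lambda feats: [v > 2 for v in feats],
    },
    every_n={"depth": 1, "detection": 2, "segmentation": 1},
)

first = pipeline.step([1, 2, 3])   # all three heads run on frame 0
second = pipeline.step([1, 2, 3])  # "detection" is skipped on frame 1
```

In this toy setup, a sequential deployment with three independent models would run feature extraction three times per frame, whereas the pipeline above runs it once; how much that saves in practice depends on the relative cost of the backbone and the heads.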

Related papers

HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving [47.368036613468455]
We present the HENet and HENet++ framework for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module.
arXiv Detail & Related papers (2025-11-10T13:49:59Z)
M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception [4.329662126907974]
Multi-Mono-Hydra (M2H) is a novel multi-task learning framework for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment.
arXiv Detail & Related papers (2025-10-20T10:03:31Z)
RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks. RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model. Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras [45.739224968302565]
We present an end-to-end framework named HENet for multi-task 3D perception. Specifically, we propose a hybrid image encoding network, using a large image encoder for short-term frames and a small image encoder for long-term temporal frames. According to the characteristics of each perception task, we utilize BEV features of different grid sizes, independent BEV encoders, and task decoders for different tasks.
arXiv Detail & Related papers (2024-04-03T07:10:18Z)
GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z)
Multi-task Learning with 3D-Aware Regularization [55.97507478913053]
We propose a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance.
arXiv Detail & Related papers (2023-10-02T08:49:56Z)
A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks. These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
Fast GraspNeXt: A Fast Self-Attention Neural Network Architecture for Multi-task Learning in Computer Vision Tasks for Robotic Grasping on the Edge [80.88063189896718]
High architectural and computational complexity can result in poor suitability for deployment on embedded devices. Fast GraspNeXt is a fast self-attention neural network architecture tailored for embedded multi-task learning in computer vision tasks for robotic grasping.
arXiv Detail & Related papers (2023-04-21T18:07:14Z)
ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills [24.150758623016195]
We present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark for generalizable manipulation skills. ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames. It defines a unified interface and evaluation protocol to support a wide range of algorithms. It empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS.
arXiv Detail & Related papers (2023-02-09T14:24:01Z)
MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks. Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
Efficient Multi-Organ Segmentation Using SpatialConfiguration-Net with Low GPU Memory Requirements [8.967700713755281]
In this work, we employ a multi-organ segmentation model based on the SpatialConfiguration-Net (SCN). We modified the architecture of the segmentation model to reduce its memory footprint without drastically impacting the quality of the predictions. Lastly, we implemented a minimal inference script for which we optimized both execution time and required GPU memory.
arXiv Detail & Related papers (2021-11-26T17:47:10Z)
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach [16.702537371391053]
This article presents an automatic approach to derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. Compared to the single-stream version, our approach achieves a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
arXiv Detail & Related papers (2020-03-05T21:18:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.