RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving
- URL: http://arxiv.org/abs/2508.06529v1
- Date: Sat, 02 Aug 2025 16:34:24 GMT
- Title: RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving
- Authors: Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang
- Abstract summary: RMT-PPAD is a real-time, transformer-based multi-task model. It jointly performs object detection, drivable area segmentation, and lane line segmentation. The results show that RMT-PPAD consistently delivers stable performance.
- Score: 18.945598464194607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.
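The abstract describes two mechanisms: a gate that adaptively fuses shared and task-specific features, and a decoder that learns weights over multi-scale features. The NumPy sketch below is a simplified stand-in for those two ideas; the shapes, the sigmoid-gate layout, and the softmax scale-weighting are illustrative assumptions, not the paper's actual implementation (see the released source code for that).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_fusion(shared, task_specific, w_gate, b_gate):
    """Gate-controlled fusion (simplified): a sigmoid gate computed from
    both feature streams decides, per channel, how much of the shared
    versus the task-specific signal to keep."""
    gate = sigmoid(np.concatenate([shared, task_specific]) @ w_gate + b_gate)
    return gate * shared + (1.0 - gate) * task_specific

def adaptive_multiscale(features, scale_logits):
    """Adaptive segmentation decoder (simplified): instead of hand-designing
    which pyramid levels feed each task, softmax-normalized weights over the
    scales are learned during training."""
    weights = softmax(scale_logits)
    return sum(w * f for w, f in zip(weights, features))

rng = np.random.default_rng(0)
c = 16  # channel dimension (illustrative)
shared = rng.normal(size=c)
task = rng.normal(size=c)
w_gate = 0.1 * rng.normal(size=(2 * c, c))
fused = gated_fusion(shared, task, w_gate, np.zeros(c))

pyramid = [rng.normal(size=c) for _ in range(3)]  # three feature "scales"
decoded = adaptive_multiscale(pyramid, np.array([0.1, 0.7, 0.2]))
print(fused.shape, decoded.shape)
```

In a real model the gate weights and scale logits would be trained jointly with the rest of the network; here they are fixed only to show the data flow.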
Related papers
- Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases [3.4869850730657728]
We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases.
Our approach constructs a task-aware vector database by embedding training examples from 22 tasks spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis.
Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition.
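The retrieval step this summary describes can be sketched as a cosine-similarity lookup over frozen embeddings. The adapter names, the centroid representation, and the top-k selection below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve_adapters(query_emb, adapter_db, top_k=2):
    """Rank LoRA adapters by similarity between the query embedding and
    each adapter's task centroid (e.g. the mean embedding of its training
    examples), then return the top-k adapter names for composition."""
    ranked = sorted(adapter_db.items(),
                    key=lambda kv: cosine(query_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy database: each adapter is represented by a frozen centroid embedding.
adapter_db = {
    "commonsense": np.array([1.0, 0.0, 0.0]),
    "sentiment":   np.array([0.0, 1.0, 0.0]),
    "nli":         np.array([0.0, 0.0, 1.0]),
}
query = np.array([0.9, 0.1, 0.0])  # closest to the commonsense centroid
print(retrieve_adapters(query, adapter_db))
```

Because the embeddings are frozen and the lookup is a plain similarity ranking, no retriever training is needed, which matches the "no additional retriever training" claim above.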
arXiv Detail & Related papers (2026-02-01T22:20:04Z)
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations.
RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks.
Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2025-12-30T06:50:11Z)
- Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving [3.617580194719686]
This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scenes.
RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset.
It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38× higher throughput on edge devices.
arXiv Detail & Related papers (2025-02-11T09:54:09Z)
- Pilot: Building the Federated Multimodal Instruction Tuning Framework [79.56362403673354]
Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM.
In stage 1, we extract task-specific features and client-specific features from visual information.
In stage 2, we build the cross-task Mixture-of-Adapters (CT-MoA) module to perform cross-task interaction.
arXiv Detail & Related papers (2025-01-23T07:49:24Z)
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring the pretrained models to downstream tasks may encounter task discrepancy, since pretraining is formulated as image classification or object discrimination tasks.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation.
It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation.
We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM).
It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
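The prompt-driven, dynamic convolution-based decoding mentioned in this summary can be sketched as generating convolution weights from a prompt embedding and applying them to the image features. The 1×1-kernel simplification and all shapes below are assumptions for illustration, not RMP-SAM's actual decoder.

```python
import numpy as np

def dynamic_conv_decode(features, prompt_emb, w_gen):
    """Prompt-driven decoding via dynamic convolution (simplified):
    a small generator maps the prompt embedding to a per-channel 1x1
    kernel, which is then convolved with the feature map to produce
    mask logits for that prompt."""
    kernel = prompt_emb @ w_gen                          # (C,) dynamic 1x1 kernel
    return np.tensordot(kernel, features, axes=(0, 0))  # (H, W) mask logits

rng = np.random.default_rng(1)
C, H, W, D = 8, 4, 4, 6
features = rng.normal(size=(C, H, W))  # encoder output
prompt = rng.normal(size=D)            # e.g. a click / box / category embedding
w_gen = 0.1 * rng.normal(size=(D, C))  # kernel-generator weights
logits = dynamic_conv_decode(features, prompt, w_gen)
print(logits.shape)
```

Because only the small kernel depends on the prompt, the heavy feature extraction runs once and each prompt costs a single lightweight convolution, which is what makes this style of decoding attractive for real-time use.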
arXiv Detail & Related papers (2024-01-18T18:59:30Z)
- Mobile-Seed: Joint Semantic Segmentation and Boundary Detection for Mobile Robots [17.90723909170376]
We introduce Mobile-Seed, a lightweight framework for simultaneous semantic segmentation and boundary detection.
Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach.
Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline.
arXiv Detail & Related papers (2023-11-21T14:53:02Z)
- You Only Look at Once for Real-time and Generic Multi-Task [20.61477620156465]
A-YOLOM is an adaptive, real-time, and lightweight multi-task model.
We develop an end-to-end multi-task model with a unified and streamlined segmentation structure.
We achieve competitive results on the BDD100k dataset.
arXiv Detail & Related papers (2023-10-02T21:09:43Z)
- Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving [100.3848723827869]
We present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting.
Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories.
We bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving.
arXiv Detail & Related papers (2023-03-03T08:54:06Z)
- Driver Maneuver Detection and Analysis using Time Series Segmentation and Classification [7.413735713939367]
This paper implements a methodology for automatically detecting vehicle maneuvers from vehicle telemetry data under naturalistic driving settings.
Our objective is to develop an end-to-end pipeline for frame-by-frame annotation of naturalistic driving studies videos.
arXiv Detail & Related papers (2022-11-10T03:38:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.