Scalable Object Detection in the Car Interior With Vision Foundation Models
- URL: http://arxiv.org/abs/2508.19651v1
- Date: Wed, 27 Aug 2025 07:58:57 GMT
- Title: Scalable Object Detection in the Car Interior With Vision Foundation Models
- Authors: Bálint Mészáros, Ahmet Firintepe, Sebastian Schmidt, Stephan Günnemann
- Abstract summary: We propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board systems and the cloud. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%.
- Score: 42.958409172092225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, the computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between on-board systems and the cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.
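The abstract does not give the ODALbench formulas, but the pairing of a detection/localization score with a hallucination-sensitive SNR can be illustrated with a hedged sketch. Everything below is hypothetical: the `(object_label, zone)` representation, the F1-style score, and the SNR defined as correct detections over hallucinations are plausible stand-ins, not the paper's actual definitions.

```python
# Hypothetical sketch of an ODALbench-style evaluation.
# Predictions and ground truth are modeled as (object_label, zone) pairs,
# e.g. ("backpack", "rear-left seat"); the real metric may differ.

def odal_style_scores(predictions, ground_truth):
    pred, gt = set(predictions), set(ground_truth)
    true_pos = len(pred & gt)      # detected AND localized correctly
    hallucinated = len(pred - gt)  # reported objects that are not present

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gt) if gt else 1.0
    # F1-style combined score as a stand-in for ODAL_score
    score = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
    # Signal-to-noise stand-in for ODAL_SNR: correct vs. hallucinated
    snr = true_pos / hallucinated if hallucinated else float("inf")
    return score, snr

score, snr = odal_style_scores(
    [("backpack", "rear-left seat"), ("phone", "center console")],
    [("backpack", "rear-left seat"), ("bottle", "footwell")],
)
```

Under a metric of this shape, reducing hallucinations raises the SNR even when the detection score is unchanged, which matches the abstract's observation that the fine-tuned model keeps accuracy high while tripling the SNR relative to GPT-4o.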
Related papers
- VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling [60.341503853471494]
We show that vision-language-action models degrade sharply under novel camera viewpoints and visual perturbations. We propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates.
arXiv Detail & Related papers (2025-12-02T16:16:13Z)
- An Analytical Framework to Enhance Autonomous Vehicle Perception for Smart Cities [1.9923531555025622]
There is a need to develop a model that accurately perceives multiple objects on the road and predicts the driver's perception to control the car's movements. This article proposes a novel utility-based analytical model that enables perception systems of AVs to understand the driving environment.
arXiv Detail & Related papers (2025-10-15T07:34:22Z)
- RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
- Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI [0.0]
This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRA), incorporating generated reasoning along with visual grounding and evaluation capabilities.
arXiv Detail & Related papers (2025-05-09T09:01:52Z)
- GADS: A Super Lightweight Model for Head Pose Estimation [0.0]
Grouped Attention Deep Sets (GADS) is a novel architecture based on the Deep Set framework. By grouping landmarks into regions, we reduce computational complexity. Our model is $7.5\times$ smaller and executes $25\times$ faster than the current lightest state-of-the-art model.
arXiv Detail & Related papers (2025-04-22T09:53:25Z)
- A Light Perspective for 3D Object Detection [46.23578780480946]
This paper introduces a novel approach that incorporates cutting-edge Deep Learning techniques into the feature extraction process. Our model, NextBEV, surpasses established feature extractors like ResNet50 and MobileNetV3. By fusing these lightweight proposals, we have enhanced the accuracy of the VoxelNet-based model by 2.93% and improved the F1-score of the PointPillar-based model by approximately 20%.
arXiv Detail & Related papers (2025-03-10T10:03:23Z)
- Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (mAP) of approximately 45.7%, a significant improvement over the baseline.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z)
- Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gains over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or exceeds the sample efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.