Fugu-MT Paper Translation (Abstract): Scalable Object Detection in the Car Interior With Vision Foundation Models

Paper abstract: Scalable Object Detection in the Car Interior With Vision Foundation Models

arxiv url: http://arxiv.org/abs/2508.19651v1
Date: Wed, 27 Aug 2025 07:58:57 GMT
Status: Translation complete
Last updated in system: 2025-08-28 19:07:41.549912
Title: Scalable Object Detection in the Car Interior With Vision Foundation Models
Title (reference translation): Scalable Object Detection in the Car Interior Using Vision Foundation Models
Authors: Bálint Mészáros, Ahmet Firintepe, Sebastian Schmidt, Stephan Günnemann
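Abstract summary: This work proposes a novel Object Detection and Localization (ODAL) framework for interior scene understanding. The approach leverages vision foundation models through a distributed architecture, splitting computational tasks between the on-board system and the cloud. To benchmark model performance, the authors introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Notably, the fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, a 71% improvement over its baseline performance, and outperforms GPT-4o by nearly 20%.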
Reference score (independently computed attention score): 42.958409172092225
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI tasks in the car interior, such as identifying and localizing externally introduced objects, are crucial for the response quality of personal assistants. However, the computational resources of on-board systems remain highly constrained, restricting the deployment of such solutions directly within the vehicle. To address this limitation, we propose the novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between the on-board system and the cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive assessment of detection and localization. Our analysis demonstrates the framework's potential to establish new standards in this domain. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and explore how fine-tuning enhances the lightweight model's performance. Remarkably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, representing a 71% improvement over its baseline performance and outperforming GPT-4o by nearly 20%. Furthermore, the fine-tuned model maintains high detection accuracy while significantly reducing hallucinations, achieving an ODAL$_{SNR}$ three times higher than GPT-4o.
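Abstract (reference translation): In-car AI tasks such as identifying and localizing externally introduced objects are essential to the response quality of personal assistants. However, the computational resources of on-board systems are highly constrained, which limits the deployment of such solutions inside the vehicle. To address this limitation, we propose a novel Object Detection and Localization (ODAL) framework for interior scene understanding. Our approach leverages vision foundation models through a distributed architecture, splitting computational tasks between the on-board system and the cloud. This design overcomes the resource constraints of running foundation models directly in the car. To benchmark model performance, we introduce ODALbench, a new metric for comprehensive evaluation of detection and localization. We compare the state-of-the-art GPT-4o vision foundation model with the lightweight LLaVA 1.5 7B model and examine how fine-tuning improves the lightweight model's performance. Notably, our fine-tuned ODAL-LLaVA model achieves an ODAL$_{score}$ of 89%, a 71% improvement over its baseline performance, and outperforms GPT-4o by nearly 20%. Furthermore, the fine-tuned model significantly reduces hallucinations while maintaining high detection accuracy, achieving an ODAL$_{SNR}$ three times that of GPT-4o.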
