Fugu-MT Paper Translation (Abstract): When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Paper abstract: When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

arxiv url: http://arxiv.org/abs/2510.11302v1
Date: Mon, 13 Oct 2025 11:48:48 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.345916
Title: When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Authors: Samer Al-Hamadani
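Abstract summary: This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection with zero-shot VLM inference. Supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories. This advantage justifies the investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year.
Reference score (site-computed attention score): 0.0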
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.
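The arithmetic behind the figures quoted in the abstract can be checked with a short sketch. The abstract does not define Cost per Correct Detection, so the formula below (total spend divided by the number of correct detections) is an assumption, and the function and variable names are hypothetical. Only numbers stated in the abstract are used as inputs.

```python
# Figures quoted in the abstract.
YOLO_ACCURACY = 0.912
GEMINI_ACCURACY = 0.685
BREAK_EVEN_INFERENCES = 55_000_000
CCD_INFERENCES = 100_000
YOLO_CCD = 0.143      # dollars per correct detection at 100,000 inferences
GEMINI_CCD = 0.00050  # dollars per correct detection at 100,000 inferences


def cost_per_correct_detection(total_cost, inferences, accuracy):
    """Assumed definition: total spend divided by the number of correct detections."""
    return total_cost / (inferences * accuracy)


def implied_total_cost(ccd, inferences, accuracy):
    """Invert the assumed definition to recover the spend behind a quoted CCD."""
    return ccd * inferences * accuracy


# Accuracy gap in percentage points, and the break-even volume spread over one year.
accuracy_gap_points = (YOLO_ACCURACY - GEMINI_ACCURACY) * 100
daily_images = BREAK_EVEN_INFERENCES / 365

# Total spend implied by the quoted CCD values under the assumed definition.
yolo_total = implied_total_cost(YOLO_CCD, CCD_INFERENCES, YOLO_ACCURACY)
gemini_total = implied_total_cost(GEMINI_CCD, CCD_INFERENCES, GEMINI_ACCURACY)

print(f"accuracy gap: {accuracy_gap_points:.1f} percentage points")
print(f"break-even daily volume: {daily_images:,.0f} images")
print(f"implied YOLO spend at 100,000 inferences: ${yolo_total:,.2f}")
print(f"implied Gemini spend at 100,000 inferences: ${gemini_total:,.2f}")
```

The first two printed values reproduce the abstract's 22.7 point gap and roughly 151,000 images per day. The implied spend figures hold only if the assumed definition matches the paper's; the full Total Cost of Ownership model may include cost terms that the abstract does not list.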

Related papers

The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs [3.9977256267361754]
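This study introduces Nazonazo, a cost-effective evaluation benchmark built from Japanese children's riddles. No model other than GPT-5 matches human performance; the reported mean accuracy is 52.9%.
Paper reference translation (metadata) (2025-09-18T07:50:04Z)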
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
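Experiments show that the ASDA model achieves state-of-the-art (SOTA) performance on multiple benchmarks. These results highlight ASDA's effectiveness on audio tasks and pave the way for broader applications.
Paper reference translation (metadata) (2025-07-03T14:29:43Z)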
Technical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models [3.938980910007962]
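Confidence calibration is an emerging challenge in real-world decision-making systems based on foundation models. This work proposes a penalty, incorporated into the loss objective during fine-tuning, that penalizes misclassifications. The authors call it the confidence misalignment penalty (CMP).
Paper reference translation (metadata) (2025-01-29T11:54:37Z)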
FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids [53.2306792009435]
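FaultGuard is the first framework for fault type and zone classification that is resilient to adversarial attacks. The paper proposes a low-complexity fault prediction model and an online adversarial training technique to improve robustness. The model outperforms the state of the art on a resilient fault prediction benchmark, with an accuracy of up to 0.958.
Paper reference translation (metadata) (2024-03-26T08:51:23Z)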
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
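The proposed MAD-Bench is a benchmark containing 1,000 test samples divided into five categories, such as non-existent objects, object counts, and spatial relationships. The authors comprehensively analyze popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3. GPT-4o achieves 82.82% accuracy on MAD-Bench, whereas the accuracy of the other models in the experiments ranges from 9% to 50%.
Paper reference translation (metadata) (2024-02-20T18:31:27Z)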
Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
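Contrastive Language-Image Pre-training (CLIP) provides a foundation model that integrates natural language with visual concepts. With well-designed text prompts, satisfactory overall accuracy is usually expected across many domains. However, performance on the worst-performing categories is found to be significantly lower than overall performance.
Paper reference translation (metadata) (2023-10-05T05:37:33Z)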
BEA: Revisiting anchor-based object detection DNN using Budding Ensemble Architecture [8.736601342033431]
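Budding Ensemble Architecture (BEA) is a novel reduced ensemble architecture for anchor-based object detection models. The loss functions in BEA improve the calibration of confidence scores and reduce uncertainty.
Paper reference translation (metadata) (2023-09-14T21:54:23Z)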
Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
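ADCLR is a self-supervised learning framework for learning accurate and dense visual representations. The proposed method achieves new state-of-the-art performance among contrastive methods.
Paper reference translation (metadata) (2023-06-23T07:38:09Z)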
What Can We Learn From The Selective Prediction And Uncertainty Estimation Performance Of 523 Imagenet Classifiers [15.929238800072195]
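This paper presents the selective prediction and uncertainty estimation performance of 523 existing pretrained deep ImageNet classifiers. It finds that distillation-based training regimes consistently yield better uncertainty estimation than other training schemes. For example, it reports an unprecedented 99% top-1 selective accuracy on ImageNet at 47% coverage.
Paper reference translation (metadata) (2023-02-23T09:25:28Z)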
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection [85.53263670166304]
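One-stage detectors basically formulate object detection as dense classification and localization. A recent trend in one-stage detectors is to introduce a separate prediction branch to estimate localization quality. This paper addresses the three fundamental elements above: quality estimation, classification, and localization.
Paper reference translation (metadata) (2020-06-08T07:24:33Z)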

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

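This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from the use of this site (including all information and translations).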