Fugu-MT Paper Translation (Abstract): Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics

Paper overview: Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics

arxiv url: http://arxiv.org/abs/2603.13557v1
Date: Fri, 13 Mar 2026 19:49:39 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.270897
Title: Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics
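Title (reference translation): Performance evaluation of deep learning models for image analysis: considerations for visual control and statistical metrics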
Authors: Christof A. Bertram, Jonas Ammeling, Alexander Bartel, Gillian Beamer, Marc Aubreville
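Abstract summary: Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. The use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples. To ensure that DL-AIA applications are safe and reliable, it is critical to conduct a thorough and objective generalization performance assessment.
Reference score (attention score computed independently by this site): 38.007806456084296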
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. Owing to these capabilities, the use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples (diagnostic pathology), regulatory safety assessment (toxicologic pathology), and recurrent research tasks. To ensure that DL-AIA applications are safe and reliable, it is critical to conduct a thorough and objective generalization performance assessment (i.e., the ability of the algorithm to accurately predict patterns of interest) and possibly evaluate model robustness (i.e., the algorithm's capacity to maintain predictive accuracy on images from different sources). In this article, we review the practices for performance assessment in veterinary pathology publications, in which two approaches were identified: 1) exclusive visual performance control (i.e., eyeballing of algorithmic predictions) plus validation of the model's application utilizing secondary performance indices, and 2) statistical performance control (alongside the other methods), which requires creating a dataset and separating a hold-out test set prior to model training. This article compares the strengths and weaknesses of statistical and visual performance control methods. Furthermore, we discuss relevant considerations for rigorous statistical performance evaluation, including metric selection, test dataset image composition, ground truth label quality, resampling methods such as bootstrapping, statistical comparison of multiple models, and evaluation of model stability. It is our conclusion that visual and statistical evaluation have complementary strengths and that a combination of both provides the greatest insight into the DL model's performance and sources of error.
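Abstract (reference translation): Deep learning-based automated image analysis (DL-AIA) has been shown to outperform trained pathologists in tasks related to feature quantification. In connection with these capabilities, the use of DL-AIA tools is currently extending from proof-of-principle studies to routine applications such as patient samples (diagnostic pathology), regulatory safety assessment (toxicologic pathology), and recurrent research tasks. To ensure that DL-AIA applications are safe and reliable, it is important to carry out a thorough and objective assessment of generalization performance (that is, the algorithm's ability to accurately predict patterns of interest) and to evaluate model robustness (that is, the algorithm's ability to maintain predictive accuracy on images from different sources). This article reviews performance assessment practices in veterinary pathology publications, in which two approaches were identified: 1) exclusive visual performance control (that is, eyeballing of algorithmic predictions) together with validation of the model's application using secondary performance indices, and 2) statistical performance control (in addition to the other methods), which requires creating a dataset and separating a test set before model training. The article compares the strengths and weaknesses of statistical and visual performance control methods. It further discusses considerations for rigorous statistical performance evaluation, including metric selection, image composition of the test dataset, ground truth label quality, resampling methods such as bootstrapping, statistical comparison of multiple models, and evaluation of model stability. The conclusion is that visual and statistical evaluation have complementary strengths and that combining the two gives the deepest insight into a DL model's performance and sources of error.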
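The abstract names bootstrapping on a hold-out test set as one consideration for statistical performance evaluation. The following is a minimal sketch of that idea, not the authors' procedure: it resamples test cases with replacement and reports a percentile confidence interval for a metric. The toy labels, the choice of F1 as the metric, the function names `f1_score` and `bootstrap_ci`, the number of resamples, and the percentile interval are all assumptions made for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric=f1_score,
                 n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap interval for a metric."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Draw n test cases with replacement, keeping label/prediction pairs.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return metric(y_true, y_pred), lower, upper


# Hypothetical hold-out test labels and model predictions (toy data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0] * 5
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0] * 5

point, lo, hi = bootstrap_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

In a real study the resampling unit would probably need to match the data's structure (for example, resampling whole cases or slides rather than individual image patches), which this sketch does not address.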

Related papers

STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
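We propose STAR, a framework that bridges data-driven statistical prediction with knowledge-driven agentic reasoning. We show that STAR consistently outperforms both score-based and rank-based baselines.
Paper reference translation (metadata) (2026-02-12T16:30:07Z)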
A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysis [1.6690512882610855]
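Deep learning models can be used to continuously monitor physiological parameters outside the clinic. When deployed in practical measurement scenarios, they risk performing poorly, which can lead to negative patient outcomes. Here, we implement eight uncertainty quantification (UQ) techniques for models trained on two clinically relevant prediction tasks.
Paper reference translation (metadata) (2025-10-31T22:54:13Z)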
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
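ARISE (Adaptive Resolution-Aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across a variety of domains.
Paper reference translation (metadata) (2025-10-07T15:10:51Z)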
Testing and Improving the Robustness of Amortized Bayesian Inference for Cognitive Models [0.5223954072121659]
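Contaminated observations and outliers often cause problems when estimating the parameters of cognitive models. In this study, we test and improve the robustness of parameter estimation using amortized Bayesian inference. The proposed method is simple and practical to implement and is applicable to fields where outlier detection or removal is difficult.
Paper reference translation (metadata) (2024-12-29T21:22:24Z)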
How Deep is your Guess? A Fresh Perspective on Deep Learning for Medical Time-Series Imputation [6.547981908229007]
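We show how architectural and framework biases affect model performance. Experiments show performance variations of up to 20% depending on preprocessing and implementation choices. We identify discrepancies between current deep imputation methods and medical requirements.
Paper reference translation (metadata) (2024-07-11T12:33:28Z)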
Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
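We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA). The proposed model shows superior consistency with human visual perception compared with state-of-the-art BIQA models.
Paper reference translation (metadata) (2024-05-29T06:09:34Z)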
Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
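Missing data can pose a challenge for machine learning (ML) modeling. Current approaches are categorized into feature imputation and label prediction. This work proposes a contrastive learning framework for modeling on observed data with missing values.
Paper reference translation (metadata) (2023-09-18T13:16:24Z)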
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
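We argue that psychometrics, a theory for human assessment originating in the 20th century, could be a powerful solution to the challenges in today's AI evaluation.
Paper reference translation (metadata) (2023-06-18T09:54:33Z)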
Robustness and Generalization Performance of Deep Learning Models on Cyber-Physical Systems: A Comparative Study [71.84852429039881]
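The study focuses on the models' ability to handle a variety of perturbations, such as sensor faults and noise. We test the generalization and transfer learning capabilities of these models by exposing them to out-of-distribution (OOD) samples.
Paper reference translation (metadata) (2023-06-13T12:43:59Z)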
Explaining medical AI performance disparities across sites with confounder Shapley value analysis [8.785345834486057]
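Multi-site evaluations are key to diagnosing such disparities. Our framework provides a method for quantifying how much each type of bias contributes to the overall performance difference. We demonstrate its utility using a deep learning model that detects the presence of pneumothorax.
Paper reference translation (metadata) (2021-11-12T18:54:10Z)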

The related papers list is generated automatically from the titles and abstracts of papers on this site.
