Fugu-MT Paper Translation (Abstract): Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Paper overview: Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

arxiv url: http://arxiv.org/abs/2606.07394v1
Date: Fri, 05 Jun 2026 15:32:48 GMT
Status: Translation complete
System update date: 2026-06-08 14:33:29.829357
Title: Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation
Authors: Danial Hamdi, Fardin Ayar, Mahdi Javanmardi,
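Abstract summary: Video Instance Segmentation (VIS) jointly evaluates classification, segmentation, and tracking objectives. The paper introduces a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP). It also introduces TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes.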
Reference score (independently computed attention score): 0.34410212782758043
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.
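To make the idea of an assignment oracle concrete, here is a minimal sketch in Python. It is not the paper's formulation, which the abstract does not spell out. It assumes a toy setting in which each frame supplies an IoU matrix between predicted queries (rows) and ground-truth instances (columns), and it solves the per-frame matching as a tiny integer program by enumeration. The function names `oracle_assignment` and `identity_switches` are hypothetical, introduced only for illustration.

```python
from itertools import permutations


def oracle_assignment(iou):
    """Best one-to-one matching of predictions to ground-truth instances.

    Maximizes sum_ij iou[i][j] * x[i][j] over binary x with exactly one
    ground-truth instance per prediction and at most one prediction per
    instance. Solved by brute-force enumeration, so only suitable for tiny
    inputs. Assumes the number of predictions (rows) does not exceed the
    number of ground-truth instances (columns).
    """
    n_pred = len(iou)
    n_gt = len(iou[0])
    best_score, best_match = -1.0, None
    # Each permutation assigns prediction i to ground-truth index perm[i].
    for perm in permutations(range(n_gt), n_pred):
        score = sum(iou[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_match = score, perm
    return best_match, best_score


def identity_switches(frames):
    """Count how often a query's oracle-matched identity changes between
    consecutive frames. A large count suggests unstable tracking."""
    switches = 0
    prev = None
    for iou in frames:
        match, _ = oracle_assignment(iou)
        if prev is not None:
            switches += sum(a != b for a, b in zip(prev, match))
        prev = match
    return switches
```

In this toy setting, a query that keeps the same oracle match across frames tracks its instance stably, while a nonzero switch count is a crude stand-in for the tracking gap the abstract describes. A real diagnostic would operate on masks over whole videos and would use an ILP or assignment solver rather than enumeration.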

Related papers