Fugu-MT Paper Translation (Abstract): Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

Paper overview: Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

arxiv url: http://arxiv.org/abs/2603.27556v1
Date: Sun, 29 Mar 2026 07:39:31 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.02457
Title: Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method
Authors: Xiaoran Xu, Xiaoshan Yang, Jiangang Yang, Yifan Xu, Jian Liu, Changsheng Xu
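Abstract summary: Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. We provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability. We propose Progressive Domain-invariant Cross-modal Alignment (PICA).
Reference score (independently computed attention score): 59.30562121800656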
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and signal strength curriculum. It builds adaptive pseudo-word prototypes, refined via sample confidence and visual consistency, to enforce invariant cross-domain modality alignment. Our findings suggest that OVOD's robustness to domain shifts is intrinsically linked to the stability of the latent cross-modal alignment space. Our work provides both a challenging benchmark and a new perspective on building truly generalizable open-vocabulary systems that extend beyond static laboratory conditions.
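
Illustrative sketch (not from the paper): the abstract says PICA builds adaptive pseudo-word prototypes "refined via sample confidence and visual consistency" but gives no implementation details. One way such a refinement step might look is shown below in plain Python. Everything here is an assumption made for illustration: the function names, the weighting rule (detector confidence multiplied by the clipped cosine similarity to the current prototype), and the exponential-moving-average `momentum` parameter.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def refine_prototype(prototype, features, confidences, momentum=0.9):
    """Hypothetical prototype update for one category.

    Each sample is weighted by (confidence) x (clipped cosine similarity to
    the current prototype), so low-confidence or visually inconsistent
    samples contribute little. The weighted mean is then blended into the
    prototype with an exponential moving average.
    """
    proto = l2_normalize(prototype)
    feats = [l2_normalize(f) for f in features]
    weights = [c * max(0.0, cosine(f, proto)) for f, c in zip(feats, confidences)]
    total = sum(weights)
    if total == 0.0:
        # No trustworthy samples: keep the prototype as it is.
        return proto
    dim = len(proto)
    mean = [sum(w * f[i] for w, f in zip(weights, feats)) / total for i in range(dim)]
    mean = l2_normalize(mean)
    blended = [momentum * p + (1.0 - momentum) * m for p, m in zip(proto, mean)]
    return l2_normalize(blended)
```

In this sketch, a sample orthogonal to the prototype receives zero weight and leaves the prototype unchanged, while a confident, consistent sample pulls the prototype toward itself at a rate set by `momentum`.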


Related papers