Fugu-MT Paper Translation (Summary): Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

arxiv url: http://arxiv.org/abs/2603.12538v1
Date: Fri, 13 Mar 2026 00:37:20 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.822142
Title: Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation
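Authors: Alaa Dalaq, Muzammil Behzad
Abstract summary: Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. The authors propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework.
Reference score (attention score computed independently by the site): 0.3437656066916039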
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
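The abstract names three mechanisms without giving formulas: reshaping token features into a spatial grid, adaptively weighting expert outputs through a lightweight router, and tuning only normalization and bias terms. The sketch below illustrates each idea in plain Python. It is not the authors' implementation: the softmax gate, the scalar "features", the function names, and the parameter-name patterns (`.norm`, `.bias`) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tokens_to_grid(tokens, height, width):
    """Reshape a flat, row-major list of H*W token features into an H x W grid."""
    if len(tokens) != height * width:
        raise ValueError("token count must equal height * width")
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def route_experts(feature, experts, gate_scores):
    """Combine expert outputs with softmax routing weights.

    Assumption: a dense (soft) router. The abstract only says that expert
    contributions are weighted adaptively, not how the weights are computed.
    """
    weights = softmax(gate_scores)
    return sum(w * expert(feature) for w, expert in zip(weights, experts))


def select_trainable(param_names):
    """Keep only normalization and bias parameters, leaving the rest frozen.

    Assumption: parameters follow a hypothetical naming scheme in which
    normalization layers contain '.norm' and bias terms end in '.bias'.
    """
    return [n for n in param_names if ".norm" in n or n.endswith(".bias")]


# Two toy "experts" acting on a scalar feature (hypothetical stand-ins
# for the expert transformations mentioned in the abstract).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x]
grid = tokens_to_grid([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], height=2, width=3)
refined = [[route_experts(v, experts, [0.0, 0.0]) for v in row] for row in grid]

names = [
    "blocks.0.norm1.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.qkv.bias",
    "blocks.0.mlp.fc1.weight",
]
trainable = select_trainable(names)
```

With equal gate scores the router reduces to a plain average of the expert outputs; in a real model the scores would presumably depend on the referring expression and the visual features, which this sketch does not attempt to model.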

Related papers

ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding [44.20713526887855]
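This paper proposes a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent parameter spaces. The method achieves state-of-the-art rendering fidelity with significantly fewer parameters while supporting intuitive, interactive, and physically plausible manipulation.
Paper reference translation (metadata) (2026-02-02T09:53:45Z)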
DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation [16.57245702815661]
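Open-vocabulary semantic segmentation aims to assign a label to every pixel in an image based on text labels. Existing approaches typically use vision-language models (VLMs) such as CLIP for dense prediction. This paper introduces DiSa, a novel saliency-aware foreground-background disentangled framework.
Paper reference translation (metadata) (2026-01-27T21:15:10Z)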
Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
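Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multimodal diffusion transformers introduce joint attention over image and text tokens, enabling richer and more scalable cross-modal alignment. The authors introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how semantic information propagates from text to image.
Paper reference translation (metadata) (2025-09-22T17:59:54Z)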
DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
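DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for the RRSIS task. The framework introduces two key innovations: a context perception adapter (CP-adapter) and a progressive cross-modal reasoning decoder (PCMRD).
Paper reference translation (metadata) (2025-06-23T02:38:56Z)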
Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
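The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within aerial images using a natural-language expression. To address the challenges of this task, the paper proposes an innovative framework called the Scale-wise Bidirectional Alignment Network (SBANet). The proposed method achieves superior performance compared with previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
Paper reference translation (metadata) (2025-01-01T14:24:04Z)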
Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
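The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the challenges of this task, a novel RRSIS framework called the cross-modal bidirectional interaction model (CroBIM) is proposed. To further advance RRSIS research, the authors construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
Paper reference translation (metadata) (2024-10-11T08:28:04Z)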
Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues [55.97779732051921]
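A new learning strategy is proposed to explicitly incorporate action unit (AU) cues into classifier training. The authors show that layer-wise interpretability can be improved without degrading classification performance.
Paper reference translation (metadata) (2024-02-01T02:13:49Z)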
GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding [101.32590239809113]
Generalized Perception NeRF (GP-NeRF) is a novel pipeline that makes widely used segmentation models and NeRF work together within a unified framework. The paper proposes two self-distillation mechanisms: the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss.
Paper reference translation (metadata) (2023-11-20T15:59:41Z)

The related papers list is generated automatically from the titles and abstracts of papers on this site.

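The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.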