Fugu-MT Paper Translation (Summary): Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Paper summary: Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

arxiv url: http://arxiv.org/abs/2604.21728v2
Date: Mon, 27 Apr 2026 13:48:45 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:06.93549
Title: Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
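Abstract summary: We present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data. Experiments on multiple image corruption and domain-shift benchmarks show that Ramen achieves strong and consistent performance.
Reference score (independently computed attention score): 45.20212930761406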
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .
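The retrieval-and-aggregation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that): the class name `EmbeddingGradientCache`, the use of cosine similarity as a stand-in for domain consistency, the per-class cap as a stand-in for prediction balance, and plain averaging of the stored gradients are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


class EmbeddingGradientCache:
    """Hypothetical cache holding one (embedding, gradient, predicted class)
    triple per past test sample. All names here are illustrative."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient, predicted_class):
        self.entries.append((list(embedding), list(gradient), predicted_class))

    def select(self, query_embedding, per_class):
        """Return gradients of the `per_class` most similar entries for each
        predicted class (similarity ~ domain consistency, cap ~ balance)."""
        by_class = defaultdict(list)
        for emb, grad, cls in self.entries:
            by_class[cls].append((cosine(query_embedding, emb), grad))
        selected = []
        for scored in by_class.values():
            scored.sort(key=lambda pair: pair[0], reverse=True)
            selected.extend(grad for _, grad in scored[:per_class])
        return selected

    def aggregated_gradient(self, query_embedding, per_class):
        """Average the selected cached gradients; no new forward or backward
        pass is run. Returns None when the cache is empty."""
        grads = self.select(query_embedding, per_class)
        if not grads:
            return None
        dim = len(grads[0])
        return [sum(g[i] for g in grads) / len(grads) for i in range(dim)]


cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], [1.0, 1.0], 0)      # close to the query, class 0
cache.add([0.0, 1.0], [100.0, 100.0], 0)  # far from the query, class 0
cache.add([1.0, 0.1], [5.0, -1.0], 1)     # close to the query, class 1
update = cache.aggregated_gradient([1.0, 0.0], per_class=1)
```

In this toy run the distant class-0 entry is left out, and each predicted class contributes one gradient to the average. A real system would presumably also need a cache size limit and a way to handle gradients that become stale after the model is updated; the abstract does not describe those details.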

Related papers