Fugu-MT Paper Translation (Abstract): The Telephone Game: Evaluating Semantic Drift in Unified Models

Paper overview: The Telephone Game: Evaluating Semantic Drift in Unified Models

arxiv url: http://arxiv.org/abs/2509.04438v2
Date: Mon, 06 Oct 2025 17:49:39 GMT
Status: Translation complete
Last updated in system: 2025-10-07 14:28:10.651256
Title: The Telephone Game: Evaluating Semantic Drift in Unified Models
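Title (reference translation): The Telephone Game: Evaluating Semantic Drift in Unified Models
Authors: Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Abstract summary: Employing a single unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. FID and GenEval target T2I, while MME and MMBench target I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, or whether semantic meaning is preserved.
Reference score (independently calculated attention level): 41.650904633974584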
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, or whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. SDP reveals substantial variation in cross-modal stability: some models, like BAGEL, maintain semantic meaning over many alternations, whereas others, like VILA-U, drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
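Abstract (reference translation): Employing a single unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. FID and GenEval target T2I, while MME and MMBench target I2T. Can a model that "understands" a concept also "render" it, and is semantic meaning preserved when cycling between image and text modalities? We therefore propose the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate it on seven recent models. Models such as BAGEL maintain meaning over many alternations, whereas models such as VILA-U drift quickly despite strong single-pass scores. These results show that SDP is a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models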
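
The cyclic protocol described in the abstract, alternating T2I and I2T and then measuring how far later generations stray from the starting point, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `run_cycle`, `mean_similarity_to_start`, and the toy `toy_t2i`, `toy_i2t`, and `toy_embed` stand-ins are hypothetical and are not taken from the paper or its repository. The score computed below is a plain average of cosine similarities to the initial caption; the abstract does not give the exact definition of MCD, so this should not be read as that metric.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_cycle(
    caption: str,
    t2i: Callable[[str], Any],
    i2t: Callable[[Any], str],
    generations: int,
) -> List[str]:
    """Alternate T2I and I2T, returning the initial caption plus one caption per generation."""
    captions = [caption]
    for _ in range(generations):
        image = t2i(captions[-1])      # text -> image
        captions.append(i2t(image))    # image -> text
    return captions


def mean_similarity_to_start(
    captions: List[str], embed: Callable[[str], Vector]
) -> float:
    """Average cosine similarity between the initial caption and each later generation."""
    if len(captions) < 2:
        raise ValueError("need at least one generation after the initial caption")
    start = embed(captions[0])
    sims = [cosine_similarity(start, embed(c)) for c in captions[1:]]
    return sum(sims) / len(sims)


# Toy stand-ins (hypothetical): a real run would call a unified model and a text encoder.
VOCAB = ["a", "red", "car", "on", "road"]


def toy_t2i(text: str) -> dict:
    """Pretend to render an image; the 'image' just remembers its prompt."""
    return {"rendered_from": text}


def toy_i2t(image: dict) -> str:
    """Pretend to caption an image, losing the last word each time to simulate drift."""
    return " ".join(image["rendered_from"].split()[:-1])


def toy_embed(text: str) -> List[float]:
    """Bag-of-words count vector over a tiny fixed vocabulary."""
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]


if __name__ == "__main__":
    chain = run_cycle("a red car on road", toy_t2i, toy_i2t, generations=2)
    print(chain)
    print(mean_similarity_to_start(chain, toy_embed))
```

A score near 1 under this toy measure means later captions stay close to the original, while a falling score indicates drift. A real evaluation would also need to cover chains that start from an image rather than from text.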
