Fugu-MT Paper Translation (Abstract): The Telephone Game: Evaluating Semantic Drift in Unified Models

Paper overview: The Telephone Game: Evaluating Semantic Drift in Unified Models

arxiv url: http://arxiv.org/abs/2509.04438v1
Date: Thu, 04 Sep 2025 17:53:52 GMT
Status: Translation complete
Last updated in system: 2025-09-05 20:21:10.247899
Title: The Telephone Game: Evaluating Semantic Drift in Unified Models
Authors: Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
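Abstract summary: The Unified Consistency Framework for Unified Models (UCF-UM) alternates I2T and T2I over multiple generations to quantify semantic drift. The results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations.
Reference score (attention score computed independently by this site): 41.650904633974584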
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the rate of semantic decay; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. UCF-UM reveals substantial variation in cross-modal stability: some models, like BAGEL, maintain semantics over many alternations, whereas others, like Vila-u, drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
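Illustration: the cyclic protocol described in the abstract (alternate T2I and I2T, then measure how far each generation has moved from the starting point) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not give the exact formulas for MCD, SDR, or MGG, so the code only computes a plain per-generation cosine similarity to the starting caption and its mean. The names `telephone_chain`, `similarity_to_start`, `toy_t2i`, `toy_i2t`, and `toy_embed` are hypothetical, the toy functions stand in for a real unified model and a real embedding model, and starting the chain from text rather than from an image is also an assumption.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def telephone_chain(
    start_text: str,
    t2i: Callable[[str], Any],
    i2t: Callable[[Any], str],
    generations: int,
) -> List[str]:
    """Alternate text-to-image and image-to-text; return the caption at every generation."""
    texts = [start_text]
    for _ in range(generations):
        image = t2i(texts[-1])      # render the current caption
        texts.append(i2t(image))    # describe the rendered image again
    return texts


def similarity_to_start(texts: List[str], embed: Callable[[str], Vector]) -> List[float]:
    """Cosine similarity of each later caption to the starting caption."""
    reference = embed(texts[0])
    return [cosine(reference, embed(t)) for t in texts[1:]]


# --- Toy stand-ins (hypothetical): a "model" that loses one word per cycle. ---
VOCAB = ["a", "red", "cat", "on", "mat"]


def toy_t2i(text: str) -> List[str]:
    """Pretend image: the caption's words with the last one dropped."""
    return text.split()[:-1]


def toy_i2t(image: List[str]) -> str:
    """Pretend captioner: reads the words back out of the pretend image."""
    return " ".join(image)


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a fixed vocabulary."""
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]


texts = telephone_chain("a red cat on a mat", toy_t2i, toy_i2t, generations=3)
sims = similarity_to_start(texts, toy_embed)
mean_similarity = sum(sims) / len(sims)
```

With these toy stand-ins the similarity to the starting caption falls with each generation, which is the kind of decay the framework is meant to expose. In a real setting `t2i` and `i2t` would wrap the same unified model, and `embed` would be a learned text or image encoder.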

Related papers