Fugu-MT Paper Translation (Summary): SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

Paper summary: SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

arxiv url: http://arxiv.org/abs/2512.01148v1
Date: Sun, 30 Nov 2025 23:54:54 GMT
Status: Translation complete
Last updated in system: 2025-12-02 19:46:34.604171
Title: SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
Authors: Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang
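Abstract summary: Pre-trained vision-language models (VLMs) struggle to unify and learn multiple social perception tasks simultaneously. The authors propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. The findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence.
Reference score (attention score computed independently by this site): 34.928133808112925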
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.
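The abstract mentions "gradient conflict analysis" but does not say how conflict is measured. As a rough illustration only, the sketch below uses one common convention, which is an assumption here and not necessarily the authors' method: two tasks conflict when the cosine similarity between their loss gradients, taken with respect to the same shared parameters, is negative. The function names and example gradient values are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(g_a: Sequence[float], g_b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened gradient vectors."""
    if len(g_a) != len(g_b):
        raise ValueError("gradients must have the same length")
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero gradient is treated as neither aligned nor conflicting.
        return 0.0
    return dot / (norm_a * norm_b)


def gradients_conflict(g_a: Sequence[float], g_b: Sequence[float]) -> bool:
    """True when the two task gradients point in opposing directions."""
    return cosine_similarity(g_a, g_b) < 0.0


# Hypothetical gradients of three task losses w.r.t. the same shared parameters.
grad_task_a = [1.0, 2.0, -1.0]
grad_task_b = [-1.0, -1.5, 0.5]   # roughly opposite to task A
grad_task_c = [2.0, 4.0, -2.0]    # same direction as task A

print(gradients_conflict(grad_task_a, grad_task_b))
print(gradients_conflict(grad_task_a, grad_task_c))
```

In a real multi-task setup the vectors would be the flattened gradients of each task's loss on a shared module. A negative cosine means that a step which helps one task tends to hurt the other, which is one way negative transfer can arise.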


Related papers