SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2512.01148v1
- Date: Sun, 30 Nov 2025 23:54:54 GMT
- Title: SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
- Authors: Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang
- Abstract summary: We show that pre-trained vision-language models (VLMs) struggle to unify and learn multiple social perception tasks simultaneously. We propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence.
- Score: 34.928133808112925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.
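The abstract attributes part of the degradation to gradient conflict between social perception tasks. As an illustration only (not the authors' released code), gradient conflict is commonly measured as the cosine similarity between per-task gradient vectors on shared parameters, with negative values indicating that two tasks pull the parameters in opposing directions; the task names below are hypothetical:

```python
import numpy as np

def gradient_conflict(grads: dict) -> dict:
    """Pairwise cosine similarity between per-task gradient vectors.

    A negative cosine means two tasks push the shared parameters in
    opposing directions, i.e. they exhibit gradient conflict.
    """
    names = sorted(grads)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ga, gb = grads[a].ravel(), grads[b].ravel()
            cos = float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb)))
            out[(a, b)] = cos
    return out

# Toy example with made-up task names: exactly opposed gradients
# give a cosine of -1.0, the strongest possible conflict.
conflicts = gradient_conflict({
    "gaze": np.array([1.0, 0.0]),
    "relation": np.array([-1.0, 0.0]),
})
print(conflicts[("gaze", "relation")])  # -1.0
```

In practice the gradient vectors would come from backpropagating each task's loss through the shared encoder; here plain NumPy arrays stand in for those gradients.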
Related papers
- Interpretable Debiasing of Vision-Language Models for Social Fairness [55.85977929985967]
We introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in Vision-Language models. We train SAEs on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
arXiv Detail & Related papers (2026-02-27T13:37:11Z) - Neural Synchrony Between Socially Interacting Language Models [52.74586779814636]
Large language models (LLMs) are widely accepted as powerful approximations of human behavior. It remains controversial whether they can be meaningfully compared to human social minds.
arXiv Detail & Related papers (2026-02-19T20:33:54Z) - MARO: Learning Stronger Reasoning from Social Interaction [7.77506109184819]
Multi-Agent Reward Optimization (MARO) is a method that enables large language models to acquire stronger reasoning abilities. Experimental results demonstrate that MARO achieves significant improvements in social reasoning capabilities.
arXiv Detail & Related papers (2026-01-18T09:10:08Z) - SVBench: Evaluation of Video Generation Models on Social Reasoning [35.06131184286366]
We introduce the first benchmark for social reasoning in video generation. We develop a fully training-free agent-based pipeline that distills the reasoning mechanism of each experiment. We conduct the first large-scale study across seven state-of-the-art video generation systems.
arXiv Detail & Related papers (2025-12-25T04:44:59Z) - SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models [41.68365456601248]
We introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms.
arXiv Detail & Related papers (2025-05-29T17:47:36Z) - Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? [48.41029452721923]
Vision Language Models (VLMs) are impressive at visual question answering and image captioning, but they underperform on multi-step visual reasoning, giving rise to perceptions of modality imbalance or brittleness. We introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning.
arXiv Detail & Related papers (2025-01-05T21:36:38Z) - From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition [59.57095498284501]
We propose a novel approach that recognizes Contextual Social Relationships (ConSoR) from a social cognitive perspective.
We construct social-aware descriptive language prompts with social relationships for each image.
Impressively, ConSoR outperforms previous methods with a 12.2% gain on the People-in-Social-Context (PISC) dataset and a 9.8% increase on the People-in-Photo-Album (PIPA) benchmark.
arXiv Detail & Related papers (2024-06-12T16:02:28Z) - SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z) - Training Socially Aligned Language Models on Simulated Social Interactions [99.39979111807388]
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
arXiv Detail & Related papers (2023-05-26T14:17:36Z) - Social Processes: Self-Supervised Forecasting of Nonverbal Cues in Social Conversations [22.302509912465077]
We take the first step in the direction of a bottom-up self-supervised approach in the domain of social human interactions.
We formulate the task of Social Cue Forecasting to leverage the larger amount of unlabeled low-level behavior cues.
We propose the Social Process (SP) models--socially aware sequence-to-sequence (Seq2Seq) models within the Neural Process (NP) family.
arXiv Detail & Related papers (2021-07-28T18:01:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.