VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence
- URL: http://arxiv.org/abs/2504.02227v1
- Date: Thu, 03 Apr 2025 02:48:21 GMT
- Title: VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence
- Authors: Hao Li, Hao Fei, Zechao Hu, Zhengwei Yang, Zheng Wang
- Abstract summary: Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence level. We propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model.
- Score: 22.086567828557683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on the language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model's interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal-language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.
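The abstract does not detail the frame-sampling strategy, but the idea of selecting question-relevant frames can be sketched as follows. This is a minimal illustration assuming precomputed frame and query embeddings from some vision-language encoder; all function and variable names are hypothetical and are not VEGAS's actual implementation.

```python
import numpy as np

def sample_relevant_frames(frame_embeddings, query_embedding, num_frames=8):
    """Select the frames most relevant to the question.

    frame_embeddings: (T, D) array of per-frame visual embeddings.
    query_embedding:  (D,)  array embedding the question text.
    Returns indices of the selected frames in temporal order.

    NOTE: illustrative sketch only, not the paper's actual strategy.
    """
    # Cosine similarity between every frame and the query.
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = f @ q

    # Keep the top-k frames, then restore temporal order so the
    # language model still sees a coherent clip.
    top_k = np.argsort(scores)[-num_frames:]
    return np.sort(top_k)

# Toy usage with random embeddings standing in for a real visual encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 512))   # 64 frames, 512-d features
query = rng.normal(size=512)
print(sample_relevant_frames(frames, query, num_frames=8))
```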
Related papers
- VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity [34.29409506366145]
VERIFY is a benchmark designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. We propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns.
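As a toy illustration of scoring the reasoning path rather than only the final answer, one could compute the overlap between a model's stated steps and a human-annotated path. VERIFY's actual metrics are more sophisticated; everything below is assumed for illustration.

```python
def reasoning_fidelity(model_steps, annotated_steps):
    """Toy fidelity score: fraction of annotated reasoning steps that the
    model's explanation covers, order-insensitive.

    Illustrative only; VERIFY's metrics are richer than step overlap.
    """
    annotated = set(annotated_steps)
    covered = annotated & set(model_steps)
    return len(covered) / len(annotated) if annotated else 0.0

print(reasoning_fidelity(
    ["count shapes", "compare rotation"],
    ["count shapes", "compare rotation", "check symmetry"]))  # ~0.67
```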
arXiv Detail & Related papers (2025-03-14T16:26:11Z) - Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data [35.229595049396245]
We propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers.
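A rough sketch of what a reward-model-free filtering step could look like, assuming each candidate answer cites the visual features it relies on; the criterion and data structures here are illustrative assumptions, not the paper's.

```python
def filter_interpretable_answers(candidates, verified_features):
    """Keep only candidate answers whose cited visual features all appear
    in a human-verifiable feature set (no reward model involved).

    candidates:        list of dicts {"answer": str, "cited_features": set[str]}
    verified_features: set of feature names known to be present in the image.

    Simplified stand-in for the selection step described in the abstract.
    """
    kept = [c for c in candidates if c["cited_features"] <= verified_features]
    # Prefer answers that ground themselves in more verified evidence.
    kept.sort(key=lambda c: len(c["cited_features"]), reverse=True)
    return kept

candidates = [
    {"answer": "The bird is a cardinal.", "cited_features": {"red plumage", "crest"}},
    {"answer": "The bird is a robin.",    "cited_features": {"blue wings"}},
]
print(filter_interpretable_answers(candidates, {"red plumage", "crest", "short beak"}))
```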
arXiv Detail & Related papers (2025-02-19T19:05:45Z) - Visual Agents as Fast and Slow Thinkers [88.1404921693082]
We introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data.
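A minimal sketch of confidence-based routing between a fast and a slow mode. FaST's switch adapter is a learned component, so the fixed threshold and the callables below are purely illustrative assumptions.

```python
def answer_with_fast_slow_switch(question, image, fast_model, slow_pipeline,
                                 confidence_threshold=0.8):
    """Route a query to System-1 (fast) or System-2 (slow) processing.

    fast_model(question, image)    -> (answer, confidence in [0, 1])
    slow_pipeline(question, image) -> answer after deliberate grounding

    Illustrative only: the real switch is learned, not a fixed threshold.
    """
    answer, confidence = fast_model(question, image)
    if confidence >= confidence_threshold:
        return answer                      # System 1: direct response
    return slow_pipeline(question, image)  # System 2: deliberate reasoning

# Toy usage with stub models.
fast = lambda q, img: ("a dog", 0.55)
slow = lambda q, img: "a dog partially occluded by a fence"
print(answer_with_fast_slow_switch("What is in the image?", None, fast, slow))
```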
arXiv Detail & Related papers (2024-08-16T17:44:02Z) - From Feature Importance to Natural Language Explanations Using LLMs with RAG [4.204990010424084]
We introduce traceable question-answering, leveraging an external knowledge repository to inform responses of Large Language Models (LLMs).
This knowledge repository comprises contextual details regarding the model's output, containing high-level features, feature importance, and alternative probabilities.
We integrate four key characteristics - social, causal, selective, and contrastive - drawn from social science research on human explanations into a single-shot prompt, guiding the response generation process.
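A hedged sketch of how such a knowledge repository (prediction, feature importances, alternative probabilities) might be serialized into a single-shot prompt; field names and wording are assumptions, not the paper's actual prompt, which also encodes the social, causal, selective, and contrastive characteristics.

```python
def build_explanation_prompt(prediction, feature_importance, alternatives):
    """Serialize model-output context into a single prompt for an LLM.

    feature_importance: dict feature -> importance score (e.g. SHAP-style values)
    alternatives:       dict class label -> predicted probability
    """
    ranked = sorted(feature_importance.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = [f"The model predicted: {prediction}.",
             "Most influential features:"]
    lines += [f"- {name}: {score:+.2f}" for name, score in ranked]
    lines.append("Alternative outcomes and probabilities:")
    lines += [f"- {label}: {p:.0%}" for label, p in alternatives.items()]
    lines.append("Explain this prediction in plain language, contrasting it "
                 "with the most likely alternative.")
    return "\n".join(lines)

print(build_explanation_prompt(
    "loan approved",
    {"income": 0.42, "credit_history_length": 0.31, "recent_defaults": -0.18},
    {"loan approved": 0.74, "loan denied": 0.26}))
```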
arXiv Detail & Related papers (2024-07-30T17:27:20Z) - Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation [34.45251681923171]
This paper presents a novel approach to developing large Vision-and-Language Models (VLMs).
We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process.
The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge.
arXiv Detail & Related papers (2024-01-18T14:21:56Z) - DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding [60.84356161106069]
We study the soundness of Social-IQ, a dataset of multiple-choice questions on videos of complex social interactions.
Our analysis reveals that Social-IQ contains substantial biases, which can be exploited by a moderately strong language model.
We introduce DeSIQ, a new challenging dataset, constructed by applying simple perturbations to Social-IQ.
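One simple perturbation in this spirit, swapping a question's distractors for correct answers borrowed from other questions so that language-only shortcuts stop working, might look like the sketch below; the actual DeSIQ construction is described in the paper, and the data format here is assumed.

```python
import random

def perturb_mcq(questions, seed=0):
    """Replace each question's distractors with correct answers borrowed
    from *other* questions, removing easy lexical give-aways.

    questions: list of dicts {"q": str, "correct": str, "distractors": [str]}
    Illustrative perturbation only.
    """
    rng = random.Random(seed)
    pool = [q["correct"] for q in questions]
    perturbed = []
    for i, q in enumerate(questions):
        others = pool[:i] + pool[i + 1:]
        new_distractors = rng.sample(others, k=min(len(q["distractors"]), len(others)))
        perturbed.append({**q, "distractors": new_distractors})
    return perturbed

qs = [
    {"q": "Why does she smile?", "correct": "She is amused.",
     "distractors": ["She is angry.", "She is asleep."]},
    {"q": "Why does he leave?", "correct": "He is late for work.",
     "distractors": ["He won a prize.", "He is hungry."]},
]
print(perturb_mcq(qs))
```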
arXiv Detail & Related papers (2023-10-24T06:21:34Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
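A minimal sketch of a see-think-confirm loop, with the three stages abstracted as callables; the signatures are assumptions for illustration rather than IPVR's real interface.

```python
def ipvr_answer(image, question, see, think, confirm, max_rounds=3):
    """Interactive prompting loop in the spirit of see -> think -> confirm.

    see(image)                -> textual description of visual evidence
    think(question, evidence) -> (candidate_answer, rationale)
    confirm(image, rationale) -> True if the rationale is visually supported

    Function signatures are illustrative assumptions, not the paper's API.
    """
    evidence = see(image)
    answer = None
    for _ in range(max_rounds):
        answer, rationale = think(question, evidence)
        if confirm(image, rationale):
            return answer
        # Rationale not visually supported: note it and look again.
        evidence = see(image) + " Rejected rationale: " + rationale
    return answer
```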
arXiv Detail & Related papers (2023-01-12T18:59:50Z) - elBERto: Self-supervised Commonsense Learning for Question Answering [131.51059870970616]
We propose a Self-supervised Bidirectional Representation Learning of Commonsense framework, which is compatible with off-the-shelf QA model architectures.
The framework comprises five self-supervised tasks to force the model to fully exploit the additional training signals from contexts containing rich commonsense.
elBERto achieves substantial improvements on out-of-paragraph and no-effect questions where simple lexical similarity comparison does not help.
arXiv Detail & Related papers (2022-03-17T16:23:45Z) - COSMO: Conditional SEQ2SEQ-based Mixture Model for Zero-Shot Commonsense Question Answering [50.65816570279115]
Identification of the implicit causes and effects of a social context is the driving capability which can enable machines to perform commonsense reasoning.
Current approaches in this realm lack the ability to perform commonsense reasoning upon facing an unseen situation.
We present Conditional SEQ2SEQ-based Mixture model (COSMO), which provides us with the capabilities of dynamic and diverse content generation.
arXiv Detail & Related papers (2020-11-02T07:08:19Z) - Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
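A schematic of an iterative read-update-control loop over multiple graph views; the callables and memory representation are placeholders, since the original module is learned end-to-end.

```python
def cross_modal_reason(graphs, question_embedding, read, update, control, steps=4):
    """Iterative Graph-based Read / Update / Control reasoning sketch.

    graphs:  dict view -> graph, e.g. {"visual": ..., "semantic": ..., "factual": ...}
    read(graph, memory)      -> evidence vector gathered from one view
    update(memory, evidence) -> new memory state
    control(memory)          -> which view to read from next

    All callables are placeholders for learned components.
    """
    memory = question_embedding
    for _ in range(steps):
        view = control(memory)
        evidence = read(graphs[view], memory)
        memory = update(memory, evidence)
    return memory  # the final state would be decoded into an answer
```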
arXiv Detail & Related papers (2020-08-31T23:25:01Z) - Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning" [49.76230210108583]
We propose a framework to isolate and evaluate the reasoning aspect of visual question answering (VQA) separately from its perception.
We also propose a novel top-down calibration technique that allows the model to answer reasoning questions even with imperfect perception.
On the challenging GQA dataset, this framework is used to perform in-depth, disentangled comparisons between well-known VQA models.
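To make the perception/reasoning split concrete, here is a toy example that runs a symbolic program over an already-extracted scene representation; the scene format and operations are invented for illustration and are far simpler than GQA-style programs.

```python
def answer_from_scene(scene, program):
    """Run a symbolic reasoning program over a (possibly imperfect) scene
    representation, keeping 'reasoning' separate from 'perception'.

    scene:   list of object dicts, e.g. {"name": "cup", "color": "red"}
    program: list of (op, arg) steps; a toy stand-in for real programs.
    """
    state = scene
    for op, arg in program:
        if op == "filter_color":
            state = [o for o in state if o.get("color") == arg]
        elif op == "filter_name":
            state = [o for o in state if o.get("name") == arg]
        elif op == "exists":
            return len(state) > 0
        elif op == "count":
            return len(state)
    return state

scene = [{"name": "cup", "color": "red"}, {"name": "plate", "color": "white"}]
print(answer_from_scene(scene, [("filter_color", "red"), ("count", None)]))  # -> 1
```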
arXiv Detail & Related papers (2020-06-20T08:48:29Z)