From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition
- URL: http://arxiv.org/abs/2507.14686v2
- Date: Tue, 29 Jul 2025 16:42:06 GMT
- Title: From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition
- Authors: Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew
- Abstract summary: Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR). We explore transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities. We propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model.
- Score: 14.16399307533106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we explore transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
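As a rough illustration of the distill-and-align recipe described in the abstract, the sketch below shows two of its ingredients in PyTorch: learnable scene-aware and instance-perception prompt tokens that are pulled toward a positive rationale embedding and pushed away from negative ones, and a feature-space loss that distills the aligned teacher representation into the student. Every name, shape, and loss choice here is an illustrative assumption; the actual JRG and NMPA modules of MIPD are more involved than this.

```python
# Minimal sketch of negative-guided prompt alignment + feature distillation.
# NOT the authors' implementation; module names, dimensions, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptDistiller(nn.Module):
    def __init__(self, dim: int = 512, n_prompts: int = 8):
        super().__init__()
        # Learnable scene-aware and instance-perception prompt tokens (assumed shapes).
        self.scene_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.instance_prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)  # project student features into the teacher space

    def align_loss(self, prompts, pos_rationale, neg_rationales, tau: float = 0.07):
        """Negative-guided alignment: pull the pooled prompts toward the positive
        rationale embedding and away from negative ones (InfoNCE-style)."""
        p = F.normalize(prompts.mean(dim=0, keepdim=True), dim=-1)    # (1, d)
        pos = F.normalize(pos_rationale, dim=-1)                      # (1, d)
        neg = F.normalize(neg_rationales, dim=-1)                     # (k, d)
        logits = torch.cat([p @ pos.t(), p @ neg.t()], dim=-1) / tau  # (1, 1 + k)
        return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))  # positive = index 0

    def distill_loss(self, student_feats, teacher_feats):
        """Distill aligned teacher knowledge into the student in feature space."""
        return F.mse_loss(self.proj(student_feats), teacher_feats.detach())

# Usage with random stand-ins for rationale, teacher, and student features (512-d).
distiller = PromptDistiller()
loss = (distiller.align_loss(distiller.scene_prompts, torch.randn(1, 512), torch.randn(5, 512))
        + distiller.align_loss(distiller.instance_prompts, torch.randn(1, 512), torch.randn(5, 512))
        + distiller.distill_loss(torch.randn(4, 512), torch.randn(4, 512)))
loss.backward()
```

In the paper's framework the positive and negative rationales would come from the LLM-based JRG and the teacher features from the MLLM; here they are random tensors so that the snippet runs standalone.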
Related papers
- MLLM-CL: Continual Learning for Multimodal Large Language Models [62.90736445575181]
We introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning. Our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.
arXiv Detail & Related papers (2025-06-05T17:58:13Z)
- Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving Cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs with multimodal knowledge. We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Towards Modality Generalization: A Benchmark and Prospective Analysis [68.20973671493203]
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
- Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation [15.058687283978077]
Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios.
Existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments.
We propose a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations.
arXiv Detail & Related papers (2024-03-06T02:01:38Z)
- Deep Multimodal Fusion for Generalizable Person Re-identification [15.250738959921872]
DMF is a Deep Multimodal Fusion network for the generalizable person re-identification task.
Rich semantic knowledge is introduced to assist in feature representation learning during the pre-training stage.
A realistic dataset is adopted to fine-tune the pre-trained model for distribution alignment with the real world.
arXiv Detail & Related papers (2022-11-02T07:42:48Z)
- From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student (see the toy sketch after this list).
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
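The one-line summary of MUKI above does not spell out its algorithm, so the following is only a hedged toy sketch of the general idea of merging several specialist teacher-PLMs into one student: each teacher's soft predictions are weighted by an uncertainty proxy (predictive entropy, an assumption here, not necessarily MUKI's choice) before being combined into a single supervision signal over the union label space.

```python
# Toy sketch of uncertainty-weighted multi-teacher knowledge integration.
# Illustrative only: entropy-based weighting and label-space concatenation are
# assumptions, not the MUKI paper's exact procedure.
import torch
import torch.nn.functional as F

def integrate_teachers(teacher_logits: list) -> torch.Tensor:
    """Merge per-teacher logits into one soft target, trusting confident
    (low-entropy) teachers more than uncertain ones."""
    merged = []
    for logits in teacher_logits:                              # each: (batch, n_classes_i)
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1, keepdim=True)
        merged.append(probs / (1.0 + entropy))                 # down-weight uncertain teachers
    # Teachers specialize in different classification problems, so concatenating
    # their weighted distributions gives one target over the union label space.
    target = torch.cat(merged, dim=-1)
    return target / target.sum(-1, keepdim=True)               # renormalize to a distribution

# Example: two teachers covering 3 and 4 classes, batch of 2.
soft_target = integrate_teachers([torch.randn(2, 3), torch.randn(2, 4)])
student_logits = torch.randn(2, 7, requires_grad=True)
loss = F.kl_div(student_logits.log_softmax(-1), soft_target, reduction="batchmean")
loss.backward()
```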