Related papers: Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

URL: http://arxiv.org/abs/2305.06343v2
Date: Tue, 24 Oct 2023 21:40:00 GMT
Title: Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Authors: Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson
Abstract summary: We show that it is possible to improve vision and language models (VLMs) when learning from scene graphs (SGs). For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions. Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in ZS capabilities.
Score: 79.64891686479213
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models is time-consuming and costly, and thus cannot be used on a large scale. Here we ask whether small SG datasets can provide sufficient information for enhancing structured understanding of pretrained VLMs. We show that it is indeed possible to improve VLMs when learning from SGs by integrating components that incorporate structured information into both visual and textual representations. For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene. Our method improves the performance of several popular VLMs on multiple VL datasets with only a mild degradation in ZS capabilities.

Related papers

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG). RAG introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FINECAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels. We introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which introduces the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z)
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models [9.936172224069036]
We introduce a Scene Graph Expression (SGE) module in large vision-language models (VLMs). The SGE module extracts and structurally expresses the complex semantic information within images. Experiments demonstrate that integrating our SGE module significantly enhances the VLM's performance in vision-language tasks.
arXiv Detail & Related papers (2024-08-29T02:43:20Z)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. These models have been shown to be highly capable, but also lacking some basic visual understanding skills. This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
Compositional image understanding remains a rather difficult task due to the object bias present in training data. We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses. Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method. It consistently improves 13 existing strong video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo). DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs). We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions [126.3136109870403]
We introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C). VPG-C infers and completes the missing details essential for comprehending demonstrative instructions. We build DEMON, a comprehensive benchmark for demonstrative instruction understanding.
arXiv Detail & Related papers (2023-08-08T09:32:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.