ViLLA: Fine-Grained Vision-Language Representation Learning from
Real-World Data
- URL: http://arxiv.org/abs/2308.11194v1
- Date: Tue, 22 Aug 2023 05:03:09 GMT
- Title: ViLLA: Fine-Grained Vision-Language Representation Learning from
Real-World Data
- Authors: Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari,
Curtis Langlotz
- Abstract summary: Vision-language models (VLMs) are generally trained on datasets consisting of image-caption pairs obtained from the web.
Real-world multimodal datasets, such as healthcare data, are significantly more complex.
ViLLA is trained to capture fine-grained region-attribute relationships from complex datasets.
- Score: 8.905439446173503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained
on datasets consisting of image-caption pairs obtained from the web. However,
real-world multimodal datasets, such as healthcare data, are significantly more
complex: each image (e.g. X-ray) is often paired with text (e.g. physician
report) that describes many distinct attributes occurring in fine-grained
regions of the image. We refer to these samples as exhibiting high pairwise
complexity, since each image-text pair can be decomposed into a large number of
region-attribute pairings. The extent to which VLMs can capture fine-grained
relationships between image regions and textual attributes when trained on such
data has not been previously evaluated. The first key contribution of this work
is to demonstrate through systematic evaluations that as the pairwise
complexity of the training dataset increases, standard VLMs struggle to learn
region-attribute relationships, exhibiting performance degradations of up to
37% on retrieval tasks. In order to address this issue, we introduce ViLLA as
our second key contribution. ViLLA, which is trained to capture fine-grained
region-attribute relationships from complex datasets, involves two components:
(a) a lightweight, self-supervised mapping model to decompose image-text
samples into region-attribute pairs, and (b) a contrastive VLM to learn
representations from generated region-attribute pairs. We demonstrate with
experiments across four domains (synthetic, product, medical, and natural
images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks,
such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP
points on LVIS) and retrieval (up to 14.2 R-Precision points).
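As a rough illustration of component (b), the sketch below shows a CLIP-style symmetric contrastive objective applied to region-attribute pairs. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name RegionAttributeVLM, the linear projection heads, the feature dimensions, and the 1:1 region-attribute pairing are all illustrative placeholders standing in for the output of the paper's self-supervised mapping model (component (a)).

```python
# Hypothetical sketch of ViLLA's contrastive step (component (b)):
# a CLIP-style symmetric contrastive objective over region-attribute pairs.
# All names and dimensions here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttributeVLM(nn.Module):
    def __init__(self, region_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for a region (image-crop) encoder head and an attribute (text) encoder head.
        self.region_proj = nn.Linear(region_dim, embed_dim)
        self.attr_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, region_feats, attr_feats):
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        a = F.normalize(self.attr_proj(attr_feats), dim=-1)
        # Cosine-similarity logits between every region and every attribute.
        return self.logit_scale.exp() * r @ a.t()

def contrastive_loss(logits):
    # Symmetric InfoNCE: the i-th region is paired with the i-th attribute.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for features of the
# region-attribute pairs produced by the mapping model.
model = RegionAttributeVLM()
region_feats = torch.randn(8, 512)   # 8 decomposed image regions
attr_feats = torch.randn(8, 512)     # 8 matched textual attributes
loss = contrastive_loss(model(region_feats, attr_feats))
loss.backward()
```

The symmetric cross-entropy over the similarity matrix is the standard CLIP-style objective; per the abstract, ViLLA's distinguishing step is that the positives are region-attribute pairs generated by the mapping model rather than whole image-caption pairs.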
Related papers
- RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models [18.984025219051404]
Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time.
We present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features.
arXiv Detail & Related papers (2024-11-06T18:25:00Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes: it extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLMs for scaling up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [21.17975741743583]
It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions can significantly enhance zero-shot performance.
In this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image.
arXiv Detail & Related papers (2024-06-05T04:08:41Z)
- RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap) comprising 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z)
- JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we devise four benchmarks to assess the comprehension of generated images.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [76.4183572058063]
We present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks.
The dataset has been point-wise annotated with both hierarchical and instance-based labels.
We formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies.
arXiv Detail & Related papers (2020-08-11T19:10:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.