Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
- URL: http://arxiv.org/abs/2307.11558v1
- Date: Fri, 21 Jul 2023 13:06:02 GMT
- Title: Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
- Authors: Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li
- Abstract summary: Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
- Score: 74.72663425217522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding (VG) aims to establish fine-grained alignment between vision
and language. Ideally, it can be a testbed for vision-and-language models to
evaluate their understanding of the images and texts and their reasoning
abilities over their joint space. However, most existing VG datasets are
constructed using simple description texts, which do not require sufficient
reasoning over the images and texts. This has been demonstrated in a recent
study (Luo et al., 2022), where a simple LSTM-based text encoder without
pretraining can achieve state-of-the-art performance on mainstream VG datasets.
Therefore, in this paper, we propose a novel benchmark of Scene
Knowledge-guided Visual Grounding (SK-VG),
where the image content and referring expressions are not sufficient to ground
the target objects, forcing the models to have a reasoning ability on the
long-form scene knowledge. To perform this task, we propose two approaches that
accept the triple-type input (image, referring expression, and scene knowledge):
the first embeds the knowledge into the image features before the image-query
interaction, while the second leverages linguistic structure to assist in
computing the image-text matching. We conduct extensive experiments to analyze
these methods and show that the proposed approaches achieve promising results
but still leave room for improvement in both performance and interpretability.
The dataset and code are available at https://github.com/zhjohnchan/SK-VG.
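As a rough illustration of the first approach (embedding the scene knowledge into the image features before the image-query interaction), the following PyTorch sketch uses a cross-attention layer over knowledge tokens. The module layout, names, dimensions, and box head are assumptions for illustration only, not the released SK-VG implementation.

```python
# Hypothetical sketch: inject scene-knowledge embeddings into image features
# via cross-attention BEFORE the image/query interaction. Module and tensor
# names are illustrative, not taken from the SK-VG codebase.
import torch
import torch.nn as nn

class KnowledgeInjectedGrounder(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # image patches attend to knowledge tokens
        self.know_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # query tokens then interact with the knowledge-aware image features
        self.query_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Linear(dim, 4)  # predict (cx, cy, w, h)

    def forward(self, img_feats, know_feats, query_feats):
        # img_feats:   (B, N_patches, dim)  visual backbone output
        # know_feats:  (B, N_know, dim)     encoded long-form scene knowledge
        # query_feats: (B, N_query, dim)    encoded referring expression
        img_k, _ = self.know_attn(img_feats, know_feats, know_feats)
        img_feats = img_feats + img_k                      # knowledge-aware image features
        fused, _ = self.query_attn(query_feats, img_feats, img_feats)
        return self.box_head(fused[:, 0])                  # box from the first query token

# toy usage with random features
model = KnowledgeInjectedGrounder()
box = model(torch.randn(2, 196, 256), torch.randn(2, 64, 256), torch.randn(2, 16, 256))
print(box.shape)  # torch.Size([2, 4])
```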
Related papers
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates advanced performance over the state-of-the-arts with various metrics.
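As a rough sketch of what pixel-level masking as a data augmentation could look like (the mask ratio, patch granularity, and function name below are assumptions, not values from the paper):

```python
# Illustrative sketch of pixel-level masking as a data augmentation:
# randomly zero out a fraction of small pixel patches of the input image
# while keeping the caption and boxes unchanged.
import torch

def pixel_mask(images: torch.Tensor, mask_ratio: float = 0.3, patch: int = 16) -> torch.Tensor:
    """images: (B, C, H, W); returns a copy with roughly mask_ratio of patches zeroed."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=images.device) > mask_ratio).float()
    # upsample the patch-level keep mask back to pixel resolution
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return images * mask

augmented = pixel_mask(torch.rand(4, 3, 224, 224))
```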
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
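The general recipe of extending CLIP-style image-text contrastive learning with extra hard-negative captions can be sketched as follows; the loss below is a generic illustration, not the paper's exact intra-modal and ranking objectives.

```python
# Generic sketch of image-text contrastive learning extended with extra
# hard-negative captions (e.g., perturbed versions of the true captions).
# This illustrates the general recipe only, not the paper's exact loss.
import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, temperature=0.07):
    # img_emb, txt_emb, hard_txt_emb: (B, D) L2-normalized embeddings
    all_txt = torch.cat([txt_emb, hard_txt_emb], dim=0)        # (2B, D)
    logits_i2t = img_emb @ all_txt.t() / temperature           # (B, 2B)
    logits_t2i = txt_emb @ img_emb.t() / temperature           # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # image->text logits include both in-batch and hard-negative captions
    loss = F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets)
    return loss / 2

b, d = 8, 512
loss = contrastive_with_hard_negatives(
    F.normalize(torch.randn(b, d), dim=-1),
    F.normalize(torch.randn(b, d), dim=-1),
    F.normalize(torch.randn(b, d), dim=-1),
)
```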
arXiv Detail & Related papers (2023-06-15T03:26:28Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
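To make the global-versus-fine-grained distinction concrete, here is a generic sketch of the two similarity styles (pooled global cosine vs. token-to-patch alignment); it only illustrates the contrast and is not LOUPE's game-theoretic formulation.

```python
# Generic illustration of global vs. fine-grained image-text similarity.
# Global: compare pooled image/text vectors. Fine-grained: align each text
# token with its best-matching image patch and average the scores.
import torch
import torch.nn.functional as F

def global_similarity(patches, tokens):
    # patches: (B, P, D), tokens: (B, T, D)
    img = F.normalize(patches.mean(dim=1), dim=-1)
    txt = F.normalize(tokens.mean(dim=1), dim=-1)
    return (img * txt).sum(dim=-1)                       # (B,)

def fine_grained_similarity(patches, tokens):
    p = F.normalize(patches, dim=-1)
    t = F.normalize(tokens, dim=-1)
    sim = torch.einsum("btd,bpd->btp", t, p)             # token-to-patch similarities
    return sim.max(dim=-1).values.mean(dim=-1)           # best patch per token, averaged

patches, tokens = torch.randn(2, 196, 256), torch.randn(2, 12, 256)
print(global_similarity(patches, tokens), fine_grained_similarity(patches, tokens))
```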
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
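As a generic illustration of the composed-retrieval setup (not BOSS itself), a reference-image embedding can be fused with a modification-text embedding and ranked against gallery images; the additive fusion below is a placeholder assumption.

```python
# Generic composed image retrieval sketch: fuse the reference image with the
# modification text, then rank gallery images by cosine similarity. The
# additive fusion is a placeholder, not BOSS's bottom-up composition.
import torch
import torch.nn.functional as F

def compose(ref_img_emb, mod_txt_emb):
    # placeholder fusion: residual addition of the textual "edit" to the image
    return F.normalize(ref_img_emb + mod_txt_emb, dim=-1)

def retrieve(ref_img_emb, mod_txt_emb, gallery_embs, k=5):
    query = compose(ref_img_emb, mod_txt_emb)               # (D,)
    scores = F.normalize(gallery_embs, dim=-1) @ query      # (N,)
    return scores.topk(k).indices                           # indices of best matches

gallery = torch.randn(1000, 512)
top5 = retrieve(torch.randn(512), torch.randn(512), gallery)
```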
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
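The "fusion pushed deep into the backbone" idea can be pictured with a small sketch: a gated cross-attention branch added inside a backbone block instead of a separate fusion stack on top. The gating, names, and placement below are assumptions for illustration, not FIBER's released architecture.

```python
# Schematic sketch of fusion in the backbone: a cross-attention branch is
# inserted inside a (late) backbone block rather than stacked afterwards.
# The zero-initialized gate lets fusion start as an identity mapping.
import torch
import torch.nn as nn

class BlockWithCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # gate on the fusion branch

    def forward(self, x, other):
        x = x + self.self_attn(x, x, x)[0]                         # usual uni-modal block
        x = x + self.alpha * self.cross_attn(x, other, other)[0]   # fusion inside the block
        return x

img_block, txt_block = BlockWithCrossAttention(), BlockWithCrossAttention()
img, txt = torch.randn(2, 196, 256), torch.randn(2, 32, 256)
img, txt = img_block(img, txt), txt_block(txt, img)
```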
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)