Adversarial Testing for Visual Grounding via Image-Aware Property
Reduction
- URL: http://arxiv.org/abs/2403.01118v1
- Date: Sat, 2 Mar 2024 08:03:42 GMT
- Title: Adversarial Testing for Visual Grounding via Image-Aware Property
Reduction
- Authors: Zhiyuan Chang, Mingyang Li, Junjie Wang, Cheng Li, Boyu Wu, Fanjiang
Xu, Qing Wang
- Abstract summary: PEELING is a text perturbation approach via image-aware property reduction for adversarial testing of the Visual Grounding model.
It achieves a MultiModal Impact score (MMI) of 21.4% and outperforms state-of-the-art image- and text-based baselines by 8.2%--15.1%.
- Score: 12.745111000109178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the advantages of fusing information from various modalities,
multimodal learning is gaining increasing attention. As a fundamental task
of multimodal learning, Visual Grounding (VG) aims to locate objects in images
through natural language expressions. Ensuring the quality of VG models
presents significant challenges due to the complex nature of the task. In the
black box scenario, existing adversarial testing techniques often fail to fully
exploit the potential of both modalities of information. They typically apply
perturbations based solely on either the image or text information,
disregarding the crucial correlation between the two modalities, which would
lead to failures in test oracles or an inability to effectively challenge VG
models. To this end, we propose PEELING, a text perturbation approach via
image-aware property reduction for adversarial testing of the VG model. The
core idea is to reduce the property-related information in the original
expression while ensuring that the reduced expression can still uniquely
describe the original object in the image. To achieve this, PEELING first
extracts the object and its properties from the expression and recombines
them to generate candidate property-reduced expressions. It then selects the
candidates that still accurately describe the original object, while ensuring
that no other object in the image satisfies the expression, by querying the
image with a visual understanding technique. We evaluate PEELING on the
state-of-the-art VG model, i.e., OFA-VG, over three commonly used datasets.
Results show that the adversarial tests generated by PEELING achieve a
MultiModal Impact score (MMI) of 21.4% and outperform state-of-the-art
image- and text-based baselines by 8.2%--15.1%.
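
As a rough sketch of this two-step idea, the snippet below assumes a generic
`vqa(image, question) -> answer` callable (any off-the-shelf visual question
answering model) and a hypothetical `parse` helper that splits an expression
into an object phrase and its property phrases; none of these names come from
PEELING's actual implementation.

```python
from itertools import combinations

def property_reduced_expressions(expression, image, parse, vqa):
    """Illustrative sketch of image-aware property reduction.

    parse(expression) -> (object_phrase, [property_phrases])   # assumed helper
    vqa(image, question) -> free-form answer string             # assumed model
    """
    obj, properties = parse(expression)  # e.g. ("dog", ["black", "on the left"])

    # Step 1: recombine the object with strictly fewer properties
    # (naive concatenation; real recombination would keep the phrase fluent).
    candidates = set()
    for k in range(len(properties)):
        for subset in combinations(properties, k):
            candidates.add(" ".join([obj, *subset]))

    # Step 2: keep only candidates the image itself confirms are unambiguous,
    # i.e. exactly one object in the image matches the reduced description.
    selected = []
    for cand in sorted(candidates):
        answer = vqa(image, f"How many {cand} are there in the image?")
        if answer.strip().lower() in {"1", "one"}:
            selected.append(cand)
    return selected
```

Each surviving reduced expression can then serve as an adversarial test: the
VG model is expected to localize the same object as for the original
expression, so a shifted prediction signals a potential robustness issue.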
Related papers
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
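
As a rough illustration of the meta-prompt idea, the PyTorch-style sketch
below adds learnable prompt embeddings on top of a frozen pre-trained
denoising backbone; the `backbone(images, context=...)` interface and all
names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class MetaPromptExtractor(nn.Module):
    """Learnable prompt embeddings query a frozen pre-trained diffusion
    backbone for perception features (illustrative sketch only)."""

    def __init__(self, backbone, num_prompts=64, dim=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)        # keep the pre-trained weights frozen
        self.meta_prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))

    def forward(self, images):
        batch = images.size(0)
        context = self.meta_prompts.unsqueeze(0).expand(batch, -1, -1)
        # Only the prompts (and any task head on top) receive gradients.
        return self.backbone(images, context=context)
```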
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Randomize to Generalize: Domain Randomization for Runway FOD Detection [1.4249472316161877]
Tiny Object Detection is challenging due to small size, low resolution, occlusion, background clutter, lighting conditions and small object-to-image ratio.
We propose a novel two-stage methodology, Synthetic Randomized Image Augmentation (SRIA), to enhance the generalization capabilities of models encountering 2D datasets.
We report that detection accuracy improved from an initial 41% to 92% on the OOD test set.
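
For intuition only, a generic domain-randomization step might composite
foreign-object cutouts onto runway backgrounds with randomized scale,
rotation, and position; the sketch below is not the paper's SRIA pipeline.

```python
import random
from PIL import Image

def randomized_composite(background: Image.Image, cutout: Image.Image) -> Image.Image:
    """Paste an object cutout onto a background with randomized scale,
    rotation, and position (generic domain-randomization sketch)."""
    bg = background.copy()
    obj = cutout.convert("RGBA")
    scale = random.uniform(0.05, 0.2)                    # small object-to-image ratio
    w = max(1, int(bg.width * scale))
    h = max(1, int(obj.height * w / obj.width))
    obj = obj.resize((w, h)).rotate(random.uniform(0, 360), expand=True)
    x = random.randint(0, max(0, bg.width - obj.width))
    y = random.randint(0, max(0, bg.height - obj.height))
    bg.paste(obj, (x, y), obj)                           # alpha-aware paste
    return bg
```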
arXiv Detail & Related papers (2023-09-23T05:02:31Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- PV2TEA: Patching Visual Modality to Textual-Established Information Extraction [59.76117533540496]
We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
arXiv Detail & Related papers (2023-06-01T05:39:45Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
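
The sketch below illustrates one way a CNN-Transformer hybrid can fuse a
current image with an optional prior image; shapes, layer sizes, and the
class name are assumptions for illustration, not the BioViL-T implementation.

```python
import torch
import torch.nn as nn
import torchvision

class MultiImageEncoder(nn.Module):
    """Fuse current and prior images via a shared CNN plus a Transformer
    (illustrative sketch of a multi-image encoder)."""

    def __init__(self, dim=512, num_layers=4, num_heads=8):
        super().__init__()
        cnn = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-2])   # keep the spatial grid
        self.proj = nn.Conv2d(512, dim, kernel_size=1)
        self.time_embed = nn.Embedding(2, dim)                  # 0 = current, 1 = prior
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers)

    def _tokens(self, image, time_idx):
        feats = self.proj(self.cnn(image))                      # (B, dim, H', W')
        tokens = feats.flatten(2).transpose(1, 2)               # (B, H'*W', dim)
        return tokens + self.time_embed.weight[time_idx]

    def forward(self, current, prior=None):
        tokens = self._tokens(current, 0)
        if prior is not None:
            tokens = torch.cat([tokens, self._tokens(prior, 1)], dim=1)
        return self.fusion(tokens)                              # fused image tokens
```

During joint training, the fused image tokens would then be aligned with the
text model's report representations (e.g. via a contrastive objective).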
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning [25.728621355173626]
A key limitation of current methods is that the output of the model is conditioned only on the object detector's outputs.
We propose to add an auxiliary input to represent missing information such as object relationships.
We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art.
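
A minimal sketch of this kind of conditioning is shown below, assuming
detector region features plus an auxiliary context embedding (for example
from a pre-trained multi-modal model); all dimensions and names are
illustrative rather than the paper's model.

```python
import torch
import torch.nn as nn

class ContextAugmentedConditioning(nn.Module):
    """Concatenate detector region features with an auxiliary context
    embedding before the caption decoder (illustrative sketch)."""

    def __init__(self, region_dim=2048, context_dim=512, dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, dim)
        self.context_proj = nn.Linear(context_dim, dim)

    def forward(self, region_feats, context_feats):
        # region_feats:  (B, num_regions, region_dim) from the object detector
        # context_feats: (B, num_ctx, context_dim) carrying information the
        #                detector misses, e.g. object relationships
        regions = self.region_proj(region_feats)
        context = self.context_proj(context_feats)
        return torch.cat([regions, context], dim=1)  # decoder attends over both
```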
arXiv Detail & Related papers (2022-05-09T15:05:24Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
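
A generic normalized cross-entropy (InfoNCE-style) contrastive loss over
paired image/text embeddings looks roughly as follows; this is a sketch of
the idea, not the paper's exact loss formulation.

```python
import torch
import torch.nn.functional as F

def normalized_cross_entropy_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over L2-normalized image/text embeddings
    (generic sketch)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching image-text pairs lie on the diagonal.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```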
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
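
As a rough sketch of the background-mixup idea, one can blend each training
image with a background-only view (e.g. regions outside the localized object)
so that background cues alone cannot drive the contrastive objective; the
code below is a generic illustration, not ContraCAM's implementation.

```python
import torch

def background_mixup(images, backgrounds, alpha=0.4):
    """Blend images with background-only views to weaken background bias
    during contrastive pre-training (generic sketch).

    images, backgrounds: (B, C, H, W) tensors."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)        # keep the original image dominant
    return lam * images + (1.0 - lam) * backgrounds
```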
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension [39.40351938417889]
Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression.
Some popular referring expression datasets fail to provide an ideal test bed for evaluating the reasoning ability of the models.
We propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
arXiv Detail & Related papers (2020-03-01T04:59:38Z)