Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- URL: http://arxiv.org/abs/2310.01330v1
- Date: Mon, 2 Oct 2023 16:48:50 GMT
- Title: Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- Authors: Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi
Wakaki, Yuki Mitsufuji
- Abstract summary: We focus on the widespread presence of reporting bias in visual-language datasets.
We propose a bimodal augmentation (BiAug) approach to mitigate this bias.
BiAug synthesizes visual-language examples with a rich array of object-attribute pairings and constructs cross-modal hard negatives.
- Score: 23.06058982328083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reporting bias arises when people assume that some knowledge is
universally understood and therefore does not require explicit elaboration. In
this paper, we focus on the widespread presence of reporting bias in
visual-language datasets, manifested in under-reported object-attribute
associations, which can subsequently degrade models trained on such data. To
mitigate this bias, we propose a bimodal augmentation (BiAug) approach that
decouples objects from attributes to flexibly synthesize visual-language
examples with a rich array of object-attribute pairings and to construct
cross-modal hard negatives. We employ large language models (LLMs) in
conjunction with a grounding object detector to extract target objects. The LLM
then generates a detailed attribute description for each object along with a
corresponding hard negative counterpart, and an inpainting model creates images
from these detailed object descriptions. In this way, the synthesized examples
explicitly supply the omitted objects and attributes for learning, and the hard
negative pairs steer the model to distinguish object attributes. Our
experiments demonstrate that BiAug yields superior object-attribute
understanding and also improves performance on zero-shot retrieval tasks on
general benchmarks such as MSCOCO and Flickr30K. BiAug refines the way
text-image datasets are collected: mitigating reporting bias helps models
achieve a deeper understanding of visual-language phenomena, expanding beyond
merely frequent patterns to encompass the richness and diversity of real-world
scenarios.
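The abstract describes a three-step synthesis pipeline: ground target objects, have an LLM spell out an attribute description plus a hard-negative counterpart for each object, and inpaint images that depict the stated attributes. The sketch below illustrates that flow in Python; the callables detect_objects, llm_describe, and inpaint, and the AugmentedPair container, are hypothetical placeholders for a grounding detector, an LLM, and an inpainting model, not the authors' released code.

```python
# Hypothetical sketch of a BiAug-style augmentation loop (function and
# argument names are placeholders, not the authors' released API).

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AugmentedPair:
    image: object        # synthesized image with the stated attribute inpainted
    caption: str         # attribute-rich positive caption
    hard_negative: str   # caption that differs only in the attribute


def biaug_augment(
    image: object,
    caption: str,
    detect_objects: Callable[[object, str], List[dict]],  # detector: (image, caption) -> [{"name", "box"}]
    llm_describe: Callable[[str], Tuple[str, str]],       # LLM: object name -> (description, hard negative)
    inpaint: Callable[[object, dict, str], object],       # inpainter: (image, box, description) -> new image
) -> List[AugmentedPair]:
    """Decouple object-attribute associations and synthesize new image-text pairs."""
    augmented: List[AugmentedPair] = []
    for obj in detect_objects(image, caption):
        # 1. The LLM makes an implicit attribute explicit (e.g. "a wooden chair")
        #    and produces a contradictory counterpart ("a metal chair").
        description, hard_negative = llm_describe(obj["name"])
        # 2. The inpainting model redraws the object region so the image
        #    actually depicts the stated attribute.
        new_image = inpaint(image, obj["box"], description)
        # 3. (new_image, description) forms a positive pair; the hard negative
        #    serves as a cross-modal negative that forces the model to
        #    distinguish object attributes.
        augmented.append(AugmentedPair(new_image, description, hard_negative))
    return augmented
```

Passing the detector, LLM, and inpainting model in as callables keeps the sketch independent of any particular checkpoint; in practice, the positive caption and its hard negative would feed a contrastive image-text objective so the model must attend to the attribute that differs.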
Related papers
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z) - FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z) - ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z) - ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - Exploiting Contextual Target Attributes for Target Sentiment
Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - Learning Dynamic Attribute-factored World Models for Efficient
Multi-object Reinforcement Learning [6.447052211404121]
In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects.
Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency.
We introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework to exploit the benefits of factorization in terms of object attributes.
arXiv Detail & Related papers (2023-07-18T12:41:28Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)