Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- URL: http://arxiv.org/abs/2310.01330v1
- Date: Mon, 2 Oct 2023 16:48:50 GMT
- Title: Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- Authors: Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi
Wakaki, Yuki Mitsufuji
- Abstract summary: We focus on the wide existence of reporting bias in visual-language datasets.
We propose a bimodal augmentation (BiAug) approach to mitigate this bias.
BiAug synthesizes visual-language examples with a rich array of object-attribute pairings and constructs cross-modal hard negatives.
- Score: 23.06058982328083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reporting bias arises when people assume that some knowledge is
universally understood and therefore does not require explicit elaboration. In
this paper, we focus on the wide prevalence of reporting bias in visual-language
datasets, embodied as the object-attribute association, which can subsequently
degrade models trained on them. To mitigate this bias, we propose a bimodal
augmentation (BiAug) approach through object-attribute decoupling to flexibly
synthesize visual-language examples with a rich array of object-attribute
pairing and construct cross-modal hard negatives. We employ large language
models (LLMs) in conjunction with a grounding object detector to extract target
objects. Subsequently, the LLM generates a detailed attribute description for
each object and produces a corresponding hard negative counterpart. An
inpainting model is then used to create images based on these detailed object
descriptions. In this way, the synthesized examples make the omitted objects and
attributes explicit for learning, and the hard negative pairs steer the model to
distinguish object attributes. Our experiments demonstrate that BiAug achieves
superior object-attribute understanding. BiAug also improves
the performance on zero-shot retrieval tasks on general benchmarks like MSCOCO
and Flickr30K. BiAug refines the way of collecting text-image datasets.
Mitigating the reporting bias helps models achieve a deeper understanding of
visual-language phenomena, expanding beyond mere frequent patterns to encompass
the richness and diversity of real-world scenarios.
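The augmentation pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch of the data flow only: `extract_objects` and `describe_attribute` are hypothetical stand-ins for the LLM, the grounding object detector, and the hard-negative generator used in the paper, and the `image_prompt` field stands in for the input to the inpainting model.

```python
# Illustrative sketch of the BiAug data flow. The helper functions below are
# hypothetical stand-ins, not the paper's actual models.
from dataclasses import dataclass


@dataclass
class AugmentedExample:
    caption: str        # caption with the omitted attribute made explicit
    hard_negative: str  # same object paired with a contradictory attribute
    image_prompt: str   # detailed description that would drive inpainting


def extract_objects(caption):
    """Stand-in for the LLM + grounding detector: find target object nouns."""
    known_objects = {"cat", "car", "dog"}
    words = [w.strip(".,").lower() for w in caption.split()]
    return [w for w in words if w in known_objects]


def describe_attribute(obj):
    """Stand-in for the LLM: return (attributed object, hard negative)."""
    table = {
        "cat": ("black cat", "white cat"),
        "car": ("red car", "blue car"),
        "dog": ("small dog", "large dog"),
    }
    return table.get(obj, (obj, obj))


def biaug(caption):
    """Decouple each object from its attribute and emit augmented pairs."""
    examples = []
    for obj in extract_objects(caption):
        attributed, negative = describe_attribute(obj)
        examples.append(AugmentedExample(
            caption=caption.replace(obj, attributed),
            hard_negative=caption.replace(obj, negative),
            image_prompt=f"a photo of a {attributed}",
        ))
    return examples


for pair in biaug("A cat sits on a car."):
    print(pair.caption, "|", pair.hard_negative)
```

Each output pair keeps the positive and its hard negative differing only in the attribute, which is what steers a contrastively trained model to attend to object attributes rather than object identity alone.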
Related papers
- Exploiting Contextual Target Attributes for Target Sentiment
Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attributes and objects).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - Learning Dynamic Attribute-factored World Models for Efficient
Multi-object Reinforcement Learning [6.447052211404121]
In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects.
Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency.
We introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework to exploit the benefits of factorization in terms of object attributes.
arXiv Detail & Related papers (2023-07-18T12:41:28Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - Learning to Infer Unseen Attribute-Object Compositions [55.58107964602103]
A graph-based model is proposed that can flexibly recognize both single- and multi-attribute-object compositions.
We build a large-scale Multi-Attribute dataset with 116,099 images and 8,030 composition categories.
arXiv Detail & Related papers (2020-10-27T14:57:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.