Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- URL: http://arxiv.org/abs/2310.01330v1
- Date: Mon, 2 Oct 2023 16:48:50 GMT
- Title: Towards reporting bias in visual-language datasets: bimodal augmentation
by decoupling object-attribute association
- Authors: Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi
Wakaki, Yuki Mitsufuji
- Abstract summary: We focus on the wide existence of reporting bias in visual-language datasets.
We propose a bimodal augmentation (BiAug) approach to mitigate this bias.
BiAug synthesizes visual-language examples with a rich array of object-attribute pairings and constructs cross-modal hard negatives.
- Score: 23.06058982328083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reporting bias arises when people assume that some knowledge is
universally understood and therefore does not require explicit elaboration. In
this paper, we focus on the wide prevalence of reporting bias in visual-language
datasets, embodied as the object-attribute association, which can subsequently
degrade models trained on them. To mitigate this bias, we propose a bimodal
augmentation (BiAug) approach through object-attribute decoupling to flexibly
synthesize visual-language examples with a rich array of object-attribute
pairing and construct cross-modal hard negatives. We employ large language
models (LLMs) in conjunction with a grounding object detector to extract target
objects. Subsequently, the LLM generates a detailed attribute description for
each object and produces a corresponding hard negative counterpart. An
inpainting model is then used to create images based on these detailed object
descriptions. In this way, the synthesized examples make the omitted objects and
attributes explicit for learning, and the hard negative pairs steer the model to
distinguish object attributes. Our experiments demonstrate that BiAug achieves
superior object-attribute understanding. BiAug also improves
the performance on zero-shot retrieval tasks on general benchmarks like MSCOCO
and Flickr30K. BiAug refines the way of collecting text-image datasets.
Mitigating the reporting bias helps models achieve a deeper understanding of
visual-language phenomena, expanding beyond mere frequent patterns to encompass
the richness and diversity of real-world scenarios.
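The augmentation pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch of the data flow only: `extract_objects` and `describe_attribute` are hypothetical stand-ins for the LLM, the grounding object detector, and the hard-negative generator used in the paper, and the `image_prompt` field stands in for the input to the inpainting model.

```python
# Illustrative sketch of the BiAug data flow. The helper functions below are
# hypothetical stand-ins, not the paper's actual models.
from dataclasses import dataclass


@dataclass
class AugmentedExample:
    caption: str        # caption with the omitted attribute made explicit
    hard_negative: str  # same object paired with a contradictory attribute
    image_prompt: str   # detailed description that would drive inpainting


def extract_objects(caption):
    """Stand-in for the LLM + grounding detector: find target object nouns."""
    known_objects = {"cat", "car", "dog"}
    words = [w.strip(".,").lower() for w in caption.split()]
    return [w for w in words if w in known_objects]


def describe_attribute(obj):
    """Stand-in for the LLM: return (attributed object, hard negative)."""
    table = {
        "cat": ("black cat", "white cat"),
        "car": ("red car", "blue car"),
        "dog": ("small dog", "large dog"),
    }
    return table.get(obj, (obj, obj))


def biaug(caption):
    """Decouple each object from its attribute and emit augmented pairs."""
    examples = []
    for obj in extract_objects(caption):
        attributed, negative = describe_attribute(obj)
        examples.append(AugmentedExample(
            caption=caption.replace(obj, attributed),
            hard_negative=caption.replace(obj, negative),
            image_prompt=f"a photo of a {attributed}",
        ))
    return examples


for pair in biaug("A cat sits on a car."):
    print(pair.caption, "|", pair.hard_negative)
```

Each output pair keeps the positive and its hard negative differing only in the attribute, which is what steers a contrastively trained model to attend to object attributes rather than object identity alone.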
Related papers
- Exploiting Contextual Target Attributes for Target Sentiment
Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Object Attribute Matters in Visual Question Answering [15.705504296316576]
We propose a novel VQA approach from the perspective of utilizing object attributes.
The attribute fusion module constructs a multimodal graph neural network to fuse attributes and visual features through message passing.
The better object-level visual-language alignment aids in understanding multimodal scenes, thereby improving the model's robustness.
arXiv Detail & Related papers (2023-12-20T12:46:30Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attributes and objects).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - Learning Dynamic Attribute-factored World Models for Efficient
Multi-object Reinforcement Learning [6.447052211404121]
In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects.
Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency.
We introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework to exploit the benefits of factorization in terms of object attributes.
arXiv Detail & Related papers (2023-07-18T12:41:28Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z) - Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - Learning to Infer Unseen Attribute-Object Compositions [55.58107964602103]
A graph-based model is proposed that can flexibly recognize both single- and multi-attribute-object compositions.
We build a large-scale Multi-Attribute dataset with 116,099 images and 8,030 composition categories.
arXiv Detail & Related papers (2020-10-27T14:57:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.