Causal Debiasing for Visual Commonsense Reasoning
- URL: http://arxiv.org/abs/2510.20281v1
- Date: Thu, 23 Oct 2025 07:10:21 GMT
- Title: Causal Debiasing for Visual Commonsense Reasoning
- Authors: Jiayi Zou, Gengyun Jia, Bing-Kun Bao,
- Abstract summary: We introduce the VCR-OOD datasets, which are designed to evaluate the generalization capabilities of models across two modalities.<n>We analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias.<n>Experiments demonstrate the effectiveness of our debiasing method across different datasets.
- Score: 25.04845013043308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
Related papers
- Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks [85.54792243128695]
"Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets.<n>We leverage VLMs and LLMs to analyze and debias benchmarks from representation biases.<n>We conduct a systematic analysis of 12 popular video classification and retrieval datasets.<n>We benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models.
arXiv Detail & Related papers (2025-03-24T13:00:25Z) - debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias [1.3995965887921709]
We analyze demographic biases across five models and six datasets.<n>Portrait datasets like UTKFace and CelebA are the best tools for bias detection.<n>Our debiasing method improves fairness, gaining 5-15 points in performance over the baseline.
arXiv Detail & Related papers (2024-10-17T02:03:27Z) - Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal
Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z) - Mitigating Representation Bias in Action Recognition: Algorithms and
Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Towards Debiasing Temporal Sentence Grounding in Video [59.42702544312366]
temporal sentence grounding in video (TSGV) task is to locate a temporal moment from an untrimmed video, to match a language query.
Without considering bias in moment annotations, many models tend to capture statistical regularities of the moment annotations.
We propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions.
arXiv Detail & Related papers (2021-11-08T08:18:25Z) - Greedy Gradient Ensemble for Robust Visual Question Answering [163.65789778416172]
We stress the language bias in Visual Question Answering (VQA) that comes from two aspects, i.e., distribution bias and shortcut bias.
We propose a new de-bias framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning.
GGE forces the biased models to over-fit the biased data distribution in priority, thus makes the base model pay more attention to examples that are hard to solve by biased models.
arXiv Detail & Related papers (2021-07-27T08:02:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.