Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
- URL: http://arxiv.org/abs/2503.18637v1
- Date: Mon, 24 Mar 2025 13:00:25 GMT
- Title: Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
- Authors: Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht
- Abstract summary: We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets. We leverage VLMs and LLMs to analyze and debias benchmarks from representation biases. We conduct a systematic analysis of 12 popular video classification and retrieval datasets. We benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models.
- Score: 85.54792243128695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
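As a rough illustration of the concept-bias ("objects alone") check described in the abstract, the following Python sketch shows how object-only descriptions could be used to flag test samples that are solvable without any temporal or richer visual reasoning. The helpers `caption_frames`, `extract_objects`, and `answer_from_text` are hypothetical stand-ins for the VLM and LLM stages, not the authors' released code.

```python
# Hypothetical sketch of the concept-bias check; the three helper functions below are
# placeholders for the VLM / LLM stages and are NOT the paper's actual implementation.
from typing import List


def caption_frames(video_path: str) -> List[str]:
    """Placeholder: a VLM would return one textual description per sampled frame."""
    raise NotImplementedError("plug in a vision-language model here")


def extract_objects(descriptions: List[str]) -> List[str]:
    """Placeholder: an LLM would filter each description down to object mentions only."""
    raise NotImplementedError("plug in a large language model here")


def answer_from_text(object_mentions: List[str], candidate_labels: List[str]) -> str:
    """Placeholder: a text-only model picks the most likely label from object mentions."""
    raise NotImplementedError("plug in a text-only classifier here")


def is_object_biased(video_path: str, true_label: str, candidate_labels: List[str]) -> bool:
    """Flag a test sample as object-biased if object mentions alone recover the label.

    Samples flagged here would be excluded from an object-debiased test split.
    """
    descriptions = caption_frames(video_path)
    objects_only = extract_objects(descriptions)
    return answer_from_text(objects_only, candidate_labels) == true_label


# Object-debiased split: keep only samples that cannot be solved from objects alone, e.g.
# utd_split = [s for s in test_set if not is_object_biased(s.path, s.label, class_names)]
```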
Related papers
- debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias [1.3995965887921709]
We analyze demographic biases across five models and six datasets.
Portrait datasets like UTKFace and CelebA are the best tools for bias detection.
Our debiasing method improves fairness, gaining 5-15 points in performance over the baseline.
arXiv Detail & Related papers (2024-10-17T02:03:27Z)
- Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention [72.12974259966592]
We present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips.
We propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
arXiv Detail & Related papers (2023-09-17T15:58:27Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Meta Spatio-Temporal Debiasing for Video Scene Graph Generation [22.216881800098726]
We propose a novel Meta Video Scene Graph Generation (MVSGG) framework to address the bias problem.
Our framework first constructs a support set and a group of query sets from the training data.
Then, by performing a meta training and testing process to optimize the model, our framework effectively guides the model to learn to be robust against biases.
arXiv Detail & Related papers (2022-07-23T07:06:06Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Towards Debiasing Temporal Sentence Grounding in Video [59.42702544312366]
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query.
Without accounting for the bias in moment annotations, many models tend to capture their statistical regularities.
We propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions.
arXiv Detail & Related papers (2021-11-08T08:18:25Z)
- Greedy Gradient Ensemble for Robust Visual Question Answering [163.65789778416172]
We stress the language bias in Visual Question Answering (VQA) that comes from two aspects, i.e., distribution bias and shortcut bias.
We propose a new de-bias framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning.
GGE forces the biased models to over-fit the biased data distribution first, thus making the base model pay more attention to examples that are hard for the biased models to solve.
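To make the "focus on examples the biased models cannot solve" intuition concrete, here is a toy Python sketch on synthetic data. It uses a simple error-based re-weighting rather than GGE's actual gradient-ensemble objective, so treat it only as an illustration of the idea.

```python
# Toy illustration of the "hard example" intuition, NOT the GGE objective:
# a biased classifier is fit on a shortcut feature alone, and the main classifier
# is then re-weighted toward the examples the biased classifier gets wrong.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
shortcut = rng.integers(0, 2, size=n)                  # spurious feature
signal = rng.integers(0, 2, size=n)                    # genuinely predictive feature
y = np.where(rng.random(n) < 0.8, shortcut, signal)    # labels mostly follow the shortcut

X_shortcut = shortcut.reshape(-1, 1).astype(float)
X_full = np.stack([shortcut, signal], axis=1).astype(float)

# 1) The biased model deliberately (over-)fits the shortcut feature.
biased = LogisticRegression().fit(X_shortcut, y)
hard = biased.predict(X_shortcut) != y                 # examples the shortcut cannot explain

# 2) The base model is trained with extra weight on those hard examples,
#    pushing it to rely on the genuine signal instead of the shortcut.
base = LogisticRegression().fit(X_full, y, sample_weight=np.where(hard, 5.0, 1.0))
```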
arXiv Detail & Related papers (2021-07-27T08:02:49Z)
- Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query.
We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data.
We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)