Related papers: Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks

URL: http://arxiv.org/abs/2209.09393v1
Date: Tue, 20 Sep 2022 00:30:35 GMT
Title: Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks
Authors: Haodong Duan, Yue Zhao, Kai Chen, Yuanjun Xiong, Dahua Lin
Abstract summary: Deep learning models perform poorly when applied to videos with rare scenes or objects. We tackle this problem from two different angles: algorithm and dataset. We show that the debiased representation can generalize better when transferred to other datasets and tasks.
Score: 76.35271072704384
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning models have achieved excellent recognition results on large-scale video benchmarks. However, they perform poorly when applied to videos with rare scenes or objects, primarily due to the bias of existing video datasets. We tackle this problem from two different angles: algorithm and dataset. From the perspective of algorithms, we propose Spatial-aware Multi-Aspect Debiasing (SMAD), which incorporates both explicit debiasing with multi-aspect adversarial training and implicit debiasing with the spatial actionness reweighting module, to learn a more generic representation invariant to non-action aspects. To neutralize the intrinsic dataset bias, we propose OmniDebias to leverage web data for joint training selectively, which can achieve higher performance with far less web data. To verify the effectiveness, we establish evaluation protocols and perform extensive experiments on both re-distributed splits of existing datasets and a new evaluation dataset focusing on actions with rare scenes. We also show that the debiased representation can generalize better when transferred to other datasets and tasks.

Related papers

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks [85.54792243128695]
We present the "Unbiased through Textual Description (UTD)" video benchmark, based on unbiased subsets of existing video classification and retrieval datasets. We leverage VLMs and LLMs to analyze and debias benchmarks from representation biases. We conduct a systematic analysis of 12 popular video classification and retrieval datasets. We benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models.
arXiv Detail & Related papers (2025-03-24T13:00:25Z)
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias [1.3995965887921709]
We analyze demographic biases across five models and six datasets. Portrait datasets like UTKFace and CelebA are the best tools for bias detection. Our debiasing method improves fairness, gaining 5-15 points in performance over the baseline.
arXiv Detail & Related papers (2024-10-17T02:03:27Z)
Model Debiasing by Learnable Data Augmentation [19.625915578646758]
This paper proposes a novel 2-stage learning pipeline featuring a data augmentation strategy able to regularize the training. Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods.
arXiv Detail & Related papers (2024-08-09T09:19:59Z)
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions. We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding. We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data. We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video. Recent studies have found that current benchmark datasets may have obvious moment annotation biases. We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
Adaptive graph convolutional networks for weakly supervised anomaly detection in videos [42.3118758940767]
We propose a weakly supervised adaptive graph convolutional network (WAGCN) to model the contextual relationships among video segments. We fully consider the influence of other video segments on the current segment when generating the anomaly probability score for each segment.
arXiv Detail & Related papers (2022-02-14T06:31:34Z)
Learning Bias-Invariant Representation by Cross-Sample Mutual Information Minimization [77.8735802150511]
We propose a cross-sample adversarial debiasing (CSAD) method to remove the bias information misused by the target task. The correlation measurement plays a critical role in adversarial debiasing and is conducted by a cross-sample neural mutual information estimator. We conduct thorough experiments on publicly available datasets to validate the advantages of the proposed method over state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T21:17:02Z)
Interventional Video Grounding with Dual Contrastive Learning [16.0734337895897]
Video grounding aims to localize a moment from an untrimmed video for a given textual query. We propose a novel paradigm from the perspective of causal inference to uncover the causality behind the model and data. We also introduce a dual contrastive learning approach to better align the text and video.
arXiv Detail & Related papers (2021-06-21T12:11:28Z)
Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. In this paper, we first study how biases in the dataset affect existing methods. We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos. Because spatio-temporal information is important for video representation, we extend negative samples by introducing intra-negative samples. We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.