Reducing the Vision and Language Bias for Temporal Sentence Grounding
- URL: http://arxiv.org/abs/2207.13457v1
- Date: Wed, 27 Jul 2022 11:18:45 GMT
- Title: Reducing the Vision and Language Bias for Temporal Sentence Grounding
- Authors: Daizong Liu, Xiaoye Qu, Wei Hu
- Abstract summary: We propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both vision and language modalities.
We demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets.
- Score: 22.571577672704716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding (TSG) is an important yet challenging task in
multimedia information retrieval. Although previous TSG methods have achieved
decent performance, they tend to capture the selection biases of frequently
appearing video-query pairs in the dataset rather than exhibit robust multimodal
reasoning abilities, especially for rarely appearing pairs. In this paper,
we study the above issue of selection biases and accordingly propose a
Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both
vision and language modalities for enhancing the model generalization ability.
Specifically, we propose to alleviate the issue from two perspectives: 1)
Feature distillation. We build a multi-modal debiasing branch to first
capture the vision and language biases, then apply a bias identification
module to explicitly recognize the true negative biases and remove them from
the benign multi-modal representations. 2) Contrastive sample generation. We
construct two types of negative samples to force the model to accurately
learn the aligned multi-modal semantics and perform complete semantic reasoning.
We apply the proposed model to both commonly and rarely appearing TSG cases, and
demonstrate its effectiveness by achieving state-of-the-art performance on
three benchmark datasets (ActivityNet Captions, TACoS, and Charades-STA).
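To make the two ingredients above concrete, here is a minimal, hypothetical PyTorch-style sketch of the idea (not the authors' code): a gating network flags the bias component of the fused multimodal features and subtracts it, and in-batch query shuffling yields cheap contrastive negatives. All names and sizes are assumptions.

```python
# Minimal sketch of the two ideas above; all names and dimensions are
# hypothetical, not the authors' implementation.
import torch
import torch.nn as nn

class BiasIdentification(nn.Module):
    """Gate each fused-feature dimension by how bias-driven it looks,
    then subtract that estimated bias component."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.Sigmoid(),  # gate values in [0, 1]
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        bias_part = self.gate(fused) * fused   # estimated negative-bias component
        return fused - bias_part               # debiased representation

def shuffled_query_negatives(video_feats: torch.Tensor,
                             query_feats: torch.Tensor):
    """One cheap contrastive negative: pair each video with a query
    drawn from a different sample in the batch (mismatched semantics)."""
    perm = torch.randperm(query_feats.size(0))
    return video_feats, query_feats[perm]

# Toy usage with random tensors standing in for encoder outputs.
fused = torch.randn(8, 512)
debiased = BiasIdentification()(fused)
```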
Related papers
- Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by probing a biased model in edge cases of excessive bias.
Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances.
The direct, incautious responses of the unaligned models suggest that the decisiveness of model responses needs further refinement.
arXiv Detail & Related papers (2024-08-15T15:23:00Z)
- VersusDebias: Universal Zero-Shot Debiasing for Text-to-Image Models via SLM-Based Prompt Engineering and Generative Adversary [8.24274551090375]
We introduce VersusDebias, a novel and universal debiasing framework for biases in arbitrary Text-to-Image (T2I) models.
The self-adaptive module generates specialized attribute arrays to post-process hallucinations and debias multiple attributes simultaneously.
In both zero-shot and few-shot scenarios, VersusDebias outperforms existing methods, showcasing its exceptional utility.
arXiv Detail & Related papers (2024-07-28T16:24:07Z)
- Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video [67.24316233946381]
Temporal Sentence Grounding in Video (TSGV) suffers from a dataset bias issue.
We propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD).
arXiv Detail & Related papers (2024-01-15T09:59:43Z)
- General Debiasing for Multimodal Sentiment Analysis [47.05329012210878]
We propose a general debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD) generalization ability of MSA models.
We employ inverse probability weighting (IPW) to down-weight heavily biased samples, facilitating robust feature learning for sentiment prediction (a minimal IPW sketch appears after this list).
The empirical results demonstrate the superior generalization ability of our proposed framework.
arXiv Detail & Related papers (2023-07-20T00:36:41Z)
- Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target.
The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well.
We propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.
arXiv Detail & Related papers (2022-09-10T19:04:40Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence-generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Learning Sample Importance for Cross-Scenario Video Temporal Grounding [30.82619216537177]
The paper investigates some superficial biases specific to the temporal grounding task.
We propose a novel method called Debiased Temporal Language Localizer (DebiasTLL) to prevent the model from naively memorizing the biases.
We evaluate the proposed model in cross-scenario temporal grounding, where the train / test data are heterogeneously sourced.
arXiv Detail & Related papers (2022-01-08T15:41:38Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space.
GGD learns a more robust base model in both settings: with task-specific biased models built from prior knowledge, and with a self-ensemble biased model that requires no prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Towards Debiasing Temporal Sentence Grounding in Video [59.42702544312366]
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query.
Without accounting for bias in moment annotations, many models simply capture the statistical regularities of those annotations.
We propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions.
arXiv Detail & Related papers (2021-11-08T08:18:25Z)
- Greedy Gradient Ensemble for Robust Visual Question Answering [163.65789778416172]
We stress the language bias in Visual Question Answering (VQA) that comes from two aspects, i.e., distribution bias and shortcut bias.
We propose a new de-bias framework, Greedy Gradient Ensemble (GGE), which combines multiple biased models for unbiased base model learning.
GGE forces the biased models to over-fit the biased data distribution first, thus making the base model pay more attention to examples that the biased models find hard to solve (see the sketch after this list).
arXiv Detail & Related papers (2021-07-27T08:02:49Z)
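For the IPW-based entry above ("General Debiasing for Multimodal Sentiment Analysis"), this is a generic inverse-probability-weighting sketch, assuming per-sample bias propensities have already been estimated by some auxiliary model; the function and variable names are hypothetical, not the paper's formulation.

```python
# Generic IPW sketch: down-weight samples with high estimated bias
# propensity. How the propensity is estimated is left abstract here.
import torch

def ipw_loss(per_sample_loss: torch.Tensor,
             propensity: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    weights = 1.0 / (propensity + eps)   # inverse probability weights
    weights = weights / weights.sum()    # normalize for a stable scale
    return (weights * per_sample_loss).sum()

losses = torch.rand(16)                      # stand-in per-sample task losses
propensity = torch.rand(16).clamp(0.1, 1.0)  # stand-in bias propensities
print(ipw_loss(losses, propensity))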
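And for the greedy-ensemble entries above ("General Greedy De-bias Learning" and "Greedy Gradient Ensemble"), here is a hedged sketch of the common pattern: let a biased branch absorb the bias-aligned examples, then weight the base model's loss toward samples the biased branch gets wrong. Shapes and names are illustrative only, not the official GGE/GGD code.

```python
# Illustrative greedy-ensemble debiasing losses; a sketch, not GGE/GGD itself.
import torch
import torch.nn.functional as F

def greedy_debias_losses(biased_logits, base_logits, labels):
    # The biased branch trains on the plain task loss and overfits bias first.
    biased_loss = F.cross_entropy(biased_logits, labels)
    # Per-sample probability the biased branch assigns to the true label.
    probs = F.softmax(biased_logits.detach(), dim=-1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    # The base branch is up-weighted exactly where the biased branch fails.
    base_loss = ((1.0 - p_true) *
                 F.cross_entropy(base_logits, labels, reduction="none")).mean()
    return biased_loss, base_loss

biased_logits = torch.randn(4, 10)
base_logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
print(greedy_debias_losses(biased_logits, base_logits, labels))
```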