Incorporating granularity bias as the margin into contrastive loss for
video captioning
- URL: http://arxiv.org/abs/2311.14977v1
- Date: Sat, 25 Nov 2023 09:38:24 GMT
- Title: Incorporating granularity bias as the margin into contrastive loss for
video captioning
- Authors: Jiayang Gu, Fengming Yao
- Abstract summary: Long-tail distribution of phrases makes captioning models prone to generate vague sentences instead of accurate ones.
We introduce a statistical-based bias extractor to estimate the likelihood that a video-sentence pair is affected by granularity bias.
We then incorporate the margin score into the contrastive learning loss, establishing training objectives for head and tail sentences.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning models easily suffer from the long-tail distribution of phrases, which makes them prone to generating vague sentences instead of accurate ones. However, existing debiasing strategies tend to rely on external knowledge to build dependency trees of words, or to refine the frequency distribution with complex losses and extra input features, which lack interpretability and are hard to train. To mitigate the impact of granularity bias on the model, we introduce a statistical-based bias extractor. This extractor quantifies the information content within sentences and videos, providing an estimate of the likelihood that a video-sentence pair is affected by granularity bias. Furthermore, following the growing trend of integrating contrastive learning methods into video captioning tasks, we use a bidirectional triplet loss to obtain more negative samples within a batch. Subsequently, we incorporate the margin score into the contrastive learning loss, establishing distinct training objectives for head and tail sentences. This approach improves the model's training effectiveness on tail samples. Our simple yet effective loss, which incorporates Granularity bias as the Margin into the Contrastive loss, is referred to as the GMC Loss. The proposed model demonstrates state-of-the-art performance on MSRVTT, with a CIDEr of 57.17, and on MSVD, where CIDEr reaches 138.68.
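To make the idea concrete, below is a minimal PyTorch sketch of a bias-dependent margin inside a bidirectional triplet loss. It assumes the bias extractor has already produced a per-pair `bias_score`; the margin schedule (`base_margin + scale * bias_score`) and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gmc_style_loss(video_emb, text_emb, bias_score, base_margin=0.2, scale=0.1):
    """Sketch of a granularity-bias margin contrastive (GMC-style) loss.

    video_emb, text_emb: (B, D) embeddings of matched video-caption pairs.
    bias_score: (B,) estimated likelihood that each pair is affected by
        granularity bias (e.g. from a statistical bias extractor); we assume
        higher means more tail-like / fine-grained.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = video_emb @ text_emb.t()               # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                # matched pairs sit on the diagonal
    margin = base_margin + scale * bias_score.unsqueeze(1)  # per-pair margin

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Bidirectional triplet loss: every off-diagonal batch entry is a negative,
    # in both the video->text and the text->video direction.
    v2t = (margin + sim - pos).masked_fill(mask, 0.0).clamp(min=0)
    t2v = (margin + sim.t() - pos).masked_fill(mask, 0.0).clamp(min=0)
    return (v2t.sum() + t2v.sum()) / (2 * sim.size(0))
```

The design intent is that pairs judged more likely to suffer granularity bias (tail, fine-grained captions) receive a larger margin, so the model is pushed harder to separate them from in-batch negatives, while head (vague) captions are trained with a gentler objective.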
Related papers
- Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations [7.052925981783274]
We propose a novel debiasing framework for LVLMs by directly ablating biased attributes during text generation.
Our method requires no training and only a relatively small number of representative biased outputs.
Our experiments show that not only can we minimize the propensity of LVLMs to generate text related to protected attributes, but we can even use synthetic data to inform the ablation.
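As a rough illustration of the ablation idea, the sketch below removes the component of a hidden state along an estimated attribute direction; the function name and the mean-difference estimation hinted at in the comment are assumptions, not the paper's exact procedure.

```python
import torch

def ablate_attribute(hidden, direction):
    """Zero out the component of activations along a 'protected attribute'
    direction, leaving the rest of the representation untouched.

    hidden: (..., D) activations at some layer during generation.
    direction: (D,) vector estimated from a small set of representative
        biased outputs (e.g. a mean difference of activations; assumed here).
    """
    direction = direction / direction.norm()
    proj = (hidden @ direction).unsqueeze(-1) * direction  # biased component
    return hidden - proj                                   # ablated activations
```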
arXiv Detail & Related papers (2024-10-17T19:02:31Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is influenced primarily by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Unmasking Bias in Diffusion Model Training [40.90066994983719]
Denoising diffusion models have emerged as a dominant approach for image generation.
They still suffer from slow convergence in training and color shift issues in sampling.
In this paper, we identify that these obstacles can be largely attributed to bias and suboptimality inherent in the default training paradigm.
arXiv Detail & Related papers (2023-10-12T16:04:41Z) - TDCGL: Two-Level Debiased Contrastive Graph Learning for Recommendation [1.5836776102398225]
The long-tailed distribution of knowledge graph (KG) entities and real-world noise cause item-entity relations to deviate from the true characteristics.
We design Two-Level Debiased Contrastive Learning (TDCL) and deploy it on the knowledge graph.
Extensive experiments on open-source datasets demonstrate that our method has excellent anti-noise capability.
arXiv Detail & Related papers (2023-10-01T03:56:38Z) - Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features, addressing the dynamic nature of bias that prior methods neglect.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z) - Exploring the Impact of Negative Samples of Contrastive Learning: A Case
Study of Sentence Embedding [14.295787044482136]
We present a momentum contrastive learning model with a negative sample queue for sentence embedding, namely MoCoSE.
We define a maximum traceable distance metric, through which we learn to what extent text contrastive learning benefits from the historical information of negative samples.
Our experiments find that the best results are obtained when the maximum traceable distance lies within a certain range, demonstrating that there is an optimal amount of historical information for a negative sample queue.
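For intuition, a minimal sketch of the negative-queue mechanism is given below; the queue size, embedding dimension, and update rule are illustrative assumptions rather than MoCoSE's exact configuration.

```python
import torch
import torch.nn.functional as F

class NegativeQueue:
    """FIFO queue of past sentence embeddings, MoCo-style.

    The queue length (together with the batch size) bounds how many update
    steps old the stalest negative can be, i.e. the 'traceable distance' of
    the historical information used as negatives.
    """
    def __init__(self, dim=768, size=4096):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):                  # keys: (B, dim) from a momentum encoder
        b = keys.size(0)
        idx = (self.ptr + torch.arange(b)) % self.queue.size(0)
        self.queue[idx] = F.normalize(keys, dim=-1)  # overwrite the oldest slots
        self.ptr = (self.ptr + b) % self.queue.size(0)
```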
arXiv Detail & Related papers (2022-02-26T08:29:25Z) - Robust Audio-Visual Instance Discrimination [79.74625434659443]
We present a self-supervised learning method to learn audio and video representations.
We address the problems of audio-visual instance discrimination and improve transfer learning performance.
arXiv Detail & Related papers (2021-03-29T19:52:29Z) - Improving Robustness by Augmenting Training Sentences with
Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective.
We propose to augment the input sentences in the training data with their corresponding predicate-argument structures.
We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z) - A Simple but Tough-to-Beat Data Augmentation Approach for Natural
Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
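A minimal sketch of the span-based variant is shown below; the cutoff ratio and the choice of zeroing token embeddings (rather than tokens or feature dimensions, which the approach also covers) are illustrative assumptions.

```python
import torch

def span_cutoff(embeddings, ratio=0.1):
    """Zero out one random contiguous span of token embeddings per example,
    so the model trains on a restricted view of each input.

    embeddings: (B, T, D) token embeddings for a batch of sentences.
    """
    B, T, D = embeddings.shape
    out = embeddings.clone()
    span = max(1, int(T * ratio))
    for i in range(B):
        start = torch.randint(0, T - span + 1, (1,)).item()
        out[i, start:start + span] = 0.0     # erase the sampled span
    return out
```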
arXiv Detail & Related papers (2020-09-29T07:08:35Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)