On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- URL: http://arxiv.org/abs/2404.08540v1
- Date: Fri, 12 Apr 2024 15:35:20 GMT
- Title: On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- Authors: Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang
- Abstract summary: We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation.
Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions.
Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift.
- Score: 71.72465617754553
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and counter-intuitively fare worse with low-level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift. Finally, to provide a foundation for future research, we identify points of failure and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings.
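As an illustration of the kind of "low-level" sentence the abstract describes, the sketch below composes an object-centric sentence about three-dimensional spatial relationships from detection-style inputs. The `DetectedObject` record and the sentence template are assumptions for illustration, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    # Hypothetical detection record: object name, image-plane x-center
    # (pixels), and mean depth (meters) within the object's region.
    name: str
    x_center: float
    mean_depth: float

def low_level_sentence(a: DetectedObject, b: DetectedObject) -> str:
    """Compose an object-centric sentence describing the 3D spatial
    relationship between two detected objects."""
    depth_rel = ("closer to the camera than" if a.mean_depth < b.mean_depth
                 else "farther from the camera than")
    side = "to the left of" if a.x_center < b.x_center else "to the right of"
    return f"The {a.name} is {depth_rel} the {b.name} and {side} it."

chair = DetectedObject("chair", x_center=120.0, mean_depth=1.4)
table = DetectedObject("table", x_center=340.0, mean_depth=2.1)
print(low_level_sentence(chair, table))
# → The chair is closer to the camera than the table and to the left of it.
```

Sentences of this form could then be supplied as an additional language prior alongside scene-level captions.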
Related papers
- Natural Language Inference Improves Compositionality in Vision-Language Models [35.71815423077561]
We present a principled approach that generates entailments and contradictions from a given premise.
CECE produces lexically diverse sentences while maintaining their core meaning.
We achieve significant improvements over previous methods without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-10-29T17:54:17Z)
- Understanding and Addressing the Under-Translation Problem from the Perspective of Decoding Objective [72.83966378613238]
Under-translation and over-translation remain two challenging problems in state-of-the-art Neural Machine Translation (NMT) systems.
We conduct an in-depth analysis on the underlying cause of under-translation in NMT, providing an explanation from the perspective of decoding objective.
We propose employing the confidence of predicting End Of Sentence (EOS) as a detector for under-translation, and strengthening the confidence-based penalty to penalize candidates with a high risk of under-translation.
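The EOS-confidence idea in this abstract can be sketched as follows; treating a low probability of emitting End Of Sentence as a sign of a truncated candidate, and subtracting a confidence-based penalty. The threshold and penalty weight are illustrative hyperparameters, not values from the paper.

```python
import math

def eos_confidence_penalty(candidate_logprobs, eos_logprob,
                           threshold=0.5, alpha=2.0):
    """Score a beam-search candidate, penalizing likely under-translation.

    candidate_logprobs: per-token log-probabilities of the candidate.
    eos_logprob: log-probability the model assigns to EOS at the final step.
    threshold, alpha: illustrative hyperparameters (assumptions).
    """
    score = sum(candidate_logprobs)    # base sequence log-probability
    eos_prob = math.exp(eos_logprob)   # model confidence in stopping here
    if eos_prob < threshold:           # low confidence: likely truncated
        score -= alpha * (threshold - eos_prob)
    return score
```

A candidate that the model is reluctant to end (low EOS probability) is thus ranked below an otherwise equally likely candidate that terminates confidently.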
arXiv Detail & Related papers (2024-05-29T09:25:49Z)
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
- Guiding Computational Stance Detection with Expanded Stance Triangle Framework [25.2980607215715]
Stance detection determines whether the author of a piece of text is in favor of, against, or neutral towards a specified target.
We decompose the stance detection task from a linguistic perspective, and investigate key components and inference paths in this task.
arXiv Detail & Related papers (2023-05-31T13:33:29Z)
- TAPE: Assessing Few-shot Russian Language Understanding [1.9859374437454114]
TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark that includes six more complex NLU tasks for Russian.
The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most.
We publicly release TAPE to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
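A minimal example of a spelling-based perturbation of the kind the TAPE abstract refers to: an adjacent-character swap, a common typo-style corruption. TAPE's actual perturbation set may differ; this is only a sketch.

```python
import random

def swap_perturbation(word: str, rng: random.Random) -> str:
    """Corrupt a word by swapping two adjacent characters,
    simulating a simple spelling error."""
    if len(word) < 2:
        return word  # nothing to swap
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Applying such perturbations at evaluation time probes how brittle a model's task performance is to surface-level noise.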
arXiv Detail & Related papers (2022-10-23T18:28:25Z)
- Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative objective yields substantial performance improvements and outperforms current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z)
- Towards explainable evaluation of language models on the semantic similarity of visual concepts [0.0]
We examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies.
First, we address the need for explainable evaluation metrics, necessary for understanding the conceptual quality of retrieved instances.
Secondly, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and highlight patterns in learned linguistic representations.
arXiv Detail & Related papers (2022-09-08T11:40:57Z)
- A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
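A hedged sketch of the cutoff idea: erase a contiguous span of the input so the model sees a restricted view, then enforce prediction consistency across such views during training. Using `[PAD]` as the erased symbol and a 15% ratio are assumptions for illustration.

```python
import random

def span_cutoff(tokens: list, cutoff_ratio: float = 0.15,
                rng: random.Random = None) -> list:
    """Return an augmented view of `tokens` with one contiguous
    span replaced by a pad symbol (span cutoff)."""
    rng = rng or random.Random()
    n = len(tokens)
    span = max(1, int(n * cutoff_ratio))       # length of the erased span
    start = rng.randrange(n - span + 1)        # random span position
    return tokens[:start] + ["[PAD]"] * span + tokens[start + span:]
```

Because each augmented view is just a masked copy of the original input, the approach relies on sampling and a consistency loss rather than gradient computation, which matches the abstract's claim of little computational overhead.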
arXiv Detail & Related papers (2020-09-29T07:08:35Z)
- Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z)
- On the uncertainty of self-supervised monocular depth estimation [52.13311094743952]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all.
We explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy.
We propose a novel technique specifically designed for self-supervised approaches.
arXiv Detail & Related papers (2020-05-13T09:00:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.