On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- URL: http://arxiv.org/abs/2404.08540v1
- Date: Fri, 12 Apr 2024 15:35:20 GMT
- Title: On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
- Authors: Agneet Chatterjee, Tejas Gokhale, Chitta Baral, Yezhou Yang
- Abstract summary: We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation.
Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions.
Despite leveraging additional data, these methods are not robust to directed adversarial attacks, and their performance declines as distribution shift increases.
- Score: 71.72465617754553
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results, the impact of the language prior, particularly in terms of generalization and robustness, remains unexplored. In this paper, we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric, three-dimensional spatial relationships, incorporate them as additional language priors and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and counter-intuitively fare worse with low-level descriptions. Despite leveraging additional data, these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift. Finally, to provide a foundation for future research, we identify points of failure and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation, our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings.
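The abstract describes generating "low-level" sentences that encode object-centric, three-dimensional spatial relationships. As a rough illustration of what such a sentence generator might look like, the following is a minimal sketch; the function name, input format (label, normalized horizontal center, mean depth in meters), and wording templates are all assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: turn two detected objects into a "low-level"
# object-centric spatial sentence. Inputs and templates are illustrative.

def low_level_sentence(obj_a, obj_b):
    """Describe the 3D spatial relation between two detected objects.

    Each object is a (label, center_x, mean_depth) tuple, where center_x is
    the normalized horizontal image coordinate in [0, 1] and mean_depth is
    the object's average depth in meters.
    """
    name_a, x_a, d_a = obj_a
    name_b, x_b, d_b = obj_b
    # Left/right relation from image-plane position.
    horiz = "to the left of" if x_a < x_b else "to the right of"
    # Closer/farther relation from mean depth.
    rel_depth = "closer to the camera than" if d_a < d_b else "farther from the camera than"
    return f"The {name_a} is {horiz} and {rel_depth} the {name_b}."

# Example: a chair detected left of, and nearer than, a table.
print(low_level_sentence(("chair", 0.25, 1.8), ("table", 0.60, 2.4)))
# -> The chair is to the left of and closer to the camera than the table.
```

Sentences of this form could then be fed to a language-guided depth estimator in place of, or alongside, scene-level captions, which is the comparison the paper's evaluation revolves around.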
Related papers
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT), and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z) - Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models [25.413601452403213]
Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks.
Their constrained semantic grounding ability hinders their application in real-world scenarios.
We propose a data-centric enhancement method that aims to improve LVLMs' semantic grounding ability.
arXiv Detail & Related papers (2023-09-07T22:59:56Z) - Guiding Computational Stance Detection with Expanded Stance Triangle Framework [25.2980607215715]
Stance detection determines whether the author of a piece of text is in favor of, against, or neutral towards a specified target.
We decompose the stance detection task from a linguistic perspective, and investigate key components and inference paths in this task.
arXiv Detail & Related papers (2023-05-31T13:33:29Z) - TAPE: Assessing Few-shot Russian Language Understanding [1.9859374437454114]
TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark that includes six more complex NLU tasks for Russian.
The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most.
We publicly release TAPE to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
arXiv Detail & Related papers (2022-10-23T18:28:25Z) - Sentence Representation Learning with Generative Objective rather than Contrastive Objective [86.01683892956144]
We propose a novel generative self-supervised learning objective based on phrase reconstruction.
Our generative learning achieves powerful enough performance improvement and outperforms the current state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-10-16T07:47:46Z) - Towards explainable evaluation of language models on the semantic similarity of visual concepts [0.0]
We examine the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies.
First, we address the need for explainable evaluation metrics, necessary for understanding the conceptual quality of retrieved instances.
Secondly, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and highlight patterns in learned linguistic representations.
arXiv Detail & Related papers (2022-09-08T11:40:57Z) - Understanding and Mitigating the Uncertainty in Zero-Shot Translation [66.49354195238789]
We aim to understand and alleviate the off-target issues from the perspective of uncertainty in zero-shot translation.
By carefully examining the translation output and model confidence, we identify two uncertainties that are responsible for the off-target issues.
We propose two lightweight and complementary approaches: denoising the training data during training, and masking out the vocabulary of the off-target languages at inference.
arXiv Detail & Related papers (2022-05-20T10:29:46Z) - A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation [53.8171136907856]
We introduce a set of simple yet effective data augmentation strategies dubbed cutoff.
cutoff relies on sampling consistency and thus adds little computational overhead.
cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
arXiv Detail & Related papers (2020-09-29T07:08:35Z) - Analysis and Evaluation of Language Models for Word Sense Disambiguation [18.001457030065712]
Transformer-based language models have taken many fields in NLP by storm.
BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense.
BERT and its derivatives dominate most of the existing evaluation benchmarks.
arXiv Detail & Related papers (2020-08-26T15:07:07Z) - On the uncertainty of self-supervised monocular depth estimation [52.13311094743952]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all.
We explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy.
We propose a novel technique specifically designed for self-supervised approaches.
arXiv Detail & Related papers (2020-05-13T09:00:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.