Words aren't enough, their order matters: On the Robustness of Grounding
Visual Referring Expressions
- URL: http://arxiv.org/abs/2005.01655v1
- Date: Mon, 4 May 2020 17:09:15 GMT
- Title: Words aren't enough, their order matters: On the Robustness of Grounding
Visual Referring Expressions
- Authors: Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva
Reddy
- Abstract summary: We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
- Score: 87.33156149634392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual referring expression recognition is a challenging task that requires
natural language understanding in the context of an image. We critically
examine RefCOCOg, a standard benchmark for this task, using a human study and
show that 83.7% of test instances do not require reasoning on linguistic
structure, i.e., words are enough to identify the target object, the word order
doesn't matter. To measure the true progress of existing models, we split the
test set into two sets, one which requires reasoning on linguistic structure
and the other which doesn't. Additionally, we create an out-of-distribution
dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that
the target object changes. Using these datasets, we empirically show that
existing methods fail to exploit linguistic structure and are 12% to 23% lower
in performance than the established progress for this task. We also propose two
methods, one based on contrastive learning and the other based on multi-task
learning, to increase the robustness of ViLBERT, the current state-of-the-art
model for this task. Our datasets are publicly available at
https://github.com/aws/aws-refcocog-adv
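To make the contrastive-learning proposal in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the underlying idea: a grounding model should score the target region higher for the original referring expression than for a meaning-changing perturbation of it (as in Ref-Adv). The function name, batch layout, and margin value are assumptions for illustration only.

```python
# Hedged sketch of a hinge-style contrastive loss for referring expression
# grounding. Scores would come from a ViLBERT-style region-expression model;
# here they are plain tensors so the snippet runs standalone.
import torch
import torch.nn.functional as F


def contrastive_grounding_loss(score_original, score_perturbed, margin=0.2):
    """score_original:  (batch,) alignment score of the target region with the
                        original expression.
       score_perturbed: (batch,) alignment score of the same region with a
                        perturbed expression whose true referent has changed."""
    # Penalize cases where the perturbed expression scores nearly as high as
    # (or higher than) the original expression for the target region.
    return F.relu(margin - (score_original - score_perturbed)).mean()


if __name__ == "__main__":
    # Toy usage with random scores standing in for model outputs.
    s_orig = torch.randn(8)
    s_pert = torch.randn(8)
    print(contrastive_grounding_loss(s_orig, s_pert).item())
```

The multi-task variant mentioned in the abstract would instead add such a robustness objective alongside the standard grounding loss; the exact formulation is described in the paper itself.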
Related papers
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control [43.860799289234755]
We propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against magnitude feature dictionaries.
First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task.
We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets.
arXiv Detail & Related papers (2024-05-14T07:07:13Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding [35.01174511816063]
We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training.
Our method leverages an off-the-shelf object detector to identify visual objects from unlabeled images.
We develop a visual-language model equipped with a multi-level cross-modality attention mechanism.
arXiv Detail & Related papers (2022-03-16T09:17:41Z)
- Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding [101.24748444126982]
Decomposable tasks are complex and comprise a hierarchy of sub-tasks.
Existing benchmarks, however, typically hold out examples for only the surface-level sub-task.
We propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions.
arXiv Detail & Related papers (2021-06-29T02:53:59Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Multi-task Learning of Negation and Speculation for Targeted Sentiment Classification [15.85111852764517]
We show that targeted sentiment models are not robust to linguistic phenomena, specifically negation and speculation.
We propose a multi-task learning method to incorporate information from syntactic and semantic auxiliary tasks, including negation and speculation scope detection.
We create two challenge datasets to evaluate model performance on negated and speculative samples.
arXiv Detail & Related papers (2020-10-16T11:20:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.