Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
- URL: http://arxiv.org/abs/2003.00403v1
- Date: Sun, 1 Mar 2020 04:59:38 GMT
- Title: Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
- Authors: Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu
- Abstract summary: Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression.
Some popular referring expression datasets fail to provide an ideal test bed for evaluating the reasoning ability of the models.
We propose a new dataset for visual reasoning in the context of referring expression comprehension with two main features.
- Score: 39.40351938417889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring expression comprehension (REF) aims at identifying a particular
object in a scene by a natural language expression. It requires joint reasoning
over the textual and visual domains to solve the problem. Some popular
referring expression datasets, however, fail to provide an ideal test bed for
evaluating the reasoning ability of the models, mainly because 1) their
expressions typically describe only some simple distinctive properties of the
object and 2) their images contain limited distracting information. To bridge
the gap, we propose a new dataset for visual reasoning in the context of referring
expression comprehension with two main features. First, we design a novel
expression engine rendering various reasoning logics that can be flexibly
combined with rich visual properties to generate expressions with varying
compositionality. Second, to better exploit the full reasoning chain embodied
in an expression, we propose a new test setting by adding additional
distracting images containing objects sharing similar properties with the
referent, thus minimising the success rate of reasoning-free cross-domain
alignment. We evaluate several state-of-the-art REF models, but find none of
them can achieve promising performance. A proposed modular hard mining strategy
performs the best but still leaves substantial room for improvement. We hope
this new dataset and task can serve as a benchmark for deeper visual reasoning
analysis and foster the research on referring expression comprehension.
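The abstract's second feature is a distractor-augmented test setting: a model must pick the referent not only among regions of the true image but also among regions from distracting images whose objects share similar properties. As a rough illustration only (not the authors' code; all class and function names below are hypothetical), such a protocol could be sketched in Python as:

```python
# Minimal sketch of a distractor-augmented REC evaluation, assuming region
# proposals and an expression-region scoring function are available.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Region:
    image_id: str
    box: tuple                    # (x1, y1, x2, y2)
    features: Sequence[float]


def build_candidate_pool(
    target_regions: List[Region],
    distractor_regions: List[List[Region]],
) -> List[Region]:
    """Merge proposals from the target image with proposals from distracting
    images that share properties (category, attributes, relations) with the
    referent, so reasoning-free image-text matching alone cannot succeed."""
    pool = list(target_regions)
    for regions in distractor_regions:
        pool.extend(regions)
    return pool


def evaluate_expression(
    expression: str,
    gt_region: Region,
    candidate_pool: List[Region],
    score_fn: Callable[[str, Region], float],
) -> bool:
    """A prediction counts as correct only if the top-scoring region over the
    whole pool (target + distractor images) is the ground-truth referent."""
    best = max(candidate_pool, key=lambda r: score_fn(expression, r))
    return best is gt_region
```

Under this setup, accuracy is simply the fraction of expressions for which `evaluate_expression` returns True; the harder the distractor images, the more the metric rewards genuine compositional reasoning.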
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- Adversarial Testing for Visual Grounding via Image-Aware Property Reduction [12.745111000109178]
PEELING is a text perturbation approach via image-aware property reduction for adversarial testing of Visual Grounding models.
It achieves a MultiModal Impact score (MMI) of 21.4% and outperforms state-of-the-art baselines for images and texts by 8.2%-15.1%.
arXiv Detail & Related papers (2024-03-02T08:03:42Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth.
arXiv Detail & Related papers (2022-12-20T14:36:45Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representations.
arXiv Detail & Related papers (2020-07-19T01:45:02Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
- Graph-Structured Referring Expression Reasoning in The Wild [105.95488002374158]
Grounding referring expressions aims to locate in an image an object referred to by a natural language expression.
We propose a scene graph guided modular network (SGMN) to perform reasoning over a semantic graph and a scene graph.
We also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning.
arXiv Detail & Related papers (2020-04-19T11:00:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.