SHOP-VRB: A Visual Reasoning Benchmark for Object Perception
- URL: http://arxiv.org/abs/2004.02673v1
- Date: Mon, 6 Apr 2020 13:46:54 GMT
- Title: SHOP-VRB: A Visual Reasoning Benchmark for Object Perception
- Authors: Michal Nazarczuk and Krystian Mikolajczyk
- Abstract summary: We present an approach and a benchmark for visual reasoning in robotics applications.
We focus on inferring object properties from visual and text data.
We propose a reasoning system based on symbolic program execution.
- Score: 26.422761228628698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we present an approach and a benchmark for visual reasoning in
robotics applications, in particular small object grasping and manipulation.
The approach and benchmark are focused on inferring object properties from
visual and text data. The benchmark covers small household objects together
with their properties, functionality, natural language descriptions, and
question-answer pairs for visual reasoning queries, along with the
corresponding scene semantic representations. We also present a method for
generating synthetic data that allows the benchmark to be extended to other
objects or scenes, and we propose an evaluation protocol that is more
challenging than in
the existing datasets. We propose a reasoning system based on symbolic program
execution. A disentangled representation of the visual and textual inputs is
obtained and used to execute symbolic programs that represent a 'reasoning
process' of the algorithm. We perform a set of experiments on the proposed
benchmark and compare with the results of state-of-the-art methods. These
results expose shortcomings of existing benchmarks that may lead to
misleading conclusions about the actual performance of visual reasoning
systems.
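To make the notion of symbolic program execution concrete, here is a minimal sketch in which a small hand-written program (a filter followed by a query) is executed over a toy scene representation. All operation names, attributes, and objects in the sketch are illustrative assumptions, not the actual SHOP-VRB schema or the authors' implementation.

```python
# Minimal, illustrative sketch of symbolic program execution for visual reasoning.
# The scene schema, attribute names, and operations below are assumptions made for
# this example only; they are not the actual SHOP-VRB vocabulary or executor.

# A "scene" is a list of object records, e.g. the semantic scene representation
# a visual parser might produce.
scene = [
    {"name": "mug", "material": "ceramic", "color": "white"},
    {"name": "kettle", "material": "metal", "color": "silver"},
    {"name": "blender", "material": "plastic", "color": "black"},
]

# A "program" is a sequence of symbolic operations, e.g. derived from the question
# "What is the material of the black object?" by a text parser.
program = [
    ("filter", "color", "black"),
    ("query", "material"),
]

def execute(program, scene):
    """Run the symbolic operations step by step over the scene representation."""
    candidates = list(scene)  # objects still matching all filters applied so far
    for op, *args in program:
        if op == "filter":
            attribute, value = args
            candidates = [obj for obj in candidates if obj.get(attribute) == value]
        elif op == "query":
            (attribute,) = args
            if len(candidates) != 1:
                return "ambiguous"  # a real executor would handle this case explicitly
            return candidates[0][attribute]
        else:
            raise ValueError(f"unknown operation: {op}")
    return candidates

print(execute(program, scene))  # -> plastic
```

In the system described above, both the scene records and the program would be produced from the learned, disentangled visual and textual representations rather than written by hand.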
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z) - XAI Benchmark for Visual Explanation [15.687509357300847]
We develop a benchmark for visual explanation, consisting of eight datasets with human explanation annotations.
We devise a visual explanation pipeline that includes data loading, explanation generation, and method evaluation.
Our proposed benchmarks facilitate a fair evaluation and comparison of visual explanation methods.
arXiv Detail & Related papers (2023-10-12T17:26:16Z)
- Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason? [30.16956370267339]
We introduce a protocol to evaluate visual representations for the task of Visual Question Answering.
In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module.
We compare two types of visual representations, densely extracted local features and object-centric ones, against the performances of a perfect image representation using ground truth.
arXiv Detail & Related papers (2022-12-20T14:36:45Z)
- Doubly Right Object Recognition: A Why Prompt for Visual Rationales [28.408764714247837]
We investigate whether computer vision models can also provide correct rationales for their predictions.
We propose a 'doubly right' object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales.
arXiv Detail & Related papers (2022-12-12T19:25:45Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information-theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Referring Expression Comprehension: A Survey of Methods and Datasets [20.42495629501261]
Referring expression comprehension (REC) aims to localize a target object in an image described by a referring expression phrased in natural language.
We first examine the state of the art by comparing modern approaches to the problem.
We discuss modular architectures and graph-based models that interface with structured graph representation.
arXiv Detail & Related papers (2020-07-19T01:45:02Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.