Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding
- URL: http://arxiv.org/abs/2003.08717v3
- Date: Wed, 26 May 2021 12:08:13 GMT
- Title: Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding
- Authors: Thierry Deruyttere, Guillem Collell, Marie-Francine Moens
- Abstract summary: We propose a new spatial memory module and a spatial reasoner for the Visual Grounding (VG) task.
The goal of this task is to find a certain object in an image based on a given textual query.
Our work focuses on integrating the regions of a Region Proposal Network (RPN) into a new multi-step reasoning model.
- Score: 19.48363193759392
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a new spatial memory module and a spatial reasoner for the Visual
Grounding (VG) task. The goal of this task is to find a certain object in an
image based on a given textual query. Our work focuses on integrating the
regions of a Region Proposal Network (RPN) into a new multi-step reasoning
model which we have named a Multimodal Spatial Region Reasoner (MSRR). The
introduced model uses the object regions from an RPN as initialization of a 2D
spatial memory and then implements a multi-step reasoning process scoring each
region according to the query, which is why we call it a multimodal reasoner. We
evaluate this new model on challenging datasets and our experiments show that
our model, which jointly reasons over the object regions of the image and the
words of the query, largely improves accuracy compared to current state-of-the-art
models.
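The multi-step scoring loop described in the abstract can be pictured with a short sketch. The attention and memory-update rules below are illustrative assumptions based only on this abstract, not the authors' actual MSRR implementation.

```python
# Hedged sketch of multi-step region scoring against a query; the update
# and blending rules are assumptions, not the MSRR model itself.
import numpy as np

def msrr_score(region_feats, query_word_feats, num_steps=3):
    """Iteratively score RPN regions against query words.

    region_feats:     (R, D) features of R proposed regions.
    query_word_feats: (W, D) features of W query words.
    Returns a (R,) score vector after num_steps reasoning steps.
    """
    memory = region_feats.copy()   # spatial memory initialized from RPN regions
    scores = np.zeros(len(region_feats))
    for _ in range(num_steps):
        # Attend each region slot to the query words (dot-product attention).
        attn = memory @ query_word_feats.T                     # (R, W)
        attn = np.exp(attn - attn.max(axis=1, keepdims=True))  # softmax
        attn /= attn.sum(axis=1, keepdims=True)
        query_ctx = attn @ query_word_feats                    # (R, D)
        # Blend the query context into memory and re-score the regions.
        memory = 0.5 * memory + 0.5 * query_ctx
        scores = (memory * query_ctx).sum(axis=1)
    return scores

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 16))  # 5 candidate regions
words = rng.normal(size=(4, 16))    # 4 query words
print(msrr_score(regions, words).argmax())  # index of best-matching region
```

Each step re-reads the query in light of the updated memory, which is the sense in which the scoring is both multimodal and multi-step.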
Related papers
- ReMI: A Dataset for Reasoning with Multiple Images [41.954830849939526]
We introduce ReMI, a dataset designed to assess large language models' ability to Reason with Multiple Images.
This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning.
We have benchmarked several cutting-edge LLMs and found a substantial gap between their performance and human-level proficiency.
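As a rough illustration of the benchmarking setup described here, a minimal multi-image evaluation loop might look as follows; the `model.answer` interface and example schema are hypothetical placeholders, not the actual ReMI API.

```python
# Hedged sketch of a multi-image benchmark loop; the model interface and
# example format are hypothetical, not ReMI's actual API.
def evaluate(model, examples):
    """examples: list of dicts with 'images', 'question', and 'answer'."""
    correct = 0
    for ex in examples:
        pred = model.answer(images=ex["images"], question=ex["question"])
        correct += int(pred.strip().lower() == ex["answer"].strip().lower())
    return correct / len(examples)
```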
arXiv Detail & Related papers (2024-06-13T14:37:04Z)
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution [54.05367433562495]
Region-level multi-modality methods can translate referred image regions into human-preferred language descriptions.
Unfortunately, most existing methods rely on fixed visual inputs and therefore lack the resolution adaptability needed to produce precise language descriptions.
We propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring.
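One way to picture the dynamic-resolution idea is to sample several views of the referred region at different context scales; the sampling scheme below is an assumption for illustration, not DynRefer's actual procedure.

```python
# Illustrative multi-view region sampling; the scales and output size are
# assumed values, not DynRefer's configuration.
from PIL import Image  # pillow; `image` is expected to be a PIL Image

def region_views(image, box, scales=(1.0, 1.5, 2.5), out_size=224):
    """Crop the referred box at several context scales and resize each
    crop to a fixed model input resolution."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    views = []
    for s in scales:
        hw, hh = s * w / 2, s * h / 2  # half-extent at this context scale
        crop = image.crop((
            int(max(0, cx - hw)), int(max(0, cy - hh)),
            int(min(image.width, cx + hw)), int(min(image.height, cy + hh)),
        ))
        views.append(crop.resize((out_size, out_size)))
    return views
```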
arXiv Detail & Related papers (2024-05-25T05:44:55Z)
- Few-shot Object Localization [37.347898735345574]
This paper defines a novel task named Few-Shot Object Localization (FSOL).
It aims to achieve precise localization with limited samples.
This task achieves generalized object localization by leveraging a small number of labeled support samples to query the positional information of objects within corresponding images.
Experimental results demonstrate a significant performance improvement of our approach in the FSOL task, establishing an efficient benchmark for further research.
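The support-query mechanism can be sketched as simple feature correlation: match a support exemplar's feature against the query image's feature map and take the peak response as the predicted location. This correlation scheme is an illustrative assumption, not the paper's exact model.

```python
# Hedged sketch of few-shot localization via feature correlation; the
# matching rule is an assumption, not the FSOL paper's architecture.
import numpy as np

def localize(query_map, support_vec):
    """query_map: (H, W, D) feature map; support_vec: (D,) exemplar feature.
    Returns (row, col) of the highest-similarity location."""
    sim = query_map @ support_vec                     # (H, W) similarity
    return np.unravel_index(np.argmax(sim), sim.shape)

rng = np.random.default_rng(1)
fmap = rng.normal(size=(8, 8, 32))
exemplar = fmap[3, 5] + 0.1 * rng.normal(size=32)    # noisy copy of one cell
print(localize(fmap, exemplar))                      # likely (3, 5)
```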
arXiv Detail & Related papers (2024-03-19T05:50:48Z)
- Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms tend to make decision errors, primarily due to a lack of visual common sense and insufficient reasoning capabilities.
This paper proposes a Hierarchical Spatial Proximity Reasoning (HSPR) model to address this issue.
We conduct experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R to validate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-03-18T07:51:22Z)
- ChatterBox: Multi-round Multimodal Referring and Grounding [108.9673313949746]
We present a new benchmark and an efficient vision-language model for multi-round multimodal referring and grounding (MRG).
The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks.
Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively.
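The summary only names a two-branch architecture; a generic sketch of such a design, with a language branch attending to a vision branch, might look as follows. It is purely illustrative, not ChatterBox's actual model.

```python
# Hedged sketch of a generic two-branch vision-language design; the
# layer choices are assumptions, not ChatterBox's architecture.
import torch
import torch.nn as nn

class TwoBranch(nn.Module):
    def __init__(self, v_dim=256, l_dim=256, hidden=256):
        super().__init__()
        self.vision = nn.Linear(v_dim, hidden)    # stand-in vision branch
        self.language = nn.Linear(l_dim, hidden)  # stand-in language branch
        self.fuse = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, v_feats, l_feats):
        v = self.vision(v_feats)                  # (B, Nv, H)
        l = self.language(l_feats)                # (B, Nl, H)
        fused, _ = self.fuse(query=l, key=v, value=v)
        return fused                              # language tokens grounded in vision

model = TwoBranch()
out = model(torch.randn(2, 49, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 12, 256])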
arXiv Detail & Related papers (2024-01-24T09:02:00Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
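Region inputs of this kind are often passed to a VLM as normalized coordinates spliced into the prompt; the token format below is an assumption for illustration, not GeoChat's actual input encoding.

```python
# Hypothetical region-in-prompt encoding; the <region> tag format is an
# assumption, not GeoChat's real interface.
def region_prompt(question, box, img_w, img_h):
    x0, y0, x1, y1 = box
    norm = [round(x0 / img_w, 3), round(y0 / img_h, 3),
            round(x1 / img_w, 3), round(y1 / img_h, 3)]
    return f"{question} <region>{norm}</region>"

print(region_prompt("What is in this area?", (120, 40, 380, 300), 512, 512))
```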
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
- Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers deeper insight into the LiDAR-based grounding task, and we expect it to present a promising direction for the autonomous driving community.
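A minimal sketch of predicting the targeted region directly from a language-conditioned detector: score each detected box by its similarity to the query embedding, blended with the detector's own confidence. The fusion rule and alpha weight are assumptions, not the paper's model.

```python
# Hedged sketch of language-conditioned box ranking; the blending rule
# is an illustrative assumption, not the LiDAR Grounding method.
import numpy as np

def ground(box_feats, box_scores, query_emb, alpha=0.5):
    """box_feats: (N, D) per-box features; box_scores: (N,) detector
    confidences; query_emb: (D,) query embedding. Returns the index of
    the box that best matches the query."""
    sim = box_feats @ query_emb
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)  # to [0, 1]
    return int(np.argmax(alpha * sim + (1 - alpha) * box_scores))
```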
arXiv Detail & Related papers (2023-05-25T06:22:10Z)
- DQnet: Cross-Model Detail Querying for Camouflaged Object Detection [54.82390534024954]
A convolutional neural network (CNN) for camouflaged object detection tends to activate local discriminative regions while ignoring complete object extent.
In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs.
In order to obtain feature maps that could activate full object extent, a novel framework termed Cross-Model Detail Querying network (DQnet) is proposed.
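One common way to realize cross-model querying is cross-attention from the CNN's tokens to a second backbone's tokens; the sketch below follows that pattern and is an assumption, not necessarily DQnet's exact mechanism.

```python
# Hedged sketch of cross-model detail querying via cross-attention;
# layer choices are assumptions, not DQnet's implementation.
import torch
import torch.nn as nn

class DetailQuery(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cnn_tokens, other_tokens):
        # cnn_tokens: (B, Nc, D) flattened CNN feature map;
        # other_tokens: (B, No, D) tokens from a second model (e.g. a ViT).
        queried, _ = self.attn(cnn_tokens, other_tokens, other_tokens)
        return self.norm(cnn_tokens + queried)  # residual detail injection

dq = DetailQuery()
print(dq(torch.randn(2, 196, 256), torch.randn(2, 197, 256)).shape)
```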
arXiv Detail & Related papers (2022-12-16T06:23:58Z)
- RLM-Tracking: Online Multi-Pedestrian Tracking Supported by Relative Location Mapping [5.9669075749248774]
Multi-object tracking is a fundamental computer vision problem, widely applied in public safety, transport, autonomous vehicles, robotics, and other domains involving artificial intelligence.
In this paper, we design a new multi-object tracker for the above issues that contains an object Relative Location Mapping (RLM) model and a Target Region Density (TRD) model.
The new tracker is more sensitive to the differences in position relationships between objects.
It can introduce low-score detection frames into different regions in real time according to the density of objects.
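The density-aware handling of low-score detections might be read as a relaxed confidence threshold in crowded regions, sketched below; the rule and its parameters are assumptions based on this summary, not the paper's exact TRD model.

```python
# Illustrative density-aware recovery of low-score detections; thresholds
# and the neighbor rule are assumed values, not the RLM-Tracking method.
import numpy as np

def recover(low_boxes, low_scores, track_centers, radius=50.0,
            base_thr=0.5, dense_thr=0.3, min_neighbors=3):
    """low_boxes: iterable of (x0, y0, x1, y1); low_scores: matching
    confidences; track_centers: (M, 2) array of confident track centers.
    Returns indices of low-score boxes to keep."""
    keep = []
    for i, (box, s) in enumerate(zip(low_boxes, low_scores)):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        d = np.hypot(track_centers[:, 0] - cx, track_centers[:, 1] - cy)
        dense = (d < radius).sum() >= min_neighbors
        if s >= (dense_thr if dense else base_thr):
            keep.append(i)
    return keep
```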
arXiv Detail & Related papers (2022-10-19T11:37:14Z)
- Scale-Localized Abstract Reasoning [79.00011351374869]
We consider the abstract relational reasoning task, which is commonly used as an intelligence test.
Since some patterns have spatial rationales, while others are only semantic, we propose a multi-scale architecture that processes each query in multiple resolutions.
We show that indeed different rules are solved by different resolutions and a combined multi-scale approach outperforms the existing state of the art in this task on all benchmarks by 5-54%.
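A minimal sketch of processing each query at multiple resolutions and combining per-scale scores appears below; the pooling and head design are assumptions, not the paper's architecture.

```python
# Hedged sketch of a multi-scale scorer; resolutions and heads are
# illustrative assumptions, not the Scale-Localized model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleScorer(nn.Module):
    def __init__(self, scales=(80, 40, 20)):
        super().__init__()
        self.scales = scales
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Flatten(), nn.Linear(s * s, 1)) for s in scales
        )

    def forward(self, panel):                      # panel: (B, 1, H, W)
        scores = [
            head(F.interpolate(panel, size=(s, s), mode="bilinear",
                               align_corners=False))
            for s, head in zip(self.scales, self.heads)
        ]
        return torch.stack(scores, dim=0).mean(0)  # (B, 1) combined score

m = MultiScaleScorer()
print(m(torch.randn(2, 1, 160, 160)).shape)  # torch.Size([2, 1])
```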
arXiv Detail & Related papers (2020-09-20T10:37:29Z)