Building Category Graphs Representation with Spatial and Temporal
Attention for Visual Navigation
- URL: http://arxiv.org/abs/2312.03327v1
- Date: Wed, 6 Dec 2023 07:28:43 GMT
- Title: Building Category Graphs Representation with Spatial and Temporal
Attention for Visual Navigation
- Authors: Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv
- Abstract summary: Given an object of interest, visual navigation aims to reach the object's location based on a sequence of partial observations.
To this end, an agent needs to 1) learn knowledge about the relations of object categories in the world during training and 2) look for the target object based on the pre-learned object category relations and its moving trajectory in the current unseen environment.
We propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects that aid navigation.
- Score: 35.13932194789583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an object of interest, visual navigation aims to reach the object's
location based on a sequence of partial observations. To this end, an agent
needs to 1) learn knowledge about the relations of object categories in the
world during training and 2) look for the target object based
on the pre-learned object category relations and its moving trajectory in the
current unseen environment. In this paper, we propose a Category Relation Graph
(CRG) to learn the knowledge of object category layout relations and a
Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term
spatial-temporal dependencies of objects that aid navigation. We learn prior
knowledge of object layout, establishing a category relation graph to deduce
the positions of specific objects. We then introduce TSR to capture the
relationships of objects across time, space, and regions within the
observation trajectories. Specifically, we propose a Temporal attention
module (T) to model the temporal structure of the observation sequence, which
implicitly encodes the historical moving or trajectory information. Then, a
Spatial attention module (S) is used to uncover the spatial context of the
current observation objects based on the category relation graph and past
observations. Last, a Region attention module (R) shifts the attention to the
target-relevant region. Based on the visual representation extracted by our
method, the agent can better perceive the environment and learn a superior
navigation policy. Experiments on AI2-THOR demonstrate that our CRG-TSR method
significantly outperforms existing methods regarding both effectiveness and
efficiency. The code is included in the supplementary material and will be
made publicly available.
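The abstract outlines a concrete pipeline: graph-based category relations (CRG) followed by stacked Temporal, Spatial, and Region attention (TSR). As a rough sketch only, the PyTorch snippet below shows one plausible way to wire such a pipeline; every module name, tensor shape, and fusion choice here is an illustrative assumption, not the authors' implementation (their code is in the supplementary material).

```python
# Hedged sketch of a CRG + Temporal/Spatial/Region attention stack.
# Shapes, names, and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class CategoryRelationGraph(nn.Module):
    """One propagation step over a learnable category-relation matrix."""

    def __init__(self, num_categories: int, dim: int):
        super().__init__()
        # Learnable category-to-category relation weights (the "graph").
        self.adj = nn.Parameter(torch.randn(num_categories, num_categories))
        self.proj = nn.Linear(dim, dim)

    def forward(self, cat_feats: torch.Tensor) -> torch.Tensor:
        # cat_feats: (num_categories, dim) per-category feature vectors.
        relations = torch.softmax(self.adj, dim=-1)
        return torch.relu(self.proj(relations @ cat_feats))


class TSRAttention(nn.Module):
    """Temporal -> Spatial -> Region attention, as named in the abstract."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obs_seq, cur_objects, regions, target_emb):
        # obs_seq:     (B, T, D) pooled features of past observations
        # cur_objects: (B, N, D) object features of the current frame
        # regions:     (B, R, D) grid-region features of the current frame
        # target_emb:  (B, 1, D) embedding of the goal category
        # T: self-attention over the history encodes trajectory implicitly.
        hist, _ = self.temporal(obs_seq, obs_seq, obs_seq)
        # S: current objects attend to the encoded history for spatial context.
        ctx, _ = self.spatial(cur_objects, hist, hist)
        # R: a target-conditioned query attends over image regions.
        query = target_emb + ctx.mean(dim=1, keepdim=True)
        rep, _ = self.region(query, regions, regions)
        return rep.squeeze(1)  # (B, D) state vector for the policy


if __name__ == "__main__":
    B, T, N, R, C, D = 2, 8, 5, 49, 100, 64
    crg, tsr = CategoryRelationGraph(C, D), TSRAttention(D)
    # Relation-aware category features; a full model would fuse these
    # into cur_objects before the attention stack.
    cat_feats = crg(torch.randn(C, D))
    state = tsr(torch.randn(B, T, D), torch.randn(B, N, D),
                torch.randn(B, R, D), torch.randn(B, 1, D))
    print(state.shape)  # torch.Size([2, 64])
```

A navigation policy head (e.g., an actor-critic network) would consume the returned state vector; the demo at the bottom only checks shapes.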
Related papers
- STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking [13.269416985959404]
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision.
We propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT).
We use historical embedding features to model the representation of ReID and detection features in sequential order.
Our framework sets a new state-of-the-art performance in MOTA and IDF1 metrics.
arXiv Detail & Related papers (2024-09-17T14:34:18Z)
- Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation [11.372544701050044]
Vision-and-Language Navigation (VLN) is a challenging task where an agent is required to navigate to a natural language described location via vision observations.
The navigation abilities of the agent can be enhanced by the relations between objects, which are usually learned using internal objects or external datasets.
arXiv Detail & Related papers (2024-03-23T02:44:43Z)
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
- Tracking Objects and Activities with Attention for Temporal Sentence Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video.
We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal targets and the search space, and (B) a Temporal Sentence Tracker to track the multi-modal targets' behavior and predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z)
- Visual Navigation with Spatial Attention [26.888916048408895]
This work focuses on object goal visual navigation, aiming at finding the location of an object from a given class.
We propose to learn the agent's policy using a reinforcement learning algorithm.
Our key contribution is a novel attention probability model for visual navigation tasks.
arXiv Detail & Related papers (2021-04-20T07:39:52Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel Spatial-Temporal Correlation and Topology Learning framework (CTL) to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts the route step-by-step.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)