DistFormer: Enhancing Local and Global Features for Monocular Per-Object
Distance Estimation
- URL: http://arxiv.org/abs/2401.03191v1
- Date: Sat, 6 Jan 2024 10:56:36 GMT
- Title: DistFormer: Enhancing Local and Global Features for Monocular Per-Object
Distance Estimation
- Authors: Aniello Panariello and Gianluca Mancusi and Fedy Haj Ali and Angelo
Porrello and Simone Calderara and Rita Cucchiara
- Abstract summary: Per-object distance estimation is crucial in safety-critical applications such as autonomous driving, surveillance, and robotics.
Existing approaches rely on one of two scales: local information (i.e., the bounding box proportions) or global information.
Our work aims to strengthen both local and global cues.
- Score: 35.6022448037063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate per-object distance estimation is crucial in safety-critical
applications such as autonomous driving, surveillance, and robotics. Existing
approaches rely on one of two scales: local information (i.e., the bounding box
proportions) or global information, which encodes the semantics of the scene as
well as the spatial relations with neighboring objects. However, these
approaches may struggle with long-range objects and in the presence of strong
occlusions or unusual visual patterns. In this respect, our work aims to
strengthen both local and global cues. Our architecture -- named DistFormer --
builds upon three major components acting jointly: i) a robust context encoder
extracting fine-grained per-object representations; ii) a masked
encoder-decoder module exploiting self-supervision to promote the learning of
useful per-object features; iii) a global refinement module that aggregates
object representations and computes a joint, spatially-consistent estimation.
To evaluate the effectiveness of DistFormer, we conduct experiments on the
standard KITTI dataset and the large-scale NuScenes and MOTSynth datasets. Such
datasets cover various indoor/outdoor environments, changing weather
conditions, appearances, and camera viewpoints. Our comprehensive analysis
shows that DistFormer outperforms existing methods. Moreover, we delve into
its generalization capabilities, showing its regularization benefits in
zero-shot synth-to-real transfer.
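To make the three components concrete, here is a minimal PyTorch sketch of the layout the abstract describes. It is our illustration, not the authors' code: the stand-in backbone, crop size, masking ratio, and module dimensions are all assumptions.
```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class DistFormerSketch(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        # i) context encoder: a stand-in for a real backbone; per-object
        #    features are pooled from its feature map with RoI-Align
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # ii) masked encoder-decoder: MAE-style reconstruction of masked
        #     tokens of each object crop (self-supervised auxiliary task)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mae_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mae_decoder = nn.Linear(dim, dim)
        # iii) global refinement: self-attention across all objects in the
        #      scene, so distances are predicted jointly and consistently
        self.global_refine = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, image, boxes, mask_ratio=0.5):
        # image: (1, 3, H, W); boxes: (N, 4) as (x1, y1, x2, y2) in pixels
        fmap = self.backbone(image)                              # (1, D, h, w)
        scale = fmap.shape[-1] / image.shape[-1]
        crops = roi_align(fmap, [boxes], output_size=7, spatial_scale=scale)
        tokens = crops.flatten(2).transpose(1, 2)                # (N, 49, D)

        # ii) replace a random fraction of tokens with the learned mask token
        keep = torch.rand(tokens.shape[:2], device=tokens.device) > mask_ratio
        masked = torch.where(keep.unsqueeze(-1), tokens,
                             self.mask_token.expand_as(tokens))
        encoded = self.mae_encoder(masked)
        recon = self.mae_decoder(encoded)
        recon_loss = ((recon - tokens.detach())[~keep] ** 2).mean()

        # iii) one pooled token per object, refined jointly over the scene
        objects = encoded.mean(dim=1).unsqueeze(0)               # (1, N, D)
        distance = self.head(self.global_refine(objects))       # (1, N, 1)
        return distance.squeeze(-1), recon_loss
```
Under this reading, the reconstruction term would serve as the self-supervised auxiliary loss trained alongside the per-object distance regression.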
Related papers
- Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images [24.06927394483275]
We propose a stronger multifaceted collaborative salient object detector for optical remote sensing images (ORSIs), termed LBA-MCNet.
The network focuses on accurately locating targets, balancing detailed features, and modeling image-level global context information.
arXiv Detail & Related papers (2024-10-31T14:50:48Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising because it can find objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
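To illustrate the PCA step, here is a minimal sketch of our own (with assumed feature shapes; the paper's semantic-guided extractor and post-processing are not shown) that projects patch features onto their first principal component and thresholds the result into a coarse object mask:
```python
import numpy as np

def pca_localize(features):
    """features: (h, w, d) patch features from a self-supervised backbone."""
    h, w, d = features.shape
    x = features.reshape(-1, d)
    x = x - x.mean(axis=0)                       # center before PCA
    # first principal component via SVD of the centered feature matrix
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = (x @ vt[0]).reshape(h, w)             # 1D projection per patch
    fg = proj > proj.mean()                      # threshold -> coarse mask
    if fg.mean() > 0.5:                          # resolve PCA sign ambiguity:
        fg = ~fg                                 # foreground is the smaller side
    ys, xs = np.nonzero(fg)
    if len(xs) == 0:
        return fg, None
    return fg, (xs.min(), ys.min(), xs.max(), ys.max())  # tight box on the mask
```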
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Persistent Homology Meets Object Unity: Object Recognition in Clutter [2.356908851188234]
Recognition of occluded objects in unseen and unstructured indoor environments is a challenging problem for mobile robots.
We propose a new descriptor, TOPS, for point clouds generated from depth images and an accompanying recognition framework, THOR, inspired by human reasoning.
THOR outperforms state-of-the-art methods on both datasets and achieves substantially higher recognition accuracy for all scenarios of the UW-IS Occluded dataset.
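As a hedged sketch of what a point-cloud topological descriptor can look like, the snippet below uses the scikit-tda `ripser` package; the top-k persistence summary is an illustrative choice of ours, not the paper's TOPS construction:
```python
import numpy as np
from ripser import ripser

def topological_descriptor(points, maxdim=1, k=8):
    """points: (n, 3) point cloud, e.g. back-projected from a depth image."""
    dgms = ripser(points, maxdim=maxdim)['dgms']   # persistence diagrams
    feats = []
    for dgm in dgms:                               # one diagram per homology dim
        life = dgm[:, 1] - dgm[:, 0]               # persistence = death - birth
        life = life[np.isfinite(life)]             # drop the infinite H0 bar
        top = np.sort(life)[::-1][:k]              # k most persistent features
        feats.append(np.pad(top, (0, k - len(top))))
    return np.concatenate(feats)                   # fixed-length descriptor
```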
arXiv Detail & Related papers (2023-05-05T19:42:39Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
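A minimal sketch of what such relation modules can look like, using standard self- and cross-attention; the dimensions and the use of `nn.MultiheadAttention` are our assumptions, not the paper's implementation:
```python
import torch
import torch.nn as nn

class RelationRefiner(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, curr, past):
        # curr: (1, N, D) per-object features of the current frame
        # past: (1, M, D) features of the same objects in earlier frames
        curr = curr + self.intra(curr, curr, curr)[0]  # intra-frame relations
        curr = curr + self.inter(curr, past, past)[0]  # inter-frame relations
        return curr                                    # refined object features
```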
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
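A small sketch of the idea, assuming fully-connected layers and a magnitude-based reconstruction loss (both illustrative choices, not the paper's architecture): activations are complex-valued, a nonlinearity squashes magnitudes while preserving phases, and phases can then be inspected to group pixels into objects.
```python
import torch
import torch.nn as nn

class ComplexLinear(nn.Module):
    """Complex affine map (A + iB)(x_r + i x_i), built from two real layers."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.re = nn.Linear(n_in, n_out)
        self.im = nn.Linear(n_in, n_out)

    def forward(self, z):
        return torch.complex(self.re(z.real) - self.im(z.imag),
                             self.re(z.imag) + self.im(z.real))

class ComplexAE(nn.Module):
    def __init__(self, n_in=64 * 64, hidden=128):
        super().__init__()
        self.enc = ComplexLinear(n_in, hidden)
        self.dec = ComplexLinear(hidden, n_in)

    def forward(self, x):
        z = self.enc(x.flatten(1).to(torch.cfloat))   # lift real input, zero phase
        z = torch.tanh(z.abs()) * torch.exp(1j * z.angle())  # squash magnitude, keep phase
        out = self.dec(z)
        # magnitudes reconstruct the image; phases group pixels into objects
        return out.abs(), out.angle()

model = ComplexAE()
recon, phase = model(torch.rand(2, 64, 64))
loss = ((recon - torch.rand(2, 64, 64).flatten(1)) ** 2).mean()
```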
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
- Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
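The "computes their 3D coordinates once they are located" step can be illustrated with standard pinhole back-projection; the intrinsics below are hypothetical, and the paper's exploration and memory machinery is not shown.
```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth (meters) -> 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# e.g., a target detected at pixel (400, 260), 1.8 m away,
# with hypothetical intrinsics fx = fy = 600, cx = 320, cy = 240
point = backproject(400, 260, 1.8, 600, 600, 320, 240)
```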
arXiv Detail & Related papers (2022-03-15T17:59:01Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)