Related papers: Enhancing Landmark Detection in Cluttered Real-World Scenarios with Vision Transformers

URL: http://arxiv.org/abs/2308.13671v1
Date: Fri, 25 Aug 2023 21:01:01 GMT
Title: Enhancing Landmark Detection in Cluttered Real-World Scenarios with Vision Transformers
Authors: Mohammad Javad Rajabi, Morteza Mirzai, Ahmad Nickabadi
Abstract summary: This research contributes to the advancement of landmark detection in visual place recognition. It shows the potential of leveraging vision transformers to overcome challenges posed by cluttered real-world scenarios.
Score: 2.900522306460408
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Visual place recognition tasks often encounter significant challenges in landmark detection due to the presence of irrelevant objects such as humans, cars, and trees, despite the remarkable progress achieved by previous models, especially in the context of transformers. To address this issue, we propose a novel method that effectively leverages the strengths of vision transformers. By employing a meticulous selection process, our approach identifies and isolates specific patches within the image that correspond to occluding objects. To evaluate the efficacy of our method, we created augmented datasets and conducted comprehensive testing. The results demonstrate the superior accuracy achieved by our proposed approach. This research contributes to the advancement of landmark detection in visual place recognition and shows the potential of leveraging vision transformers to overcome challenges posed by cluttered real-world scenarios.

Related papers

Image Recognition with Online Lightweight Vision Transformer: A Survey [53.005965123414576]
This paper surveys various online strategies for generating lightweight vision transformers for image recognition. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark. We propose future research directions and potential challenges in the lightweighting of vision transformers.
arXiv Detail & Related papers (2025-05-06T02:07:54Z)
BOOTPLACE: Bootstrapped Object Placement with Detection Transformers [23.300369070771836]
We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning.
arXiv Detail & Related papers (2025-03-27T21:21:20Z)
LEAP:D - A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection [2.1233286062376497]
We introduce an innovative vision-language approach using learnable prompts. This shift from conventional manual prompts aims to reduce domain-specific knowledge interference. We streamline the training process with a one-step approach, updating the learnable prompt concurrently with model training.
arXiv Detail & Related papers (2024-11-14T04:39:10Z)
A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models. Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
arXiv Detail & Related papers (2024-08-01T07:40:00Z)
DeepFidelity: Perceptual Forgery Fidelity Assessment for Deepfake Detection [67.3143177137102]
Deepfake detection refers to detecting artificially generated or edited faces in images or videos. We propose a novel Deepfake detection framework named DeepFidelity to adaptively distinguish real and fake faces.
arXiv Detail & Related papers (2023-12-07T07:19:45Z)
Improved TokenPose with Sparsity [0.0]
We introduce sparsity in both keypoint token attention and visual token attention to improve human pose estimation. Experimental results on the MPII dataset demonstrate that our model achieves higher accuracy and confirm the feasibility of the method.
arXiv Detail & Related papers (2023-11-16T08:12:34Z)
Fusing Pseudo Labels with Weak Supervision for Dynamic Traffic Scenarios [0.0]
We introduce a weakly-supervised label unification pipeline that amalgamates pseudo labels from object detection models trained on heterogeneous datasets. Our pipeline engenders a unified label space through the amalgamation of labels from disparate datasets, rectifying bias and enhancing generalization. We retrain a solitary object detection model using the merged label space, culminating in a resilient model proficient in dynamic traffic scenarios.
arXiv Detail & Related papers (2023-08-30T11:33:07Z)
Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers. We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
Vision Transformers for Action Recognition: A Survey [41.69370782177517]
Vision transformers are emerging as a powerful tool to solve computer vision problems. Recent techniques have proven the efficacy of transformers beyond the image domain to solve numerous video-related tasks. Human action recognition is receiving special attention from the research community due to its widespread applications.
arXiv Detail & Related papers (2022-09-13T02:57:05Z)
Towards Accurate Facial Landmark Detection via Cascaded Transformers [14.74021483826222]
We propose an accurate facial landmark detector based on cascaded transformers. With self-attention in transformers, our model can inherently exploit the structured relationships between landmarks. During cascaded refinement, our model is able to extract the most relevant image features around the target landmark for coordinate prediction.
arXiv Detail & Related papers (2022-08-23T08:42:13Z)
Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks. We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
Detect and Locate: A Face Anti-Manipulation Approach with Semantic and Noise-level Supervision [67.73180660609844]
We propose a conceptually simple but effective method to efficiently detect forged faces in an image. The proposed scheme relies on a segmentation map that delivers meaningful high-level semantic information clues about the image. The proposed model achieves state-of-the-art detection accuracy and remarkable localization performance.
arXiv Detail & Related papers (2021-07-13T02:59:31Z)
Pretrained equivariant features improve unsupervised landmark discovery [69.02115180674885]
We formulate a two-step unsupervised approach that overcomes this challenge by first learning powerful pixel-based features. Our method produces state-of-the-art results in several challenging landmark detection datasets.
arXiv Detail & Related papers (2021-04-07T05:42:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.