Affine steerers for structured keypoint description
- URL: http://arxiv.org/abs/2408.14186v1
- Date: Mon, 26 Aug 2024 11:22:52 GMT
- Title: Affine steerers for structured keypoint description
- Authors: Georg Bökman, Johan Edstedt, Michael Felsberg, Fredrik Kahl
- Abstract summary: We propose a way to train deep learning based keypoint descriptors that makes them approximately equivariant for locally affine transformations of the image plane.
We demonstrate the potential of using this control for image matching.
- Score: 26.31402935889126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a way to train deep learning based keypoint descriptors that makes them approximately equivariant for locally affine transformations of the image plane. The main idea is to use the representation theory of GL(2) to generalize the recently introduced concept of steerers from rotations to affine transformations. Affine steerers give high control over how keypoint descriptions transform under image transformations. We demonstrate the potential of using this control for image matching. Finally, we propose a way to finetune keypoint descriptors with a set of steerers on upright images and obtain state-of-the-art results on several standard benchmarks. Code will be published at github.com/georg-bn/affine-steerers.
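The core idea can be illustrated with a toy example (ours, not the paper's learned network): a descriptor D is equivariant under an affine steerer if D(warp_A(I)) equals rho(A) @ D(I) for a linear map rho(A). Taking D to be the image gradient at the keypoint, the exact steerer under x -> Ax is rho(A) = inv(A).T, which is a GL(2) representation:

```python
import numpy as np

# Toy affine-steerer sketch: the descriptor here is the image gradient at
# the keypoint (origin), for which the exact steerer under x -> A x is
# rho(A) = inv(A).T, a GL(2) representation acting on description space.

def grad_at_origin(img, eps=1e-5):
    """Central-difference gradient of a continuous image function at (0, 0)."""
    gx = (img(np.array([eps, 0.0])) - img(np.array([-eps, 0.0]))) / (2 * eps)
    gy = (img(np.array([0.0, eps])) - img(np.array([0.0, -eps]))) / (2 * eps)
    return np.array([gx, gy])

I = lambda x: np.sin(1.3 * x[0] + 0.7 * x[1])        # smooth test image
A = np.array([[1.2, 0.3], [-0.1, 0.8]])              # affine map fixing the origin
I_warped = lambda x: I(np.linalg.inv(A) @ x)         # warped image I o A^{-1}

d_orig = grad_at_origin(I)
d_warp = grad_at_origin(I_warped)
steered = np.linalg.inv(A).T @ d_orig                # rho(A) @ D(I)

print(np.allclose(d_warp, steered, atol=1e-4))       # equivariance holds
```

The paper learns such a rho(A) jointly with a deep descriptor rather than deriving it analytically; the example only shows the equivariance relation the training enforces.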
Related papers
- GViT: Representing Images as Gaussians for Visual Recognition [54.46109876668194]
We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. We demonstrate that 2D Gaussian input representations, coupled with our GViT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT.
arXiv Detail & Related papers (2025-06-30T05:44:14Z) - Rethinking Decoders for Transformer-based Semantic Segmentation: Compression is All You Need [3.218600495900291]
We argue that there are fundamental connections between semantic segmentation and compression.
We derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT).
Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter.
arXiv Detail & Related papers (2024-11-05T12:10:02Z) - Continuous Piecewise-Affine Based Motion Model for Image Animation [45.55812811136834]
Image animation aims to bring static images to life according to driving videos.
Recent unsupervised methods utilize affine and thin-plate spline transformations based on keypoints to transfer the motion in driving frames to the source image.
We propose to model motion from the source image to the driving frame in highly expressive diffeomorphic spaces.
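The keypoint-based motion transfer that this line of work builds on can be sketched as a least-squares affine fit between matched keypoints (a minimal illustration of the prior approach the summary mentions, not this paper's diffeomorphic model):

```python
import numpy as np

# Fit the affine map x -> M x + t that carries matched source keypoints
# onto driving-frame keypoints, in the least-squares sense.

def fit_affine(src, dst):
    """Least-squares affine fit over matched 2D point sets."""
    src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coords
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    M, t = params[:2].T, params[2]
    return M, t

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
M_true = np.array([[1.1, 0.2], [-0.1, 0.9]])
t_true = np.array([0.5, -0.3])
dst = src @ M_true.T + t_true                          # synthetic exact matches

M, t = fit_affine(src, dst)
print(np.allclose(M, M_true), np.allclose(t, t_true))
```

The paper's contribution is to replace such piecewise-affine motion with continuous piecewise-affine velocity fields that integrate to diffeomorphisms.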
arXiv Detail & Related papers (2024-01-17T11:40:05Z) - Steerers: A framework for rotation equivariant keypoint descriptors [26.31402935889126]
Keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction.
We learn a linear transform in description space that encodes rotations of the input image.
We obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360.
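The steerer concept can be sketched as follows (our toy construction, not the learned model): descriptions live in a space where an in-plane rotation of the image acts as a fixed linear map S(theta). A block-diagonal stack of 2x2 rotation matrices is one valid SO(2) representation, and satisfies the group property that steering compositions match angle addition:

```python
import numpy as np

def steerer(theta, dim):
    """Block-diagonal 2x2 rotations acting on a dim-dimensional description."""
    assert dim % 2 == 0
    c, s = np.cos(theta), np.sin(theta)
    block = np.array([[c, -s], [s, c]])
    S = np.zeros((dim, dim))
    for i in range(0, dim, 2):
        S[i:i + 2, i:i + 2] = block
    return S

dim = 8
rng = np.random.default_rng(0)
d = rng.normal(size=dim)                    # description of one keypoint

# Group property: steering by 0.5 then 0.3 equals steering by 0.8.
lhs = steerer(0.3, dim) @ (steerer(0.5, dim) @ d)
rhs = steerer(0.8, dim) @ d
print(np.allclose(lhs, rhs))
```

In the paper the steerer is learned jointly with the descriptor, so descriptions of a rotated image can be matched by applying S(theta) instead of re-running the network.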
arXiv Detail & Related papers (2023-12-04T18:59:44Z) - Self-supervised Cross-view Representation Reconstruction for Change Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z) - Generalizable Person Re-Identification via Viewpoint Alignment and Fusion [74.30861504619851]
This work proposes to use a 3D dense pose estimation model and a texture mapping module to map pedestrian images to canonical view images.
Due to the imperfection of the texture mapping module, the canonical view images may lose the discriminative detail clues from the original images.
We show that our method can lead to superior performance over the existing approaches in various evaluation settings.
arXiv Detail & Related papers (2022-12-05T16:24:09Z) - Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z) - Self-Supervised Equivariant Learning for Oriented Keypoint Detection [35.94215211409985]
We introduce a self-supervised learning framework using rotation-equivariant CNNs to learn to detect robust oriented keypoints.
We propose a dense orientation alignment loss by an image pair generated by synthetic transformations for training a histogram-based orientation map.
Our method outperforms the previous methods on an image matching benchmark and a camera pose estimation benchmark.
arXiv Detail & Related papers (2022-04-19T02:26:07Z) - Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z) - Grounded Situation Recognition with Transformers [11.202435939275675]
Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image.
Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture.
arXiv Detail & Related papers (2021-11-19T10:10:03Z) - Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
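A minimal sketch of this simplified decoder (our reading of the summary, not the authors' code): standard cross-attention computes softmax(QK^T/sqrt(d)) V, while the simplified variant keeps only the raw query-key similarities and pools them into an image-to-image matching score:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Standard scaled dot-product attention, for contrast."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def simplified_match_score(Q, K):
    """Query-key similarity only: best match per query, pooled to a score."""
    sims = Q @ K.T / np.sqrt(Q.shape[-1])
    return sims.max(axis=1).mean()

rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 32))               # query features from image A
K_pos = Q + 0.1 * rng.normal(size=Q.shape)  # keys from a matching image
K_neg = rng.normal(size=(16, 32))           # keys from a non-matching image

ctx = full_attention(Q, K_pos, K_pos)       # standard decoder output, unused here
print(simplified_match_score(Q, K_pos) > simplified_match_score(Q, K_neg))
```

Dropping the softmax turns the decoder output into raw similarity evidence, which suits pairwise matching better than the weighted-average aggregation of full attention.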
arXiv Detail & Related papers (2021-05-30T05:38:33Z) - Coarse-to-Fine Gaze Redirection with Numerical and Pictorial Guidance [74.27389895574422]
We propose a novel gaze redirection framework which exploits both a numerical and a pictorial direction guidance.
The proposed method outperforms the state-of-the-art approaches in terms of both image quality and redirection precision.
arXiv Detail & Related papers (2020-04-07T01:17:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.