GIM: Learning Generalizable Image Matcher From Internet Videos
- URL: http://arxiv.org/abs/2402.11095v1
- Date: Fri, 16 Feb 2024 21:48:17 GMT
- Title: GIM: Learning Generalizable Image Matcher From Internet Videos
- Authors: Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li,
Kaixuan Wang, Xiaozhi Chen, Cheng Wang
- Abstract summary: We propose GIM, a self-training framework for learning a single generalizable model based on any image matching architecture.
We also propose ZEB, the first zero-shot evaluation benchmark for image matching.
- Score: 18.974842517202365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image matching is a fundamental computer vision problem. While learning-based
methods achieve state-of-the-art performance on existing benchmarks, they
generalize poorly to in-the-wild images. Such methods typically need to train
separate models for different scene types and are impractical when the scene
type is unknown in advance. One of the underlying problems is the limited
scalability of existing data construction pipelines, which limits the diversity
of standard image matching datasets. To address this problem, we propose GIM, a
self-training framework for learning a single generalizable model based on any
image matching architecture using internet videos, an abundant and diverse data
source. Given an architecture, GIM first trains it on standard domain-specific
datasets and then combines it with complementary matching methods to create
dense labels on nearby frames of novel videos. These labels are filtered by
robust fitting, and then enhanced by propagating them to distant frames. The
final model is trained on propagated data with strong augmentations. We also
propose ZEB, the first zero-shot evaluation benchmark for image matching. By
mixing data from diverse domains, ZEB can thoroughly assess the cross-domain
generalization performance of different methods. Applying GIM consistently
improves the zero-shot performance of 3 state-of-the-art image matching
architectures; with 50 hours of YouTube videos, the relative zero-shot
performance improves by 8.4%-18.1%. GIM also enables generalization to extreme
cross-domain data such as Bird's Eye View (BEV) images of projected 3D point
clouds (Fig. 1(c)). More importantly, our single zero-shot model consistently
outperforms domain-specific baselines when evaluated on downstream tasks
inherent to their respective domains. The video presentation is available at
https://www.youtube.com/watch?v=FU_MJLD8LeY.
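The self-training recipe summarized above lends itself to a short sketch. The code below is a hypothetical reading of the abstract, not the authors' implementation: `base_matcher`, `complementary_matchers`, `sample_frame_triplets`, `augment`, and `finetune_on_labels` are illustrative placeholders, and only the robust-fitting step uses a real library call (OpenCV's `findFundamentalMat` with RANSAC).

```python
# Minimal sketch of the GIM-style self-training loop, assuming hypothetical
# matcher/training callables; only the RANSAC filtering is a real OpenCV call.
import cv2
import numpy as np


def robust_filter(pts0, pts1, thresh=1.0, confidence=0.999):
    """Drop correspondences inconsistent with a single epipolar geometry (RANSAC)."""
    if len(pts0) < 8:
        return np.zeros(len(pts0), dtype=bool)
    _, mask = cv2.findFundamentalMat(
        np.float32(pts0), np.float32(pts1), cv2.FM_RANSAC, thresh, confidence)
    return np.zeros(len(pts0), dtype=bool) if mask is None else mask.ravel().astype(bool)


def chain(pts0, pts1_a, pts1_b, pts2, radius=2.0):
    """Propagate labels to a distant frame: join (f0 -> f1) matches with (f1 -> f2)
    matches whose keypoints in the shared middle frame lie within `radius` pixels."""
    if len(pts1_a) == 0 or len(pts1_b) == 0:
        return pts0[:0], pts2[:0]
    d = np.linalg.norm(pts1_a[:, None, :] - pts1_b[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    keep = d[np.arange(len(pts1_a)), nearest] < radius
    return pts0[keep], pts2[nearest[keep]]


def gim_self_training(videos, base_matcher, complementary_matchers,
                      sample_frame_triplets, augment, finetune_on_labels):
    labels = []
    for video in videos:
        for f0, f1, f2 in sample_frame_triplets(video):  # f0/f1 nearby, f2 distant
            # 1) Dense labels on nearby frames from the base + complementary matchers.
            pts0, pts1 = base_matcher(f0, f1)
            for matcher in complementary_matchers:
                q0, q1 = matcher(f0, f1)
                pts0, pts1 = np.vstack([pts0, q0]), np.vstack([pts1, q1])
            # 2) Robust fitting filters out inconsistent correspondences.
            inliers = robust_filter(pts0, pts1)
            pts0, pts1 = pts0[inliers], pts1[inliers]
            # 3) Propagate surviving labels to the distant frame f2 by chaining
            #    them with filtered f1 -> f2 matches.
            r1, r2 = base_matcher(f1, f2)
            keep = robust_filter(r1, r2)
            p0, p2 = chain(pts0, pts1, r1[keep], r2[keep])
            labels.append((f0, f2, p0, p2))
    # 4) Fine-tune the same architecture on propagated labels with strong augmentations.
    return finetune_on_labels(base_matcher, [augment(sample) for sample in labels])
```

The point the abstract stresses is that the labels fed back into training come from distant frames (wider baselines) than those the initial matcher can label reliably; the chaining step above is one simple way to approximate that propagation.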
Related papers
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimation of depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline [19.098970392639476]
Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking.
We find that even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods outperform Video-DAS methods on established Video-DAS benchmarks.
arXiv Detail & Related papers (2024-02-01T18:59:56Z)
- DG-TTA: Out-of-domain medical image segmentation through Domain Generalization and Test-Time Adaptation [43.842694540544194]
We propose to combine domain generalization and test-time adaptation to create a highly effective approach for reusing pre-trained models in unseen target domains.
We demonstrate that our method, combined with pre-trained whole-body CT models, can effectively segment MR images with high accuracy.
arXiv Detail & Related papers (2023-12-11T10:26:21Z)
- Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
- GeneCIS: A Benchmark for General Conditional Image Similarity [21.96493413291777]
We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically.
We propose the GeneCIS benchmark, which measures models' ability to adapt to a range of similarity conditions.
arXiv Detail & Related papers (2023-06-13T17:59:58Z)
- Learnable Graph Matching: A Practical Paradigm for Data Association [74.28753343714858]
We propose a general learnable graph matching method to address these issues.
Our method achieves state-of-the-art performance on several MOT datasets.
For image matching, our method outperforms state-of-the-art methods on a popular indoor dataset, ScanNet.
arXiv Detail & Related papers (2023-03-27T17:39:00Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images into the latent space of a high-quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
To better model the relationship among images and classes from different datasets, we extend the pixel-level embeddings via cross-dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
- Reconciliation of Statistical and Spatial Sparsity For Robust Image and Image-Set Classification [27.319334479994787]
We propose a novel Joint Statistical and Spatial Sparse representation, dubbed J3S, to model the image or image-set data for classification.
We propose to solve the joint sparse coding problem based on the J3S model, by coupling the local and global image representations using joint sparsity.
Experiments show that the proposed J3S-based image classification scheme outperforms popular and state-of-the-art competing methods on the FMD, UIUC, ETH-80 and YTC databases.
arXiv Detail & Related papers (2021-06-01T06:33:24Z)
- Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.