Related papers: Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning

URL: http://arxiv.org/abs/2506.03110v1
Date: Tue, 03 Jun 2025 17:40:36 GMT
Title: Revisiting Continuity of Image Tokens for Cross-domain Few-shot Learning
Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li
Abstract summary: Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains. But it still faces challenges when applying it to downstream distant domains that have only scarce training data. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens.
Score: 19.199947811410123
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applying it to downstream distant domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Codes and models are available at https://github.com/shuaiyi308/ReCIT.
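
Illustrative sketch (not taken from the paper or the ReCIT repository): the abstract describes disrupting token continuity as making pixels no longer transition smoothly across patch borders. One simple way to produce that effect, assumed here purely for illustration, is to permute non-overlapping patches while leaving the pixels inside each patch untouched. The function name `shuffle_patches` and its parameters are hypothetical.

```python
import random


def shuffle_patches(image, patch_size, seed=0):
    """Return a copy of `image` (a list of pixel rows) with its patches permuted.

    Assumption for illustration only: patch shuffling is one possible way to
    break continuity across patch borders; it is not the authors' method.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size

    # Cut the image into non-overlapping patch_size x patch_size blocks (row-major).
    patches = [
        [row[gx * patch_size:(gx + 1) * patch_size]
         for row in image[gy * patch_size:(gy + 1) * patch_size]]
        for gy in range(grid_h)
        for gx in range(grid_w)
    ]

    # Permute patch positions; pixels inside each patch keep their layout.
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)

    # Paste the permuted patches back onto the grid.
    out = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        gy, gx = divmod(dst, grid_w)
        for dy, patch_row in enumerate(patches[src]):
            y = gy * patch_size + dy
            out[y][gx * patch_size:(gx + 1) * patch_size] = patch_row
    return out


# 4x4 toy "image" whose pixel value encodes its original position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled = shuffle_patches(image, patch_size=2, seed=0)
```

In this toy setup the small patterns inside each 2x2 patch survive unchanged, while any structure spanning several patches is broken, which matches the distinction the abstract draws between smaller and larger spatial patterns.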

Related papers

Random Registers for Cross-Domain Few-Shot Learning [19.199947811410123]
Cross-domain few-shot learning aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. We find that during the source-domain training, prompt tuning, as a common way to train ViT, could be harmful for the generalization of ViT in target domains. We propose a simple but effective approach for CDFSL by adding random registers on the semantic regions of image tokens.
arXiv Detail & Related papers (2025-06-03T13:13:58Z)
Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation [0.0]
Cross-domain few-shot segmentation (CD-FSS) has emerged. We show test-time task-adaption is the key for successful CD-FSS. Despite our self-restriction not to use any images other than the few labeled samples at test time, we achieve new state-of-the-art performance in CD-FSS.
arXiv Detail & Related papers (2024-02-27T15:43:53Z)
Multi-cropping Contrastive Learning and Domain Consistency for Unsupervised Image-to-Image Translation [5.562419999563734]
We propose a novel unsupervised image-to-image translation framework based on multi-cropping contrastive learning and domain consistency, called MCDUT. In many image-to-image translation tasks, our method achieves state-of-the-art results, and the advantages of our method have been proven through comparison experiments and ablation research.
arXiv Detail & Related papers (2023-04-24T16:20:28Z)
Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration. ART presents both dense and sparse attention modules in the network. We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains [80.11169390071869]
Adversarial examples have posed a severe threat to deep neural networks due to their transferable nature. We propose a Beyond ImageNet Attack (BIA) to investigate the transferability towards black-box domains. Our methods outperform state-of-the-art approaches by up to 7.71% (towards coarse-grained domains) and 25.91% (towards fine-grained domains) on average.
arXiv Detail & Related papers (2022-01-27T14:04:27Z)
Leveraging Local Domains for Image-to-Image Translation [11.03611991082568]
Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. We leverage human knowledge about spatial domain characteristics, which we refer to as 'local domains'. We train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to target.
arXiv Detail & Related papers (2021-09-09T17:59:52Z)
TFill: Image Completion via a Transformer-Based Architecture [69.62228639870114]
We propose treating image completion as a directionless sequence-to-sequence prediction task. We employ a restrictive CNN with small and non-overlapping RF for token representation. In a second phase, to improve appearance consistency between visible and generated regions, a novel attention-aware layer (AAL) is introduced.
arXiv Detail & Related papers (2021-04-02T01:42:01Z)
Image-to-image Mapping with Many Domains by Sparse Attribute Transfer [71.28847881318013]
Unsupervised image-to-image translation consists of learning a pair of mappings between two domains without known pairwise correspondences between points. Current convention is to approach this task with cycle-consistent GANs. We propose an alternate approach that directly restricts the generator to performing a simple sparse transformation in a latent layer.
arXiv Detail & Related papers (2020-06-23T19:52:23Z)
Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields. To better train this efficient generator, except for frequently-used VGG feature matching loss, we design a novel self-guided regression loss. We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
CrDoCo: Pixel-level Domain Transfer with Cross-Domain Consistency [119.45667331836583]
Unsupervised domain adaptation algorithms aim to transfer the knowledge learned from one domain to another. We present a novel pixel-wise adversarial domain adaptation algorithm.
arXiv Detail & Related papers (2020-01-09T19:00:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.