Related papers: Learning Rich Nearest Neighbor Representations from Self-supervised Ensembles

Self-Evaluation Unlocks Any-Step Text-to-Image Generation [65.7088507945307]
We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism. Experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps.
arXiv Detail & Related papers (2025-12-26T20:42:11Z)
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z)
GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation [51.95701097588426]
We introduce a Global Perspective Tokenizer (GloTok) to model a more uniform semantic distribution of tokenized features. A residual learning module is proposed to recover the fine-grained details to minimize the reconstruction error caused by quantization. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
arXiv Detail & Related papers (2025-11-18T06:40:26Z)
Self-Supervised Graph Learning via Spectral Bootstrapping and Laplacian-Based Augmentations [1.0377683220196872]
We present LaplaceGNN, a novel self-supervised graph learning framework. Our method integrates Laplacian-based signals into the learning process. LaplaceGNN achieves superior performance compared to state-of-the-art self-supervised graph methods.
arXiv Detail & Related papers (2025-06-25T12:23:23Z)
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation [29.809079908218607]
This work introduces a fresh solution to reinforce base pseudo-labels and facilitate target-prompt learning. We first propose to leverage the reference predictions based on the relationship between source and target visual embeddings. We later show that there is a strong clustering behavior observed between visual and text embeddings in pre-trained multi-modal models.
arXiv Detail & Related papers (2025-06-13T06:33:27Z)
Self Distillation via Iterative Constructive Perturbations [0.2748831616311481]
We propose a novel framework that uses a cyclic optimization strategy to concurrently optimize the model and its input data for better training. By alternately altering the model's parameters to the data and the data to the model, our method effectively addresses the gap between fitting and generalization.
arXiv Detail & Related papers (2025-05-20T13:15:27Z)
DDAE++: Enhancing Diffusion Models Towards Unified Generative and Discriminative Learning [53.27049077100897]
Generative pre-training has been shown to yield discriminative representations, paving the way towards unified visual generation and understanding. This work introduces self-conditioning, a mechanism that internally leverages the rich semantics inherent in the denoising network to guide its own decoding layers. Results are compelling: our method boosts both generation FID and recognition accuracy with 1% computational overhead and generalizes across diverse diffusion architectures.
arXiv Detail & Related papers (2025-05-16T08:47:16Z)
Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
arXiv Detail & Related papers (2025-03-06T13:18:37Z)
EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding benchmark, and 3D first-person ViZDoom environments.
arXiv Detail & Related papers (2025-02-01T15:49:59Z)
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models [55.82190434534429]
Large-scale generative models have shown impressive image-generation capabilities, propelled by massive data. This often inadvertently leads to the generation of harmful or inappropriate content and raises copyright concerns. We propose a framework that seeks an optimal model update at each unlearning iteration, ensuring monotonic improvement on both objectives.
arXiv Detail & Related papers (2024-12-09T21:36:10Z)
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
ReCoRe: Regularized Contrastive Representation Learning of World Model [21.29132219042405]
We present a world model that learns invariant features using contrastive unsupervised learning and an intervention-invariant regularizer. Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark.
arXiv Detail & Related papers (2023-12-14T15:53:07Z)
On-the-Fly Guidance Training for Medical Image Registration [14.309599960641242]
This study introduces a novel On-the-Fly Guidance (OFG) training framework for enhancing existing learning-based image registration models. Our method trains registration models in a supervised fashion, without the need for any labeled data. Tested across several benchmark datasets and leading models, it significantly enhanced performance.
arXiv Detail & Related papers (2023-08-29T11:12:53Z)
Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training. We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields. Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z)
Iterative autoregression: a novel trick to improve your low-latency speech enhancement model [2.2999148299770047]
Streaming models are an essential component of real-time speech enhancement tools. We propose a straightforward yet effective alternative technique for training autoregressive low-latency speech enhancement models.
arXiv Detail & Related papers (2022-11-03T12:32:33Z)
Mean Embeddings with Test-Time Data Augmentation for Ensembling of Representations [8.336315962271396]
We look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA). MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models. We believe that extending the success of ensembles to the inference of higher-quality representations is an important step that will open many new applications of ensembling.
arXiv Detail & Related papers (2021-06-15T10:49:46Z)
Learning by Distillation: A Self-Supervised Learning Framework for Optical Flow Estimation [71.76008290101214]
DistillFlow is a knowledge distillation approach to learning optical flow. It achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets. Our models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperform all published methods on the Sintel Final benchmark.
arXiv Detail & Related papers (2021-06-08T09:13:34Z)
Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training. We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area. Recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations, are not highly effective on videos. This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.