How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining
- URL: http://arxiv.org/abs/2403.02233v2
- Date: Wed, 5 Jun 2024 00:22:56 GMT
- Title: How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining
- Authors: Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
- Abstract summary: We provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining.
On the conceptual side, we posit a mechanism by which transformers trained with masked vision pretraining objectives produce the empirically observed local and diverse attention patterns.
On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings.
- Score: 66.08606211686339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked reconstruction, which predicts randomly masked patches from unmasked ones, has emerged as an important approach in self-supervised pretraining. However, the theoretical understanding of masked pretraining is rather limited, especially for the foundational architecture of transformers. In this paper, to the best of our knowledge, we provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining. On the conceptual side, we posit a mechanism by which transformers trained with masked vision pretraining objectives produce the empirically observed local and diverse attention patterns, on data distributions with spatial structures that highlight feature-position correlations. On the technical side, our end-to-end characterization of the training dynamics of softmax-attention models simultaneously accounts for input and position embeddings; it is developed through a careful analysis that tracks the interplay between feature-wise and position-wise attention correlations.
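To make the setting concrete, the sketch below shows the kind of model the analysis concerns: a single softmax-attention layer with both input (patch) and position embeddings, trained to reconstruct randomly masked patches from the unmasked ones. This is an illustrative minimal implementation, not the paper's exact construction; all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class OneLayerMaskedReconstructor(nn.Module):
    """One softmax-attention layer with input and position embeddings (illustrative)."""

    def __init__(self, num_patches=16, patch_dim=48, d_model=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)                        # input embedding
        self.pos = nn.Parameter(0.02 * torch.randn(num_patches, d_model)) # position embedding
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.decode = nn.Linear(d_model, patch_dim)                       # map back to pixel space

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = patch is hidden
        x = self.embed(patches * (~mask).unsqueeze(-1).float())           # zero out masked content
        x = x + self.pos                                                  # positions survive masking
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return self.decode(attn @ self.v(x))                              # (B, N, patch_dim)

model = OneLayerMaskedReconstructor()
patches = torch.randn(8, 16, 48)
mask = torch.rand(8, 16) < 0.75                                           # mask ~75% of patches
recon = model(patches, mask)
loss = ((recon - patches)[mask] ** 2).mean()                              # loss on masked patches only
loss.backward()
```

Note that the position embeddings are added even at masked locations, so the attention scores mix feature-driven and position-driven terms; this is precisely the interplay between feature-wise and position-wise attention correlations that the abstract highlights.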
Related papers
- CorrMAE: Pre-training Correspondence Transformers with Masked Autoencoder [44.94921073819524]
We propose a pre-training method to acquire a generic inliers-consistent representation by reconstructing masked correspondences.
In practice, we introduce CorrMAE, an extension of the masked autoencoder framework tailored to pre-training for correspondence pruning.
arXiv Detail & Related papers (2024-06-09T13:14:00Z)
- On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
- In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
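As a rough illustration of this kind of setting (a toy sketch under assumed data, not the paper's exact model), one can train a single softmax-attention layer by plain gradient descent on an in-context regression task:

```python
import torch

# Toy in-context task (illustrative only): a prompt holds pairs (x_i, y_i)
# with y_i = <w, x_i> for a fresh task vector w, plus a query x_q; a single
# softmax-attention layer is trained by gradient descent to output <w, x_q>.
d, n_ctx, batch = 8, 32, 16
QK = torch.zeros(d + 1, d + 1, requires_grad=True)     # merged query-key matrix
opt = torch.optim.SGD([QK], lr=0.5)

for step in range(2000):
    w = torch.randn(batch, d)                          # fresh tasks each step
    X = torch.randn(batch, n_ctx, d)
    y = torch.einsum('bd,bnd->bn', w, X)
    ctx = torch.cat([X, y.unsqueeze(-1)], dim=-1)      # context tokens (x_i, y_i)
    xq = torch.randn(batch, d)
    q = torch.cat([xq, torch.zeros(batch, 1)], dim=-1) # query token (x_q, 0)
    scores = (q @ QK).unsqueeze(1) @ ctx.transpose(1, 2)
    attn = torch.softmax(scores, dim=-1)               # (batch, 1, n_ctx)
    pred = (attn @ ctx)[:, 0, -1]                      # attention-weighted y values
    loss = ((pred - torch.einsum('bd,bd->b', w, xq)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```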
arXiv Detail & Related papers (2023-10-08T17:55:33Z)
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders [7.133110402648305]
This study explores the application of self-supervised learning to the task of motion forecasting.
Forecast-MAE is an extension of the masked autoencoder framework specifically designed for self-supervised learning of the motion forecasting task.
arXiv Detail & Related papers (2023-08-19T02:27:51Z)
- ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
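A loose sketch of permuted prediction over patch tokens follows; the visibility-mask construction is an assumption in the spirit of permutation language modeling, not necessarily MaPeT's exact objective.

```python
import torch

def permutation_visibility(tokens):
    """Sample a random factorization order over patch tokens and build the
    visibility mask: token i may attend to token j iff j precedes i in the
    sampled order (permutation-language-modeling style)."""
    B, N = tokens.shape
    order = torch.argsort(torch.rand(B, N), dim=1)   # a random permutation per sample
    rank = torch.argsort(order, dim=1)               # each token's place in that order
    visible = rank.unsqueeze(2) > rank.unsqueeze(1)  # (B, N, N) boolean attention mask
    return visible

tokens = torch.randint(0, 8192, (4, 196))            # e.g. a 14x14 grid of patch token ids
visible = permutation_visibility(tokens)             # feed as the predictor's attention mask
```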
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
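A minimal sketch of the idea, with illustrative names and sizes: patch contents are encoded without any positional information, and a head classifies each patch's original location.

```python
import torch
import torch.nn as nn

# No positional encoding goes in; the model must infer each patch's location
# from content alone, as an N-way classification per patch (sizes are illustrative).
num_patches, patch_dim, d_model = 196, 768, 256
embed = nn.Linear(patch_dim, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, num_patches)                 # logits over the N possible positions

patches = torch.randn(8, num_patches, patch_dim)       # patch contents only
logits = head(encoder(embed(patches)))                 # (B, N, num_patches)
targets = torch.arange(num_patches).expand(8, -1)      # each patch's true location index
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_patches), targets.reshape(-1))
```

Since a transformer without position embeddings is permutation-equivariant, the input order carries no signal, which is what makes the task non-trivial.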
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- Spatial Entropy Regularization for Vision Transformers [71.44392961125807]
Vision Transformers (VTs) can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised.
We propose a VT regularization method based on a spatial formulation of the information entropy.
We show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures.
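As a hedged approximation, a minimal entropy penalty on attention rows looks as follows; the paper's spatial entropy additionally accounts for the 2-D layout of the patches, which this sketch omits.

```python
import torch

def attention_entropy(attn, eps=1e-8):
    """Mean Shannon entropy of the attention rows.
    attn: (B, heads, N_query, N_key), each row sums to 1."""
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

# Adding lambda * attention_entropy(attn) to the training loss pushes each
# query's attention mass onto a few patches.
attn = torch.softmax(torch.randn(2, 4, 196, 196), dim=-1)
reg_loss = attention_entropy(attn)
```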
arXiv Detail & Related papers (2022-06-09T17:34:39Z)
- RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training [80.44284270879028]
This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre).
Our RePre extends contrastive frameworks by adding a branch for reconstructing raw image pixels in parallel with the existing contrastive objective.
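Schematically (the heads and losses below are placeholders, not RePre's actual architecture, which decodes from multi-level features), the objective combines a contrastive term on pooled features with a pixel-reconstruction term from a parallel branch:

```python
import torch
import torch.nn as nn

class RePreStyle(nn.Module):
    """Contrastive branch plus a parallel pixel-reconstruction branch (schematic)."""

    def __init__(self, encoder, d_model=256, proj_dim=128, patch_dim=768):
        super().__init__()
        self.encoder = encoder                        # any ViT-style backbone
        self.proj = nn.Linear(d_model, proj_dim)      # head for the contrastive term
        self.decoder = nn.Linear(d_model, patch_dim)  # head reconstructing raw pixels

    def forward(self, x):
        feats = self.encoder(x)                       # (B, N, d_model) patch features
        z = self.proj(feats.mean(dim=1))              # pooled feature for the contrastive loss
        recon = self.decoder(feats)                   # per-patch pixel reconstruction
        return z, recon

# total_loss = contrastive_loss(z1, z2) + lambda_rec * mse_loss(recon, raw_patches)
```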
arXiv Detail & Related papers (2022-01-18T10:24:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.