Related papers: Learning 1D Causal Visual Representation with De-focus Attention Networks

Learning 1D Causal Visual Representation with De-focus Attention Networks

URL: http://arxiv.org/abs/2406.04342v1
Date: Thu, 6 Jun 2024 17:59:56 GMT
Title: Learning 1D Causal Visual Representation with De-focus Attention Networks
Authors: Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie Zhou, Jifeng Dai,
Abstract summary: This paper explores the feasibility of representing images using 1D causal modeling. We propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns.
Score: 108.72931590504406
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, where attention overly concentrates on a small proportion of visual tokens. The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization. To address this, we propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns. During training, large and scheduled drop path rates, and an auxiliary loss on globally pooled features for global understanding tasks are introduced. These two strategies encourage the model to attend to a broader range of tokens and enhance network optimization. Extensive experiments validate the efficacy of our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation in tasks such as global perception, dense prediction, and multi-modal understanding. Code is released at https://github.com/OpenGVLab/De-focus-Attention-Networks.

Related papers

Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains [31.828341309787042]
Vision-language models (VLMs) achieve remarkable success in single-image tasks. Real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline. We propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs'perception, comprehension, and reasoning abilities in multi-image scenarios.
arXiv Detail & Related papers (2025-04-28T19:02:18Z)
Interpreting Low-level Vision Models with Causal Effect Maps [25.07089157448049]
We introduce causality theory to interpret low-level vision models. We propose a model-/task-agnostic method called Causal Effect Map (CEM) CEM visualizes and quantify the input-output relationships on either positive or negative effects.
arXiv Detail & Related papers (2024-07-29T08:33:32Z)
Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise. MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains. Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR) We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner. Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment [47.53887941065894]
Pre-trained vision-language models have inspired much research on few-shot learning. With only a few training images, the visual feature distributions are easily distracted by class-irrelevant information in images. We propose a Selective Attack module that generates spatial attention maps of images to guide the attacks on class-irrelevant image areas.
arXiv Detail & Related papers (2023-05-19T05:45:17Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
Visual Attention Network [90.0753726786985]
We propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention. We also introduce a novel neural network based on LKA, namely Visual Attention Network (VAN) VAN outperforms the state-of-the-art vision transformers and convolutional neural networks with a large margin in extensive experiments.
arXiv Detail & Related papers (2022-02-20T06:35:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.