Context Matters: Learning Global Semantics via Object-Centric Representation
- URL: http://arxiv.org/abs/2510.05674v2
- Date: Wed, 08 Oct 2025 18:28:20 GMT
- Title: Context Matters: Learning Global Semantics via Object-Centric Representation
- Authors: Jike Zhong, Yuxiang Lai, Xiaofeng Yang, Konstantinos Psounis, et al.
- Abstract summary: Vision models have yet to exhibit comparable progress in in-context learning. We argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes. We propose to directly model the "object" as the visual equivalent of the "word," pushing the model to learn the global context and semantics among visual elements.
- Score: 8.195437248815802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and that it can be narrowed through the design of a semantic-grounded objective. Specifically, we observe that individual words in natural language are inherently semantic, so modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model the "object" as the visual equivalent of the "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypothesis via masked image modeling (MIM), a framework in which our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps the model learn a real-world distribution, whereas without it pixel-averaging shortcuts are often learned instead. Moreover, further evaluations with multimodal LLMs (MLLMs) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and models will be publicly released.
- Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
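The core change the abstract describes, masking whole objects instead of random patches in MIM, can be sketched as mask generation over a patch grid. The sketch below is illustrative only: the function names, the greedy object-selection strategy, and the toy segmentation grid are assumptions for demonstration, not the paper's released implementation.

```python
import numpy as np

def random_patch_mask(grid_size, mask_ratio, rng):
    """Standard MIM masking: hide a random subset of patches."""
    n = grid_size * grid_size
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(n * mask_ratio), replace=False)] = True
    return mask.reshape(grid_size, grid_size)

def object_mask(segmentation, mask_ratio, rng):
    """Object-level masking: hide every patch of randomly chosen
    objects until roughly mask_ratio of the grid is covered, so each
    object is always masked as a whole."""
    ids = np.unique(segmentation)
    rng.shuffle(ids)
    mask = np.zeros_like(segmentation, dtype=bool)
    target = mask_ratio * segmentation.size
    for obj_id in ids:
        if mask.sum() >= target:
            break
        mask |= (segmentation == obj_id)
    return mask

rng = np.random.default_rng(0)
# Toy 4x4 patch grid with object ids 1-3 (0 = background).
seg = np.array([[0, 0, 1, 1],
                [0, 2, 2, 1],
                [0, 2, 2, 0],
                [3, 3, 0, 0]])
m = object_mask(seg, mask_ratio=0.5, rng=rng)
print(m.sum())  # number of masked patches
```

The key property, and the contrast with random patch masking, is that the mask boundary always coincides with object boundaries, so the model cannot reconstruct a masked region by averaging neighboring pixels of the same object.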
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z) - Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training. It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z) - Language Models as Zero-shot Visual Semantic Learners [0.618778092044887]
We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short.
arXiv Detail & Related papers (2021-07-26T08:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.