Related papers: PixelWorld: How Far Are We from Perceiving Everything as Pixels?

URL: http://arxiv.org/abs/2501.19339v3
Date: Tue, 21 Oct 2025 19:23:59 GMT
Title: PixelWorld: How Far Are We from Perceiving Everything as Pixels?
Authors: Zhiheng Lyu, Xueguang Ma, Wenhu Chen
Abstract summary: Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information. We explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks.
Score: 62.068243387551085
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation, although Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure. We further find that when visual and textual information are closely integrated, representing everything as pixels simplifies preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a systematic and practical framework for evaluating unified vision-language models and facilitates further exploration of pixel-based multimodal learning.

Related papers

UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation [51.31795451147935]
We present a unified generative model that supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment.
arXiv Detail & Related papers (2025-11-21T03:02:10Z)
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications [0.0]
We propose a novel Context-Aware Semantic framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.
arXiv Detail & Related papers (2025-03-25T02:12:35Z)
Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On [0.0]
PixelSHAP is a framework extending Shapley-based analysis to structured visual entities. It applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods.
arXiv Detail & Related papers (2025-03-09T15:43:55Z)
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image. We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
KeyPoint Relative Position Encoding for Face Recognition [15.65725865703615]
Keypoint RPE (KP-RPE) is an extension of the principle where significance of pixels is not solely dictated by their proximity. Code and pre-trained models are available.
arXiv Detail & Related papers (2024-03-21T21:56:09Z)
Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching [58.10418136917358]
Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotic training. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks. We learn a structured cross-modality matching solver to represent 3D features via a different latent pixel space.
arXiv Detail & Related papers (2023-12-07T05:46:10Z)
Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks. APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection. Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions. We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN). DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z)
Superpixel Semantics Representation and Pre-training for Vision-Language Task [11.029236633301222]
Coarse-grained semantic interactions in image space should not be ignored. This paper proposes superpixels as comprehensive and robust visual primitives. It allows parsing the entire image as a fine-to-coarse visual hierarchy.
arXiv Detail & Related papers (2023-10-20T12:26:04Z)
Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization. This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts. Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z)
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process [94.41510903676837]
We propose an Alternating Denoising Diffusion Process (ADDP) that integrates two spaces, raw pixels and VQ tokens, within a single representation learning framework. In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks.
arXiv Detail & Related papers (2023-06-08T17:59:32Z)
Learn how to Prune Pixels for Multi-view Neural Image-based Synthesis [10.571582038258443]
We present LeHoPP, a method for input pixel pruning. We examine the importance of each input pixel concerning the rendered view, and we avoid the use of irrelevant pixels. Even without retraining the image-based rendering network, our approach shows a good trade-off between synthesis quality and pixel rate.
arXiv Detail & Related papers (2023-05-05T14:29:24Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
Adaptive Single Image Deblurring [43.02281823557039]
We propose an efficient pixel adaptive and feature attentive design for handling large blur variations within and across different images. We also propose an effective content-aware global-local filtering module that significantly improves the performance.
arXiv Detail & Related papers (2022-01-01T10:10:19Z)
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to the same semantic class to be more similar than embeddings from different classes. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
arXiv Detail & Related papers (2021-01-28T11:35:32Z)
ITSELF: Iterative Saliency Estimation fLexible Framework [68.8204255655161]
Salient object detection estimates the objects that most stand out in an image. We propose a superpixel-based ITerative Saliency Estimation fLexible Framework (ITSELF) that allows any user-defined assumptions to be added to the model. We compare ITSELF to two state-of-the-art saliency estimators on five metrics and six datasets.
arXiv Detail & Related papers (2020-06-30T16:51:31Z)
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers [46.275416873403614]
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding. Our approach achieves state-of-the-art results on downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
arXiv Detail & Related papers (2020-04-02T07:39:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.