PixelWorld: Towards Perceiving Everything as Pixels
- URL: http://arxiv.org/abs/2501.19339v2
- Date: Wed, 21 May 2025 02:35:00 GMT
- Title: PixelWorld: Towards Perceiving Everything as Pixels
- Authors: Zhiheng Lyu, Xueguang Ma, Wenhu Chen
- Abstract summary: Perceive Everything as Pixels (PEAP) is a unified perception paradigm, evaluated with PixelWorld, a benchmark that renders natural-language inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks. We also find that when visual and textual information are closely integrated, representing everything as pixels reduces preprocessing complexity.
- Score: 50.13953243722129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent agentic language models increasingly need to interact directly with real-world environments containing intertwined visual and textual information through raw camera pixels, rather than relying on separate image and tokenized text processing, underscoring the necessity of a unified perception paradigm. To close this gap, we explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also find that when visual and textual information are closely integrated, representing everything as pixels reduces preprocessing complexity and avoids misalignment issues that often arise in separate pipelines. PixelWorld therefore serves as a practical benchmark for evaluating unified vision-language models and supports broader exploration of PEAP across diverse tasks.
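To make the render-to-pixels idea concrete, here is a minimal sketch that maps a text string onto a grayscale pixel grid. This is only a toy stand-in, not the paper's rendering pipeline (which would rasterize real glyphs with a font engine); the function name, cell size, and intensity mapping are all illustrative assumptions.

```python
def text_to_pixels(text, width=16, cell=2):
    """Toy rasterizer: each character becomes a cell x cell block of
    grayscale pixels whose intensity is the character's code point.
    A real PEAP-style pipeline would render glyphs with a font engine."""
    rows = []
    for start in range(0, len(text), width):
        # Pad the last line so every pixel row has the same width.
        line = text[start:start + width].ljust(width)
        intensities = [min(ord(c), 255) for c in line]
        for _ in range(cell):
            row = []
            for v in intensities:
                row.extend([v] * cell)  # repeat horizontally
            rows.append(row)
    return rows  # list of pixel rows, values in 0..255

grid = text_to_pixels("Perceive everything as pixels")
```

A vision model would then consume `grid` directly, with no tokenizer in the loop.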
Related papers
- Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications [0.0]
We propose a novel Context-Aware Semantic framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.
arXiv Detail & Related papers (2025-03-25T02:12:35Z) - Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On [0.0]
PixelSHAP is a framework extending Shapley-based analysis to structured visual entities.
It applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response.
It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods.
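The perturb-and-attribute idea behind this family of methods can be sketched with a generic Monte Carlo Shapley estimator. This is a stand-in, not PixelSHAP's actual implementation; in a PixelSHAP-style setup, `value_fn` would perturb image objects and score the similarity of the VLM's response, whereas here it is a toy additive scorer.

```python
import random

def shapley_values(objects, value_fn, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate: average each object's marginal
    contribution over random orderings of the object set."""
    rng = random.Random(seed)
    phi = {o: 0.0 for o in objects}
    for _ in range(n_samples):
        order = list(objects)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(coalition)
        for o in order:
            coalition.add(o)
            cur = value_fn(coalition)
            phi[o] += cur - prev  # marginal contribution of o
            prev = cur
    return {o: total / n_samples for o, total in phi.items()}

# Toy additive value function: "a" contributes 1.0, "b" contributes 0.5.
toy = shapley_values(["a", "b"],
                     lambda s: ("a" in s) * 1.0 + ("b" in s) * 0.5)
```

Because the toy value function is additive, the estimate recovers each object's contribution exactly.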
arXiv Detail & Related papers (2025-03-09T15:43:55Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on a static image.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - KeyPoint Relative Position Encoding for Face Recognition [15.65725865703615]
Keypoint RPE (KP-RPE) extends relative position encoding so that the significance of pixels is not dictated solely by their proximity.
Code and pre-trained models are available.
arXiv Detail & Related papers (2024-03-21T21:56:09Z) - Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching [58.10418136917358]
Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotics.
Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks.
We learn a structured cross-modality matching solver to represent 3D features via a different latent pixel space.
arXiv Detail & Related papers (2023-12-07T05:46:10Z) - Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN)
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z) - Superpixel Semantics Representation and Pre-training for Vision-Language Task [11.029236633301222]
Coarse-grained semantic interactions in image space should not be ignored.
This paper proposes superpixels as comprehensive and robust visual primitives.
It allows parsing the entire image as a fine-to-coarse visual hierarchy.
arXiv Detail & Related papers (2023-10-20T12:26:04Z) - Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z) - ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process [94.41510903676837]
We propose an Alternating Denoising Diffusion Process (ADDP) that integrates two spaces within a single representation learning framework.
In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels.
The learned representations can be used to generate diverse high-fidelity images and also demonstrate excellent transfer performance on recognition tasks.
arXiv Detail & Related papers (2023-06-08T17:59:32Z) - Learn how to Prune Pixels for Multi-view Neural Image-based Synthesis [10.571582038258443]
We present LeHoPP, a method for input pixel pruning.
We examine the importance of each input pixel concerning the rendered view, and we avoid the use of irrelevant pixels.
Even without retraining the image-based rendering network, our approach shows a good trade-off between synthesis quality and pixel rate.
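An importance-threshold pruning step of this kind can be sketched as follows. The importance scores are assumed to be given as inputs here, whereas LeHoPP derives them from each pixel's contribution to the rendered view; the function name and masking convention are illustrative.

```python
def prune_pixels(pixels, importance, keep_ratio=0.5):
    """Keep only the top-scoring fraction of input pixels; prune the
    rest (marked None) so they are never sent to the renderer."""
    k = max(1, int(len(pixels) * keep_ratio))
    # Rank pixel indices by importance, highest first.
    ranked = sorted(range(len(pixels)),
                    key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:k])
    return [p if i in keep else None for i, p in enumerate(pixels)]

masked = prune_pixels([10, 40, 5, 80], [0.1, 0.9, 0.2, 0.8],
                      keep_ratio=0.5)
```

Lowering `keep_ratio` trades synthesis quality for a smaller pixel rate, which is the trade-off the paper evaluates.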
arXiv Detail & Related papers (2023-05-05T14:29:24Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Adaptive Single Image Deblurring [43.02281823557039]
We propose an efficient pixel adaptive and feature attentive design for handling large blur variations within and across different images.
We also propose an effective content-aware global-local filtering module that significantly improves the performance.
arXiv Detail & Related papers (2022-01-01T10:10:19Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes.
Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
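The core idea can be sketched as a supervised InfoNCE-style loss over pixel embeddings: for each anchor pixel, same-class pixels are positives and all other pixels are negatives. This is a minimal in-batch illustration, not the paper's memory-bank training setup.

```python
import math

def pixel_contrastive_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss over pixel embeddings: pull
    same-class pixels together, push different-class pixels apart."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [a / n for a in u]
    z = [normalize(e) for e in embeddings]
    total, count = 0.0, 0
    for i in range(len(z)):
        pos = [j for j in range(len(z)) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        # Temperature-scaled similarities to every other pixel.
        exps = {j: math.exp(dot(z[i], z[j]) / tau)
                for j in range(len(z)) if j != i}
        denom = sum(exps.values())
        for j in pos:
            total += -math.log(exps[j] / denom)
            count += 1
    return total / count
```

Embeddings that cluster by class yield a lower loss than embeddings that mix classes, which is exactly the structure the training objective rewards.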
arXiv Detail & Related papers (2021-01-28T11:35:32Z) - ITSELF: Iterative Saliency Estimation fLexible Framework [68.8204255655161]
Salient object detection estimates the objects that most stand out in an image.
We propose a superpixel-based ITerative Saliency Estimation fLexible Framework (ITSELF) that allows any user-defined assumptions to be added to the model.
We compare ITSELF to two state-of-the-art saliency estimators on five metrics and six datasets.
arXiv Detail & Related papers (2020-06-30T16:51:31Z) - Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers [46.275416873403614]
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding.
Our approach achieves state-of-the-art performance on downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
arXiv Detail & Related papers (2020-04-02T07:39:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.