Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
- URL: http://arxiv.org/abs/2510.25070v1
- Date: Wed, 29 Oct 2025 01:16:21 GMT
- Title: Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
- Authors: Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi,
- Abstract summary: This work proposes a vision-language integration framework that unifies pre-trained visual encoders and large language models.<n>The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
Related papers
- Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios [0.0]
This work introduces a Dynamic Context-Aware Scene Reasoning framework to address zero-shot real-world scenarios.<n>The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions.<n>Experiments demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments.
arXiv Detail & Related papers (2025-10-30T15:07:55Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment.<n>We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment.<n>Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS)
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications [0.0]
We propose a novel Context-Aware Semantic framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones.<n>A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively.<n>This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.
arXiv Detail & Related papers (2025-03-25T02:12:35Z) - Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition [50.86415025650168]
Masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge.<n>We propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch.
arXiv Detail & Related papers (2025-03-24T14:53:35Z) - Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z) - Aligning and Prompting Everything All at Once for Universal Visual
Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - LanGWM: Language Grounded World Model [24.86620763902546]
We focus on learning language-grounded visual features to enhance the world model learning.
Our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction.
arXiv Detail & Related papers (2023-11-29T12:41:55Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity
Resolution [0.0]
We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description.
Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains.
We introduce a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations.
arXiv Detail & Related papers (2022-05-24T14:12:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.