ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
- URL: http://arxiv.org/abs/2511.20274v1
- Date: Tue, 25 Nov 2025 12:59:31 GMT
- Title: ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
- Authors: Advik Sinha, Saurabh Atreya, Aashutosh A, Sk Aziz Ali, Abhijit Das,
- Abstract summary: PyramidCLIP aims to align global and local visual features, yet it still lacks explicit modeling of inter-object relations.
We generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets.
We establish a comprehensive benchmark for several scenario-based tasks and compare against many baseline methods.
- Score: 4.611741386167832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Until recently, the general corpus of CLIP-type foundation models has widely explored either the retrieval of short descriptions or the classification of objects in a scene as a SINGLE-object image classification task. The same holds for retrieving an image embedding (the image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP literature improve class-level discrimination by mining harder negative image-text pairs and by refining fixed text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting those relations. The model is pretrained on curated scenario data and finetuned for specialized downstream tasks such as cross-modal retrieval and fine-grained visual understanding. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing, publicly available indoor and outdoor scenario datasets, using a pipeline of existing language models to ground actions, objects, and relations, followed by manual and automatic curation. We establish a comprehensive benchmark for several scenario-based tasks and compare against many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetuned performance on various domain-specific tasks. Our code and dataset are available at https://github.com/scenario-clip/ScenarioCLIP
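The abstract describes aligning full captions and grounded relations with whole images and focused relation regions. The paper's exact objective is not given here, so the following is a minimal PyTorch sketch, assuming a symmetric InfoNCE loss and hypothetical encoder and batch-field names (`image_enc`, `text_enc`, `batch["regions"]`, etc.), of how such a multi-branch alignment step could look:

```python
# Minimal sketch of a ScenarioCLIP-style multi-branch contrastive step.
# Assumptions (ours, not the paper's): shared CLIP-style towers, a symmetric
# InfoNCE objective, and region crops / relation strings encoded by the same
# encoders as full images / captions.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings, each (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def scenario_step(image_enc, text_enc, batch):
    """One pretraining step over (image, region, caption, relation) quadruples."""
    img = image_enc(batch["images"])     # global scene embedding
    reg = image_enc(batch["regions"])    # focused-region embedding
    txt = text_enc(batch["captions"])    # full caption embedding
    rel = text_enc(batch["relations"])   # grounded relation embedding
    # Align scene<->caption globally, region<->relation locally, and add a
    # cross term tying relations back to the full scene.
    return info_nce(img, txt) + info_nce(reg, rel) + info_nce(img, rel)
```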
Related papers
- Referring Expression Instance Retrieval and A Strong End-to-End Baseline [37.47466772169063]
Text-Image Retrieval retrieves a target image from a gallery based on an image-level description.
Referring Expression Comprehension localizes a target object within a given image using an instance-level description.
We introduce a new task called Referring Expression Instance Retrieval (REIR), which supports both instance-level retrieval and localization.
arXiv Detail & Related papers (2025-06-23T02:28:44Z)
- CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation [3.1667055223489786]
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios.
This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO.
Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects; a minimal probe sketch follows this entry.
arXiv Detail & Related papers (2025-02-27T07:34:42Z)
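The first-mention bias reported above is easy to probe with an off-the-shelf CLIP. The snippet below is our own minimal illustration rather than the ComCO protocol: it scores one multi-object image (a hypothetical file `dog_and_cat.jpg`) against two captions that differ only in the order the objects are mentioned, using the Hugging Face CLIP API.

```python
# Probe CLIP's sensitivity to object order in the caption (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_and_cat.jpg")  # hypothetical multi-object test image
captions = [
    "a photo of a dog and a cat",
    "a photo of a cat and a dog",      # same content, order swapped
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # (1, 2) image-text similarities
for caption, p in zip(captions, logits.softmax(dim=-1).squeeze(0).tolist()):
    print(f"{p:.3f}  {caption}")
# A consistent gap across many such pairs would reflect the first-mention bias.
```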
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities; a pooling sketch of this idea follows this entry.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
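As a rough illustration of compressing visual tokens at several granularities (our sketch; the actual MME architecture may differ), one can average-pool a ViT patch grid into nested, Matryoshka-style token sets:

```python
# Pool a 16x16 ViT patch grid into nested coarser grids (illustrative sketch).
import torch
import torch.nn.functional as F

def nested_token_sets(patch_tokens: torch.Tensor, grid: int = 16,
                      levels=(16, 8, 4, 1)):
    """patch_tokens: (B, grid*grid, D) output of a vision transformer."""
    b, n, d = patch_tokens.shape
    feat = patch_tokens.transpose(1, 2).reshape(b, d, grid, grid)
    out = {}
    for g in levels:
        pooled = F.adaptive_avg_pool2d(feat, g)       # (B, D, g, g)
        out[g] = pooled.flatten(2).transpose(1, 2)    # (B, g*g, D) tokens
    return out  # e.g. 256, 64, 16, and 1 visual token(s) per image

tokens = torch.randn(2, 256, 768)   # dummy ViT output for two images
print({g: t.shape for g, t in nested_token_sets(tokens).items()})
```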
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Scene Graph Generation with Role-Playing Large Language Models [50.252588437973245]
Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP.
We propose SDSGG, a scene-specific, description-based OVSGG framework.
To capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter.
arXiv Detail & Related papers (2024-10-20T11:40:31Z)
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative texts and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z)
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating the image generation tasks when long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- Test-time Contrastive Concepts for Open-world Semantic Segmentation with Vision-Language Models [14.899741072838994]
Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs, have paved the way to open-vocabulary semantic segmentation.
We propose two different approaches to automatically generate, at test time, query-specific textual contrastive concepts; a scoring sketch of this idea follows this entry.
arXiv Detail & Related papers (2024-07-06T12:18:43Z)
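A minimal sketch of the contrastive-concept idea (the prompt list below is a hand-written assumption; the paper derives such concepts automatically at test time): score the query against plausible co-occurring concepts instead of a single generic "background" prompt.

```python
# Score a query concept against contrastive concepts with CLIP (illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a boat"
contrastive = ["water", "the sky", "a dock", "trees on the shore"]  # assumed
image = Image.open("harbor.jpg")   # hypothetical image, crop, or window

inputs = processor(text=[query] + contrastive, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image.squeeze(0)  # one score per text
p_query = logits.softmax(dim=-1)[0]
print(f"P({query!r} vs. contrastive concepts) = {p_query:.3f}")
# In the segmentation setting this scoring is applied densely (per patch or
# per mask proposal) rather than to a whole image.
```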
- Placing Objects in Context via Inpainting for Out-of-distribution Segmentation [59.00092709848619]
Placing Objects in Context (POC) is a pipeline to realistically add objects to an image.
POC can be used to extend any dataset with an arbitrary number of objects.
We present different anomaly segmentation datasets based on POC-generated data and show that POC can improve the performance of recent state-of-the-art anomaly fine-tuning methods.
arXiv Detail & Related papers (2024-02-26T08:32:41Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends this to images with novel objects and without paired captions in the training data.
arXiv Detail & Related papers (2020-11-02T13:33:20Z)