Resilience through Scene Context in Visual Referring Expression Generation
- URL: http://arxiv.org/abs/2404.12289v2
- Date: Fri, 23 Aug 2024 13:58:12 GMT
- Title: Resilience through Scene Context in Visual Referring Expression Generation
- Authors: Simeon Junker, Sina Zarrieß,
- Abstract summary: We investigate the role of context in Referring Expression Generation (REG) for objects in images.
We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient.
- Score: 8.883534683127415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
Related papers
- COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication [3.829419993067886]
We show that Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects.<n>Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise.
arXiv Detail & Related papers (2025-06-27T14:44:45Z) - v-CLR: View-Consistent Learning for Open-World Instance Segmentation [24.32192108470939]
Vanilla visual networks are biased toward learning appearance information, eg texture, to recognize objects.
This implicit bias causes the model to fail in detecting novel objects with textures unseen in the open-world setting.
We propose view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation.
arXiv Detail & Related papers (2025-04-02T05:52:30Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z) - Visual Context-Aware Person Fall Detection [52.49277799455569]
We present a segmentation pipeline to semi-automatically separate individuals and objects in images.
Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms.
We demonstrate that object-specific contextual transformations during training effectively mitigate this challenge.
arXiv Detail & Related papers (2024-04-11T19:06:36Z) - Context-Guided Spatio-Temporal Video Grounding [22.839160907707885]
We propose a framework, context-guided STVG (CG-STVG), which mines discriminative instance context for object in videos.
CG-STVG enjoys object information in text query and guidance from mined instance visual context for more accurate target localization.
In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-arts in m_tIoU and m_vIoU.
arXiv Detail & Related papers (2024-01-03T07:05:49Z) - DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z) - Context-LGM: Leveraging Object-Context Relation for Context-Aware Object
Recognition [48.5398871460388]
We propose a novel Contextual Latent Generative Model (Context-LGM), which considers the object-context relation and models it in a hierarchical manner.
To infer contextual features, we reformulate the objective function of Variational Auto-Encoder (VAE), where contextual features are learned as a posterior conditioned distribution on the object.
The effectiveness of our method is verified by state-of-the-art performance on two context-aware object recognition tasks.
arXiv Detail & Related papers (2021-10-08T11:31:58Z) - Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout to image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against natural background (stuff)
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z) - Learning from Context or Names? An Empirical Study on Neural Relation
Extraction [112.06614505580501]
We study the effect of two main information sources in text: textual context and entity mentions (names)
We propose an entity-masked contrastive pre-training framework for relation extraction (RE)
Our framework can improve the effectiveness and robustness of neural models in different RE scenarios.
arXiv Detail & Related papers (2020-10-05T11:21:59Z) - Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.