Towards Addressing the Misalignment of Object Proposal Evaluation for
Vision-Language Tasks via Semantic Grounding
- URL: http://arxiv.org/abs/2309.00215v1
- Date: Fri, 1 Sep 2023 02:19:41 GMT
- Title: Towards Addressing the Misalignment of Object Proposal Evaluation for
Vision-Language Tasks via Semantic Grounding
- Authors: Joshua Feinglass and Yezhou Yang
- Abstract summary: The performance of object proposals generated for Vision-Language (VL) tasks is currently evaluated across all available annotations.
Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects.
We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation.
- Score: 36.03994217853856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object proposal generation serves as a standard pre-processing step in
Vision-Language (VL) tasks (image captioning, visual question answering, etc.).
The performance of object proposals generated for VL tasks is currently
evaluated across all available annotations, a protocol that we show is
misaligned - higher scores do not necessarily correspond to improved
performance on downstream VL tasks. Our work serves as a study of this
phenomenon and explores the effectiveness of semantic grounding to mitigate its
effects. To this end, we propose evaluating object proposals against only a
subset of available annotations, selected by thresholding an annotation
importance score. Importance of object annotations to VL tasks is quantified by
extracting relevant semantic information from text describing the image. We
show that our method is consistent and demonstrates greatly improved alignment
with annotations selected by image captioning metrics and human annotation when
compared against existing techniques. Lastly, we compare current detectors used
in the Scene Graph Generation (SGG) benchmark as a use case, which serves as an
example of when traditional object proposal evaluation techniques are
misaligned.
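A minimal sketch of the proposed protocol, assuming dict-style annotations, list-of-box proposals, and a placeholder text encoder `embed` (none of these names come from the paper): annotation importance is approximated as the cosine similarity between the annotation's category name and terms extracted from the image caption, annotations below a threshold are dropped, and recall is computed only over the retained subset.
```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def importance(category, caption_terms, embed):
    """Importance of an annotation: highest cosine similarity between its
    category name and terms extracted from the caption. `embed` is a
    placeholder for any text -> vector encoder."""
    c = embed(category)
    best = 0.0
    for term in caption_terms:
        t = embed(term)
        best = max(best, float(np.dot(c, t) /
                               (np.linalg.norm(c) * np.linalg.norm(t) + 1e-9)))
    return best

def grounded_recall(proposals, annotations, caption_terms, embed,
                    tau_importance=0.5, tau_iou=0.5):
    """Recall of `proposals`, measured only against annotations whose
    importance score passes the threshold (the subset-based protocol)."""
    kept = [a for a in annotations
            if importance(a["category"], caption_terms, embed) >= tau_importance]
    if not kept:
        return float("nan")  # nothing in this image is grounded by the caption
    hits = sum(any(iou(a["box"], p) >= tau_iou for p in proposals) for a in kept)
    return hits / len(kept)
```
With `embed` swapped for any off-the-shelf text encoder and `caption_terms` set to, for example, the nouns of a reference caption, `grounded_recall` scores a proposal set only against the annotations that the image text actually grounds.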
Related papers
- Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
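Of the three strategies, the region-level vision-language alignment is the easiest to illustrate in isolation. The sketch below scores region features against class-name text embeddings with cosine similarity; the encoders and temperature are placeholders, not the WSOVOD implementation.
```python
import numpy as np

def classify_regions(region_feats, class_text_feats, temperature=0.01):
    """Open-vocabulary classification of region proposals: cosine similarity
    between region visual features (R, D) and class-name text embeddings
    (C, D), both produced by unspecified, assumed encoders."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = class_text_feats / np.linalg.norm(class_text_feats, axis=1, keepdims=True)
    logits = r @ t.T / temperature                      # (R, C)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)           # softmax over classes
    return probs.argmax(axis=1), probs
```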
arXiv Detail & Related papers (2023-12-19T18:59:53Z)
- Grounding Everything: Emerging Localization Properties in Vision-Language Transformers [51.260510447308306]
We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
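The self-self attention path can be written in a few lines. The PyTorch sketch below lets one projection of the tokens act as query, key, and value (value-value attention being the special case that reuses the pretrained value projection); it illustrates the idea rather than reproducing the released GEM module.
```python
import torch
import torch.nn.functional as F

def self_self_attention(x, proj):
    """Self-self attention over tokens x (N, D): a single projection of the
    tokens attends to itself. Passing the pretrained value projection as
    `proj` recovers value-value attention."""
    v = proj(x)                                                    # (N, D)
    attn = F.softmax(v @ v.transpose(-2, -1) * v.shape[-1] ** -0.5, dim=-1)
    return attn @ v                                                # (N, D)
```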
arXiv Detail & Related papers (2023-12-01T19:06:12Z)
- Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods.
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
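The marginalization step can be illustrated with a simple aggregation: given per-view log-likelihoods log p(caption | view) from any VLM (an assumed input format), evidence is summed over views before ranking candidate annotations.
```python
import numpy as np

def marginalize_over_views(log_likelihoods, view_log_prior=None):
    """Aggregate per-view VLM scores for candidate annotations.
    log_likelihoods: (V, C) array of log p(caption_c | view_v), an assumed
    input format. Returns log sum_v p(view_v) * p(caption_c | view_v)."""
    num_views = log_likelihoods.shape[0]
    if view_log_prior is None:                     # uniform prior over views
        view_log_prior = np.full(num_views, -np.log(num_views))
    joint = log_likelihoods + view_log_prior[:, None]
    m = joint.max(axis=0)                          # log-sum-exp over the view axis
    return m + np.log(np.exp(joint - m).sum(axis=0))

# marginalize_over_views(ll).argmax() picks the caption with the most
# evidence once the viewpoint factor has been summed out.
```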
arXiv Detail & Related papers (2023-11-29T17:54:22Z)
- How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection [25.506346503624894]
We propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge.
The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input.
arXiv Detail & Related papers (2023-08-25T04:54:32Z)
- Incremental Image Labeling via Iterative Refinement [4.7590051176368915]
In particular, the existence of the semantic gap problem leads to a many-to-many mapping between the information extracted from an image and its linguistic description.
This unavoidable bias further leads to poor performance on current computer vision tasks.
We introduce a Knowledge Representation (KR)-based methodology to provide guidelines driving the labeling process.
arXiv Detail & Related papers (2023-04-18T13:37:22Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
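A toy version of the pretext task, under assumed dimensions and masking scheme (neither taken from the paper): mask a subset of the reference patch features, pool the rest, concatenate with the query patch feature, and regress the query's relative offset.
```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Toy pretext head: hide a subset of reference patch features, pool the
    remainder, and regress the query patch's relative (dx, dy) offset.
    Dimensions and masking ratio are illustrative assumptions."""
    def __init__(self, dim=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))    # predicts (dx, dy)

    def forward(self, ref_feats, query_feat):
        # ref_feats: (N, dim) reference patches; query_feat: (dim,) query patch
        n = ref_feats.shape[0]
        hidden = torch.rand(n) < self.mask_ratio        # mask some references
        ref = torch.where(hidden[:, None], self.mask_token.expand(n, -1), ref_feats)
        context = ref.mean(dim=0)                       # pooled reference context
        return self.head(torch.cat([context, query_feat], dim=-1))
```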
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Exploring Conditional Text Generation for Aspect-Based Sentiment Analysis [28.766801337922306]
Aspect-based sentiment analysis (ABSA) is an NLP task that entails processing user-generated reviews to determine (i) the target being evaluated, (ii) the aspect category to which it belongs, and (iii) the sentiment expressed towards the target and aspect pair.
We propose transforming ABSA into an abstract summary-like conditional text generation task that uses targets, aspects, and polarities to generate auxiliary statements.
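The reformulation boils down to verbalizing each (target, aspect, polarity) triple as an auxiliary statement that a conditional text generator is trained to produce. A sketch of such a templating step, with illustrative wording that may differ from the paper's templates:
```python
def absa_to_statement(target, aspect, polarity):
    """Verbalize one (target, aspect category, polarity) triple; the template
    wording here is illustrative, not the paper's exact phrasing."""
    return f"The {aspect} of {target} is {polarity}."

def review_to_generation_target(triples):
    """Concatenate the auxiliary statements for one review into the text the
    conditional generator is trained to produce."""
    return " ".join(absa_to_statement(t, a, p) for t, a, p in triples)

# review_to_generation_target([("battery life", "quality", "positive")])
# -> 'The quality of battery life is positive.'
```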
arXiv Detail & Related papers (2021-10-05T20:08:25Z)
- Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models [2.1320960069210484]
This work studies multimodal learning in context of visually grounded speech (VGS) models.
We introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words.
We show that cross-modal attention helps the model to achieve higher semantic cross-modal retrieval performance.
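Cross-modal retrieval performance of this kind is commonly reported as recall@k over a similarity matrix between images and spoken captions. The snippet below is a generic implementation of that metric, not the paper's full metric suite:
```python
import numpy as np

def recall_at_k(similarity, k=10):
    """similarity: (N, N) image-to-caption scores where the matching spoken
    caption for image i is assumed to sit at column i. Returns the fraction
    of images whose true caption ranks within the top k."""
    order = np.argsort(-similarity, axis=1)             # best candidates first
    ranks = (order == np.arange(len(similarity))[:, None]).argmax(axis=1)
    return float((ranks < k).mean())
```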
arXiv Detail & Related papers (2021-07-05T12:54:05Z)
- Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
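Object-level contrastive pretraining of this kind pairs features of the same object proposal under two augmented views. A standard InfoNCE loss over such pairs, written as a sketch of the design principle rather than the SoCo training pipeline:
```python
import torch
import torch.nn.functional as F

def object_level_infonce(obj_feats_v1, obj_feats_v2, temperature=0.1):
    """InfoNCE over object proposals: row i of both (N, D) inputs is the same
    object seen under two augmented views, so matched rows are positives and
    every other row serves as a negative."""
    z1 = F.normalize(obj_feats_v1, dim=1)
    z2 = F.normalize(obj_feats_v2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (N, N) similarities
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)
```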
arXiv Detail & Related papers (2021-06-04T17:59:52Z)