Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
- URL: http://arxiv.org/abs/2404.01151v1
- Date: Mon, 1 Apr 2024 14:53:36 GMT
- Title: Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
- Authors: Jialou Wang, Manli Zhu, Yulei Li, Honglei Li, Longzhi Yang, Wai Lok Woo
- Abstract summary: We introduce an advanced approach for fine-grained object visual key field detection.
First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images.
Next, we use Vision Studio to extract semantic object descriptions.
Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map.
- Score: 5.891295920078768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification and interaction with specific parts of an object, it significantly improves the system's ability to provide contextually relevant and spatially accurate responses, crucial for applications in dynamic environments like robotics and augmented reality. However, traditional systems face challenges in accurately mapping objects within images to generate nuanced and spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges by introducing an advanced approach for fine-grained object visual key field detection. First, we use the segment anything model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common sense knowledge, bridging the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
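The three-stage pipeline described in the abstract can be sketched as follows. The paper does not publish code here, so `segment_objects`, `describe_objects`, and the word-overlap heuristic in `locate_key_field` are hypothetical stand-ins for SAM, Vision Studio, and GPT-4's common-sense reasoning, respectively:

```python
import re

def _tokens(text):
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def segment_objects(image):
    """Stage 1 stand-in for SAM: one spatial map per object.
    Bounding boxes here for brevity; SAM produces pixel masks."""
    return {"mug": (10, 20, 110, 220), "table": (0, 200, 400, 380)}

def describe_objects(image, spatial_maps):
    """Stage 2 stand-in for Vision Studio: a semantic description
    per segmented object."""
    return {"mug": "a ceramic mug with a handle",
            "table": "a wooden dining table"}

def locate_key_field(question, spatial_maps, descriptions):
    """Stage 3 stand-in for GPT-4: pick the object whose description
    best overlaps the question and return its spatial map as the
    visual 'key field'. The real system replaces this word-overlap
    heuristic with GPT-4's reasoning."""
    q = _tokens(question)
    best = max(descriptions, key=lambda k: len(q & _tokens(descriptions[k])))
    return best, spatial_maps[best]

def detect2interact(image, question):
    spatial_maps = segment_objects(image)
    descriptions = describe_objects(image, spatial_maps)
    return locate_key_field(question, spatial_maps, descriptions)

obj, field = detect2interact(image=None, question="Where is the handle of the mug?")
print(obj, field)  # mug (10, 20, 110, 220)
```

The design point is the division of labor: the segmenter supplies *where*, the captioner supplies *what*, and the language model bridges the two for a given question.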
Related papers
- Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers [62.232809030044116]
We introduce the use of object identifiers to freely reference objects during a conversation.
We propose a two-stage alignment method, which involves learning an attribute-aware token and a relation-aware token for each object.
Experiments conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D showcase the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
The challenge is that the object categories suitable for a task are too diverse to be covered by the closed vocabulary of traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
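The affordance idea can be illustrated with a minimal sketch; the attribute names and the mapping below are invented for illustration and are not from the paper:

```python
# Hypothetical sketch: detect task-relevant objects via shared affordance
# attributes instead of a closed class vocabulary.
AFFORDANCES = {
    "cup": {"graspable", "holds-liquid"},
    "bowl": {"graspable", "holds-liquid"},
    "book": {"graspable", "flat-surface"},
}

def objects_for_task(required):
    """Return objects whose affordance attributes cover the task's needs."""
    return sorted(o for o, attrs in AFFORDANCES.items() if required <= attrs)

print(objects_for_task({"holds-liquid"}))  # ['bowl', 'cup']
```

Different objects (cup, bowl) serve the same task because they share the enabling attribute, which is exactly what a closed object vocabulary cannot express.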
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
- Exploring Predicate Visual Context in Detecting Human-Object Interactions [44.937383506126274]
We study how best to re-introduce image features via cross-attention.
Our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks.
arXiv Detail & Related papers (2023-08-11T15:57:45Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
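A minimal sketch of the PCA-localization idea, assuming a per-location feature map; this is an illustration of the general technique, not the paper's exact procedure:

```python
import numpy as np

def pca_localize(feat):
    """Project each spatial location's feature onto the first principal
    component of the feature map; positive responses form the object mask."""
    h, w, c = feat.shape
    x = feat.reshape(-1, c).astype(float)
    x -= x.mean(axis=0, keepdims=True)           # center the features
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    scores = x @ vt[0]                           # first principal component
    # The sign of a principal direction is arbitrary; orient it so the
    # strongest-responding location comes out positive.
    if scores[np.argmax(np.abs(scores))] < 0:
        scores = -scores
    return (scores > 0).reshape(h, w)

# Toy feature map: a 3x3 "object" patch with distinct features.
feat = np.zeros((8, 8, 4))
feat[2:5, 2:5, 0] = 1.0
mask = pca_localize(feat)
print(mask.sum())  # 9
```

Because object features deviate most from the mean, the leading principal direction separates them from the background without any labels.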
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images.
We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z)
- FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection [4.534713782093219]
A novel end-to-end transformer-based framework (FGAHOI) is proposed for human-object interaction detection.
FGAHOI comprises three dedicated components: multi-scale sampling (MSS), hierarchical spatial-aware merging (HSAM), and a task-aware merging mechanism (TAM).
arXiv Detail & Related papers (2023-01-08T03:53:50Z)
- DQnet: Cross-Model Detail Querying for Camouflaged Object Detection [54.82390534024954]
A convolutional neural network (CNN) for camouflaged object detection tends to activate local discriminative regions while ignoring the complete object extent.
In this paper, we argue that this partial activation is caused by the intrinsic characteristics of CNNs.
To obtain feature maps that activate the full object extent, a novel framework termed the Cross-Model Detail Querying network (DQnet) is proposed.
arXiv Detail & Related papers (2022-12-16T06:23:58Z)
- Contrastive Object Detection Using Knowledge Graph Embeddings [72.17159795485915]
We compare the error statistics of the class embeddings learned from a one-hot approach with semantically structured embeddings from natural language processing or knowledge graphs.
We propose a knowledge-embedded design for keypoint-based and transformer-based object detection architectures.
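The contrast between one-hot class vectors and semantically structured embeddings can be shown with a toy example; the "semantic" vectors below are invented for illustration, not real word or knowledge-graph embeddings:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot class vectors: every pair of distinct classes is equally dissimilar.
one_hot = {"cat": np.eye(3)[0], "dog": np.eye(3)[1], "car": np.eye(3)[2]}

# Toy "semantic" embeddings (illustrative values only): related classes
# lie close together in the embedding space.
semantic = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

print(cosine(one_hot["cat"], one_hot["dog"]))    # 0.0 — no structure
print(cosine(semantic["cat"], semantic["dog"]))  # high — both animals
print(cosine(semantic["cat"], semantic["car"]))  # low  — unrelated
```

With one-hot targets a detector is penalized equally for confusing cat with dog or cat with car; structured embeddings make the first error cheaper, which is the intuition behind the paper's error-statistics comparison.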
arXiv Detail & Related papers (2021-12-21T17:10:21Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Visual Object Recognition in Indoor Environments Using Topologically Persistent Features [2.2344764434954256]
Object recognition in unseen indoor environments remains a challenging problem for visual perception of mobile robots.
We propose the use of topologically persistent features, which rely on the objects' shape information, to address this challenge.
We implement the proposed method on a real-world robot to demonstrate its usefulness.
arXiv Detail & Related papers (2020-10-07T06:04:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.