Improving One-stage Visual Grounding by Recursive Sub-query Construction
- URL: http://arxiv.org/abs/2008.01059v1
- Date: Mon, 3 Aug 2020 17:43:30 GMT
- Title: Improving One-stage Visual Grounding by Recursive Sub-query Construction
- Authors: Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
- Abstract summary: We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries.
We show our new one-stage method obtains 5.0%, 4.5%, 7.5%, 12.8% absolute improvements over the state-of-the-art one-stage baseline.
- Score: 102.47477888060801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We improve one-stage visual grounding by addressing current limitations on
grounding long and complex queries. Existing one-stage methods encode the
entire language query as a single sentence embedding vector, e.g., taking the
embedding from BERT or the hidden state from LSTM. This single vector
representation is prone to overlooking the detailed descriptions in the query.
To address this query modeling deficiency, we propose a recursive sub-query
construction framework, which reasons between image and query for multiple
rounds and reduces the referring ambiguity step by step. We show our new
one-stage method obtains 5.0%, 4.5%, 7.5%, 12.8% absolute improvements over the
state-of-the-art one-stage baseline on ReferItGame, RefCOCO, RefCOCO+, and
RefCOCOg, respectively. In particular, superior performance on longer and more
complex queries validates the effectiveness of our query modeling.
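As a rough illustration of the multi-round reasoning the abstract describes, here is a minimal PyTorch-style sketch: each round constructs a sub-query by attending over the word embeddings given the current text-visual context, then uses it to modulate the visual feature map. The module names and the FiLM-style fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveSubQueryGrounder(nn.Module):
    """Sketch of recursive sub-query construction (illustrative only)."""

    def __init__(self, dim=256, rounds=3):
        super().__init__()
        self.rounds = rounds
        self.word_scorer = nn.Linear(2 * dim, 1)  # scores each word given the context
        self.film = nn.Linear(dim, 2 * dim)       # sub-query -> (scale, shift)
        self.head = nn.Conv2d(dim, 5, 1)          # per-location box (4) + confidence (1)

    def forward(self, words, vis):
        # words: (B, L, D) per-word embeddings, e.g. from BERT
        # vis:   (B, D, H, W) feature map from a one-stage detector backbone
        B, L, D = words.shape
        context = vis.mean(dim=(2, 3))            # (B, D) running text-visual context
        for _ in range(self.rounds):
            # 1) construct a sub-query: attend over words given the context
            pair = torch.cat([words, context.unsqueeze(1).expand(B, L, D)], dim=-1)
            attn = self.word_scorer(pair).squeeze(-1).softmax(dim=-1)      # (B, L)
            sub_query = (attn.unsqueeze(-1) * words).sum(dim=1)            # (B, D)
            # 2) modulate visual features with the sub-query (FiLM-style)
            scale, shift = self.film(sub_query).chunk(2, dim=-1)
            vis = F.relu(vis * (1 + scale[..., None, None]) + shift[..., None, None])
            # 3) refine the context for the next round, reducing ambiguity step by step
            context = vis.mean(dim=(2, 3))
        return self.head(vis)                     # (B, 5, H, W) grounding predictions

preds = RecursiveSubQueryGrounder()(torch.randn(2, 12, 256), torch.randn(2, 256, 8, 8))
print(preds.shape)  # torch.Size([2, 5, 8, 8])
```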
Related papers
- TreeHop: Generate and Filter Next Query Embeddings Efficiently for Multi-hop Question Answering [27.37434534716611]
TreeHop is an embedding-level framework for multi-hop question answering.
TreeHop dynamically updates query embeddings by fusing semantic information from prior queries.
TreeHop is a faster and more cost-effective solution for deployment in a range of knowledge-intensive applications.
arXiv Detail & Related papers (2025-04-28T01:56:31Z)
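The TreeHop entry above describes an embedding-level hop update. A minimal sketch of that idea in PyTorch follows; the gated fusion rule is an illustrative stand-in, not the paper's exact update.

```python
import torch
import torch.nn as nn

class EmbeddingHopUpdater(nn.Module):
    """Illustrative gated update in the spirit of TreeHop's embedding-level
    hops (the gate/fusion form here is an assumption, not the paper's rule)."""

    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query_emb, passage_emb):
        # query_emb, passage_emb: (B, D) embeddings of the current query
        # and of a passage retrieved for it
        pair = torch.cat([query_emb, passage_emb], dim=-1)
        g = torch.sigmoid(self.gate(pair))        # how much new evidence to admit
        candidate = torch.tanh(self.fuse(pair))   # fused semantic content
        # next-hop query keeps part of the old query, mixes in the evidence
        return g * candidate + (1 - g) * query_emb

updater = EmbeddingHopUpdater()
q_next = updater(torch.randn(4, 768), torch.randn(4, 768))  # query embedding for hop 2
```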
- MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping [1.1557852082644071]
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of examples.
We propose a new Few-shot Semantic Segmentation framework based on the transformer architecture.
Our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies.
arXiv Detail & Related papers (2024-09-17T16:14:03Z)
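A hedged sketch of the prototyping idea behind few-shot segmentation models like MSDNet: a masked-average prototype from the support image is correlated with query features at several scales. The transformer-guided decoder itself is more involved; this only illustrates multi-scale prototype matching.

```python
import torch
import torch.nn.functional as F

def masked_prototype(support_feat, support_mask):
    """Masked average pooling: one prototype vector per support image.
    support_feat: (B, D, H, W); support_mask: (B, 1, H, W) binary."""
    m = F.interpolate(support_mask, size=support_feat.shape[-2:], mode="nearest")
    return (support_feat * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp(min=1.0)

def multiscale_similarity(query_feats, proto):
    """Cosine similarity between prototype and query features at each scale;
    maps are upsampled to the finest scale and averaged into a fused prior."""
    target = query_feats[0].shape[-2:]
    maps = []
    for f in query_feats:                      # f: (B, D, h, w), fine to coarse
        sim = F.cosine_similarity(f, proto[..., None, None], dim=1, eps=1e-6)
        maps.append(F.interpolate(sim.unsqueeze(1), size=target, mode="bilinear",
                                  align_corners=False))
    return torch.stack(maps).mean(dim=0)       # (B, 1, H, W)

proto = masked_prototype(torch.randn(2, 64, 32, 32), torch.ones(2, 1, 128, 128))
prior = multiscale_similarity([torch.randn(2, 64, 32, 32),
                               torch.randn(2, 64, 16, 16)], proto)
```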
- Complete Approximations of Incomplete Queries [0.9626666671366836]
We investigate whether a query can be fully answered, as if all data were available.
If not, we explore reformulating the query into either Maximal Complete Specializations (MCSs) or the Minimal Complete Generalization (MCG).
arXiv Detail & Related papers (2024-07-30T16:13:42Z)
- LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks.
In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries.
We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
- See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z)
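The see/say/segment behavior described above is essentially a cascade, which can be sketched as control flow; `exists_fn` and `segment_fn` below are hypothetical stand-ins for the LMM's detection and segmentation heads.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GroundingResult:
    present: bool
    message: str
    mask: Optional[np.ndarray]  # (H, W) binary mask when the object exists

def see_say_segment(image, query, exists_fn, segment_fn):
    """Cascade sketch of the see/say/segment behavior (illustrative only)."""
    if not exists_fn(image, query):                                          # "see"
        return GroundingResult(False, f"No {query!r} found in the image.", None)  # "say"
    return GroundingResult(True, f"Segmented {query!r}.",
                           segment_fn(image, query))                         # "segment"

# usage with dummy components
img = np.zeros((64, 64, 3))
res = see_say_segment(img, "red mug",
                      exists_fn=lambda im, q: False,
                      segment_fn=lambda im, q: np.ones(im.shape[:2], dtype=bool))
print(res.message)  # No 'red mug' found in the image.
```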
- Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the combinatorial entity pair distribution.
We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
- Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES.
Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query.
By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
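The iterative expansion in ALLIES reads as a beam search over generated queries. A self-contained sketch follows, where `generate_fn` and `score_fn` are hypothetical stand-ins for the LLM's query-generation and answer-confidence calls.

```python
def allies_style_beam_search(question, generate_fn, score_fn, beam_width=3, depth=2):
    """Beam search over LLM-generated follow-up queries, in the spirit of
    ALLIES (illustrative, not the paper's exact algorithm)."""
    beam = [(score_fn(question), question)]
    for _ in range(depth):
        candidates = list(beam)
        for _, query in beam:
            for follow_up in generate_fn(query):       # expand the query's scope
                candidates.append((score_fn(follow_up), follow_up))
        # keep only the highest-confidence queries for the next round
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam  # best (score, query) pairs found

# usage with dummy components standing in for LLM calls
best = allies_style_beam_search(
    "Who founded the company that makes the iPhone?",
    generate_fn=lambda q: [q + " (founder?)", q + " (company name?)"],
    score_fn=lambda q: -len(q),   # dummy confidence for illustration
)
```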
- Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)
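Fusing both modalities in a single visual-lingual encoder, as the entry above describes, can be sketched as joint self-attention over concatenated visual and text tokens; the sizes and modality embedding below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VisualLingualEncoder(nn.Module):
    """Sketch of a joint visual-lingual encoder (illustrative only)."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.modality = nn.Embedding(2, dim)   # tells tokens apart: 0=visual, 1=text

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, D) flattened image features; txt_tokens: (B, Nt, D)
        vis = vis_tokens + self.modality.weight[0]
        txt = txt_tokens + self.modality.weight[1]
        fused = self.encoder(torch.cat([vis, txt], dim=1))   # joint self-attention
        return fused[:, : vis_tokens.size(1)], fused[:, vis_tokens.size(1):]

vis_out, txt_out = VisualLingualEncoder()(torch.randn(2, 49, 256), torch.randn(2, 10, 256))
```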
- Scale-Localized Abstract Reasoning [79.00011351374869]
We consider the abstract relational reasoning task, which is commonly used as an intelligence test.
Since some patterns have spatial rationales, while others are only semantic, we propose a multi-scale architecture that processes each query in multiple resolutions.
We show that different rules are indeed solved at different resolutions, and that a combined multi-scale approach outperforms the existing state of the art on this task on all benchmarks by 5-54%.
arXiv Detail & Related papers (2020-09-20T10:37:29Z)
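A sketch of the multi-scale idea above: the same query panel is processed at several resolutions and the per-scale scores are combined; the branch design is illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleReasoner(nn.Module):
    """Routes one query through several resolutions and combines per-scale
    scores (an illustration of the multi-scale idea, not the exact model)."""

    def __init__(self, channels=8, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
            for _ in scales
        )

    def forward(self, x):
        # x: (B, C, H, W) a rendered query panel; spatial rules tend to need
        # fine scales, purely semantic rules coarse ones
        scores = [branch(F.interpolate(x, scale_factor=s, mode="bilinear",
                                       align_corners=False))
                  for s, branch in zip(self.scales, self.branches)]
        return torch.stack(scores, dim=0).sum(dim=0)   # (B, 1) combined score

score = MultiScaleReasoner()(torch.randn(2, 8, 64, 64))
```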
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
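QuReTeC's formulation, binary classification of each conversation-history term as relevant or not, can be sketched as follows; a small transformer encoder stands in for the bidirectional transformer (BERT) used in the paper, and the tokenization is simplified.

```python
import torch
import torch.nn as nn

class TermClassifier(nn.Module):
    """Per-token relevance classifier in the spirit of QuReTeC (illustrative;
    a tiny encoder stands in for BERT)."""

    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classify = nn.Linear(dim, 2)     # per token: 0 = drop, 1 = add to query

    def forward(self, token_ids):
        # token_ids: (B, L) conversation history + current turn
        return self.classify(self.encoder(self.embed(token_ids)))  # (B, L, 2)

def resolve_query(history, current, labels):
    """Append history terms predicted relevant to the current question."""
    return current + [t for t, y in zip(history, labels) if y == 1]

logits = TermClassifier()(torch.randint(0, 30522, (1, 16)))  # dummy forward pass
print(resolve_query(["einstein", "theory"], ["when", "was", "he", "born"], [1, 0]))
# ['when', 'was', 'he', 'born', 'einstein']
```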
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.