Visual Grounding with Transformers
- URL: http://arxiv.org/abs/2105.04281v1
- Date: Mon, 10 May 2021 11:46:12 GMT
- Title: Visual Grounding with Transformers
- Authors: Ye Du, Zehua Fu, Qingjie Liu, Yunhong Wang
- Abstract summary: Our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models.
Our method outperforms state-of-the-art proposal-free approaches by a considerable margin on five benchmarks.
- Score: 43.40192909920495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a transformer-based approach for visual grounding.
Unlike previous proposal-and-rank frameworks that rely heavily on pretrained
object detectors or proposal-free frameworks that upgrade an off-the-shelf
one-stage detector by fusing textual embeddings, our approach is built on top
of a transformer encoder-decoder and is independent of any pretrained detectors
or word embedding models. Termed VGTR -- Visual Grounding with TRansformers,
our approach is designed to learn semantic-discriminative visual features under
the guidance of the textual description without harming their localization
ability. This information flow gives VGTR a strong capability for capturing the
context-level semantics of both the vision and language modalities, enabling it
to aggregate the visual clues implied by the description and locate the
object instance of interest. Experiments show that our method outperforms
state-of-the-art proposal-free approaches by a considerable margin on five
benchmarks while maintaining fast inference speed.
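Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch-style sketch of the general recipe it outlines: visual and text tokens (with text embeddings learned from scratch rather than taken from a pretrained word-embedding model) are fed jointly through a transformer encoder-decoder, and a grounding query decodes a single box. All module names, dimensions, and the box head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GroundingTransformerSketch(nn.Module):
    """Hypothetical encoder-decoder grounding model in the spirit of VGTR."""

    def __init__(self, d_model=256, vocab_size=10000, num_queries=1):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # trained from scratch
        self.visual_proj = nn.Linear(2048, d_model)          # project backbone features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.query = nn.Parameter(torch.randn(num_queries, d_model))
        self.box_head = nn.Sequential(                       # (cx, cy, w, h) in [0, 1]
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, visual_feats, token_ids):
        # visual_feats: (B, N_vis, 2048); token_ids: (B, N_txt)
        vis = self.visual_proj(visual_feats)
        txt = self.text_embed(token_ids)
        src = torch.cat([vis, txt], dim=1)  # text guides vision via shared attention
        tgt = self.query.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        hs = self.transformer(src, tgt)     # decode the grounding query
        return self.box_head(hs)            # (B, num_queries, 4)

model = GroundingTransformerSketch()
boxes = model(torch.randn(2, 49, 2048), torch.randint(0, 10000, (2, 12)))
print(boxes.shape)  # torch.Size([2, 1, 4])
```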
Related papers
- End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories.
Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories.
We propose an open-vocabulary relationship detection method that leverages the rich semantic knowledge of CLIP to discover novel relationships; a toy scoring sketch follows.
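As a rough illustration of what "leveraging the semantic knowledge of CLIP" can look like (not the paper's actual pipeline), this sketch zero-shot-scores a subject-object region crop against free-form relationship prompts using Hugging Face's CLIPModel; the prompt template, relation list, and file name are invented for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Free-form relationship prompts; none need to appear in training annotations.
relations = ["riding", "feeding", "standing next to", "chasing"]
prompts = [f"a photo of a person {r} a horse" for r in relations]

pair_crop = Image.open("pair_region.jpg")  # hypothetical union crop of a subject-object pair
inputs = processor(text=prompts, images=pair_crop, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
# logits_per_image: similarity of the crop to each relationship prompt
probs = out.logits_per_image.softmax(dim=-1)
print(relations[probs.argmax().item()])
```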
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer [63.141027246418]
We propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency.
We provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector through a simple alignment.
Experimental results demonstrate that the proposed approach outperforms existing real-time open-vocabulary detectors on the standard zero-shot LVIS benchmark.
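The summary does not spell the alignment out, so here is a toy version of one plausible form of VLM-to-detector knowledge transfer: pulling the detector's class embeddings toward frozen VLM text embeddings. The cosine formulation and shapes are assumptions, not OVLW-DETR's recipe.

```python
import torch
import torch.nn.functional as F

def alignment_loss(det_class_embeds, vlm_text_embeds):
    """Toy alignment: pull detector class embeddings toward frozen VLM text
    embeddings for the same classes (cosine form and shapes are assumptions)."""
    det = F.normalize(det_class_embeds, dim=-1)
    txt = F.normalize(vlm_text_embeds.detach(), dim=-1)  # the VLM side stays frozen
    return (1.0 - (det * txt).sum(dim=-1)).mean()

# 80 classes, 512-dim embeddings, purely illustrative
loss = alignment_loss(torch.randn(80, 512, requires_grad=True), torch.randn(80, 512))
loss.backward()
```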
arXiv Detail & Related papers (2024-07-15T12:15:27Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
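To make "learnable text prompts" and dense pixel-wise alignment concrete, here is a small hypothetical sketch in the style of CoOp-like prompt tuning plus per-pixel text-image matching; the context length, dimensions, and the mean-pooling stand-in for a real text encoder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePrompts(nn.Module):
    """CoOp-style learnable context tokens shared across classes (sizes assumed)."""
    def __init__(self, n_ctx=8, d_model=512, n_classes=65):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, d_model) * 0.02)
        self.cls_embed = nn.Embedding(n_classes, d_model)  # stand-in for class-name tokens

    def forward(self):
        # (n_classes, n_ctx + 1, d_model): shared context + one class token each
        ctx = self.ctx.unsqueeze(0).expand(self.cls_embed.num_embeddings, -1, -1)
        return torch.cat([ctx, self.cls_embed.weight.unsqueeze(1)], dim=1)

def dense_alignment_logits(pixel_feats, class_embeds):
    # pixel_feats: (B, D, H, W); class_embeds: (C, D) -> per-pixel class scores
    p = F.normalize(pixel_feats, dim=1)
    c = F.normalize(class_embeds, dim=-1)
    return torch.einsum("bdhw,cd->bchw", p, c)

prompts = LearnablePrompts()
# Mean-pooling the prompt tokens stands in for running them through a text encoder.
logits = dense_alignment_logits(torch.randn(2, 512, 32, 32), prompts().mean(dim=1))
print(logits.shape)  # torch.Size([2, 65, 32, 32])
```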
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-text perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks.
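As a sketch of what global context modeling over video with an encoder can look like (not STCAT's actual design), the snippet below flattens per-frame feature maps into one spatio-temporal token sequence so that a plain transformer encoder can attend across all frames; shapes are illustrative.

```python
import torch
import torch.nn as nn

# Flatten per-frame feature maps into one spatio-temporal token sequence so a
# single encoder models global (cross-frame) context. Shapes are illustrative.
B, T, C, H, W = 2, 8, 256, 7, 7
frame_feats = torch.randn(B, T, C, H, W)
tokens = frame_feats.flatten(3).permute(0, 1, 3, 2).reshape(B, T * H * W, C)

layer = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
context = encoder(tokens)  # (B, T*H*W, C): every token attends across all frames
print(context.shape)
```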
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources, iteratively updating the prompts and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
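A minimal sketch of the first ingredient, classifying box proposals with a classifier generated from a text encoder, might look like the following; the embedding dimension and temperature are assumptions, and in practice the text embeddings would come from the pre-trained VLM's text encoder.

```python
import torch
import torch.nn.functional as F

def classify_proposals(proposal_feats, text_class_embeds, temperature=0.01):
    """Score each box proposal against class embeddings from a VLM text
    encoder, so new classes only need new text prompts (sizes assumed)."""
    p = F.normalize(proposal_feats, dim=-1)      # (N, D) region embeddings
    t = F.normalize(text_class_embeds, dim=-1)   # (C, D) one row per class name
    return (p @ t.T) / temperature               # (N, C) class logits

logits = classify_proposals(torch.randn(100, 512), torch.randn(20, 512))
print(logits.argmax(dim=-1).shape)  # one predicted class per proposal
```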
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
- Open-Vocabulary DETR with Conditional Matching [86.1530128487077]
OV-DETR is an open-vocabulary detector based on DETR.
It can detect any object given its class name or an exemplar image.
It achieves non-trivial improvements over the current state of the art.
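One way to picture conditional matching (a sketch under assumed sizes, not OV-DETR's code) is DETR-style object queries shifted by an embedding of the conditioning class name or exemplar image, so every matched prediction is tied to that class.

```python
import torch
import torch.nn as nn

class ConditionalQueries(nn.Module):
    """Object queries shifted by a conditioning embedding (class name or
    exemplar image); sizes are illustrative assumptions."""
    def __init__(self, num_queries=100, d_model=256, d_cond=512):
        super().__init__()
        self.base = nn.Parameter(torch.randn(num_queries, d_model))
        self.cond_proj = nn.Linear(d_cond, d_model)

    def forward(self, cond_embed):
        # cond_embed: (B, d_cond) text or exemplar-image embedding
        cond = self.cond_proj(cond_embed).unsqueeze(1)  # (B, 1, d_model)
        return self.base.unsqueeze(0) + cond            # (B, num_queries, d_model)

queries = ConditionalQueries()(torch.randn(2, 512))
print(queries.shape)  # torch.Size([2, 100, 256])
```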
arXiv Detail & Related papers (2022-03-22T16:54:52Z)
- Transferring Semantic Knowledge Into Language Encoders [6.85316573653194]
We introduce semantic form mid-tuning, an approach for transferring semantic knowledge from semantic meaning representations into language encoders.
We show that this alignment can be learned implicitly via classification or directly via triplet loss.
Our method yields language encoders that demonstrate improved predictive performance across inference, reading comprehension, textual similarity, and other semantic tasks.
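The triplet variant can be pictured as follows: a sentence encoding is pulled toward the embedding of its own semantic (meaning) representation and pushed away from a mismatched one. The real encoders are stubbed out with random tensors here; the margin, distance, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Align sentence encodings with embeddings of their semantic forms via a
# triplet loss (one of the two training signals the paper mentions).
triplet = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1 - F.cosine_similarity(a, b), margin=0.2
)

sent = torch.randn(16, 768, requires_grad=True)  # language-encoder outputs
pos = torch.randn(16, 768)   # embeddings of each sentence's own semantic form
neg = torch.randn(16, 768)   # embeddings of mismatched semantic forms

loss = triplet(sent, pos, neg)
loss.backward()
```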
arXiv Detail & Related papers (2021-10-14T14:11:12Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers while achieving higher performance.
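A minimal sketch of this idea, assuming TransVG-style inputs: concatenate visual tokens, text tokens, and a learnable [REG] token, fuse them with a plain transformer encoder stack, and regress the box from the [REG] output. Sizes are illustrative.

```python
import torch
import torch.nn as nn

d = 256
vis = torch.randn(2, 400, d)  # visual tokens from the backbone
txt = torch.randn(2, 20, d)   # text tokens from the language branch
reg = nn.Parameter(torch.randn(1, 1, d)).expand(2, -1, -1)  # learnable [REG] token

# A plain encoder stack fuses the concatenated tokens; no bespoke fusion module.
layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
fusion = nn.TransformerEncoder(layer, num_layers=6)
fused = fusion(torch.cat([reg, vis, txt], dim=1))

box = nn.Linear(d, 4)(fused[:, 0])  # regress the box from the [REG] output
print(box.shape)  # torch.Size([2, 4])
```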
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.