YORO -- Lightweight End to End Visual Grounding
- URL: http://arxiv.org/abs/2211.07912v1
- Date: Tue, 15 Nov 2022 05:34:40 GMT
- Title: YORO -- Lightweight End to End Visual Grounding
- Authors: Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno
Vasconcelos
- Abstract summary: YORO is a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
It consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object.
YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins.
- Score: 58.17659561501071
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present YORO - a multi-modal transformer encoder-only architecture for the
Visual Grounding (VG) task. This task involves localizing, in an image, an
object referred via natural language. Unlike the recent trend in the literature
of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a
better trade-off between speed and accuracy by embracing a single-stage design
without a CNN backbone. YORO consumes natural language queries, image patches,
and learnable detection tokens and predicts coordinates of the referred object,
using a single transformer encoder. To assist the alignment between text and
visual objects, a novel patch-text alignment loss is proposed. Extensive
experiments are conducted on 5 different datasets with ablations on
architecture design choices. YORO is shown to support real-time inference and
outperform all approaches in this class (single-stage methods) by large
margins. It is also the fastest VG model and achieves the best speed/accuracy
trade-off in the literature.
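To make the architecture description above concrete, the sketch below (PyTorch-style, not the authors' released code) shows one way an encoder-only grounding model could be wired: text tokens, linearly embedded image patches, and learnable detection tokens are concatenated into a single sequence, passed through one transformer encoder, and the detection-token outputs are mapped to normalized box coordinates. All module names, dimensions, and the alignment-loss formulation are illustrative assumptions; the abstract does not specify the actual patch-text alignment loss.

```python
# Hypothetical sketch in the spirit of YORO's description: text tokens + image
# patches + learnable detection tokens go through a single transformer encoder
# (no CNN backbone, no decoder), and a head predicts the referred box.
# Module names, sizes, and the alignment loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderOnlyGrounder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=8,
                 n_layers=6, patch_size=16, img_size=224, n_det_tokens=1):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Linear (ViT-style) patch embedding instead of a CNN backbone.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        self.det_tokens = nn.Parameter(torch.randn(n_det_tokens, d_model))
        self.pos_embed = nn.Parameter(torch.randn(n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, token_ids, image):
        txt = self.text_embed(token_ids)                               # (B, T, D)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)   # (B, P, D)
        patches = patches + self.pos_embed
        det = self.det_tokens.unsqueeze(0).expand(image.size(0), -1, -1)
        x = torch.cat([det, txt, patches], dim=1)   # one multi-modal sequence
        x = self.encoder(x)                         # single encoder, no decoder
        det_out = x[:, :det.size(1)]                # read out the detection tokens
        boxes = self.box_head(det_out).sigmoid()    # (B, n_det_tokens, 4)
        return boxes, x


def patch_text_alignment_loss(patch_feats, text_feats, patch_in_box):
    """Hedged stand-in for a patch-text alignment loss: encourage patches that
    fall inside the referred box to be similar to the pooled text feature."""
    text_vec = F.normalize(text_feats.mean(dim=1), dim=-1)      # (B, D)
    patches = F.normalize(patch_feats, dim=-1)                  # (B, P, D)
    sim = torch.einsum("bpd,bd->bp", patches, text_vec)         # cosine similarity
    return F.binary_cross_entropy_with_logits(sim, patch_in_box.float())
```

A forward pass on dummy inputs, e.g. `EncoderOnlyGrounder()(torch.randint(0, 30522, (2, 12)), torch.randn(2, 3, 224, 224))`, returns one (cx, cy, w, h) box per detection token plus the encoder outputs from which patch and text features can be sliced for the alignment loss.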
Related papers
- Real-Time Text Detection with Similar Mask in Traffic, Industrial, and Natural Scenes [31.180352896153682]
We propose an efficient multi-scene text detector that contains an effective text representation similar mask (SM) and a feature correction module (FCM).
To validate SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets.
arXiv Detail & Related papers (2024-11-05T04:08:59Z)
- FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting [14.054151352916296]
This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer-Decoder architecture.
FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts.
Our results indicate that FastTextSpotter achieves superior accuracy in detecting and recognizing multilingual scene text.
arXiv Detail & Related papers (2024-08-27T12:28:41Z)
- IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition [5.525052547053668]
Scene text recognition has attracted increasing attention due to its diverse applications.
Most state-of-the-art methods adopt an encoder-decoder framework with an attention mechanism, autoregressively generating text from left to right.
We propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy.
arXiv Detail & Related papers (2023-12-19T08:03:19Z)
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [94.35116535588332]
Transformer-based methods, which predict polygon points or Bezier curve control points to localize texts, are quite popular in scene text detection.
However, the point label form used implies the human reading order, which affects the robustness of the Transformer model.
We propose DPText-DETR, which directly uses point coordinates as queries and dynamically updates them between decoder layers.
arXiv Detail & Related papers (2022-07-10T15:45:16Z)
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance on the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running in real time, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture (a minimal sketch of the matching loss follows the list).
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
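Because the DETR entry above turns on its set-based global loss, here is a minimal, hedged illustration of that idea: predictions and ground-truth boxes are matched one-to-one with the Hungarian algorithm, and the loss is computed only on matched pairs. The cost terms (negative class probability plus L1 box distance) are simplified assumptions; DETR's full loss also assigns unmatched predictions to a no-object class and adds a generalized-IoU term.

```python
# Simplified sketch of DETR-style set prediction matching (not DETR's exact loss):
# each ground-truth box is assigned to exactly one prediction via the Hungarian
# algorithm, then classification and box losses are taken over matched pairs.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def set_prediction_loss(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (Q, C), pred_boxes: (Q, 4); gt_labels: (G,), gt_boxes: (G, 4)."""
    prob = pred_logits.softmax(-1)                          # (Q, C)
    # Matching cost: low when class probability is high and boxes are close.
    cost_class = -prob[:, gt_labels]                        # (Q, G)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)       # (Q, G) L1 distance
    cost = (cost_class + cost_box).detach().cpu()
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())  # Hungarian matching
    pred_idx = torch.as_tensor(pred_idx)
    gt_idx = torch.as_tensor(gt_idx)
    cls_loss = F.cross_entropy(pred_logits[pred_idx], gt_labels[gt_idx])
    box_loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return cls_loss + box_loss
```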
This list is automatically generated from the titles and abstracts of the papers on this site.